Using Open-Source Reasoning Large Language Models for Radiotherapy Structure Name Harmonization

Authors: Claus Belka, Stefanie Corradini, Christopher Kurz, Guillaume Landry, Matteo Maspero, Adrian Thummerer, Erik van der Bijl

Affiliations: Department of Radiation Oncology, LMU University Hospital, LMU Munich; Radboud University Medical Center; UMC Utrecht

Abstract:

Purpose: To automatically harmonize non-standardized organ-at-risk (OAR) structure names from multilingual, multi-institutional radiotherapy datasets using state-of-the-art open-source reasoning large language models (LLMs), thereby improving the explainability of model decisions.
Methods: As part of the upcoming SynthRAD2025 deep learning challenge, radiotherapy structure sets from 90 head-and-neck, lung, and abdominal cancer patients were collected from three university medical centers (UMC Utrecht, Netherlands; Radboud UMC, Netherlands; LMU Klinikum Munich, Germany). The dataset was filtered to exclude target structures, structures not covered by the AAPM TG-263 guideline (e.g., immobilization devices), and duplicates, resulting in 163 unique, non-standardized OAR names in German, English, or Dutch. A locally deployed, state-of-the-art reasoning LLM (DeepSeek-R1-Distill-Llama-70B) was prompted to rename the structures according to the TG-263 guideline. The prompt included general, non-center-specific instructions, a list of standardized TG-263 structure names for the relevant anatomical regions, a few example outputs, and the original institution-assigned names. For evaluation, the LLM-generated names were compared against ground-truth labels assigned by medical physicists, as well as against the names generated by a conventional non-reasoning LLM with the same number of parameters (Llama-3.1-70B-Instruct). Accuracy was calculated as the percentage of correctly renamed structures. In addition, the reasoning outputs were analyzed for failure cases.
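
As an illustration only, the following minimal Python sketch shows how such a prompt-and-evaluate workflow might look when querying a locally deployed model through an OpenAI-compatible endpoint (e.g., a vLLM or Ollama server). The endpoint URL, the excerpt of TG-263 names, the few-shot examples, and the prompt wording are assumptions for illustration, not the authors' actual prompt or serving setup.

```python
# Hypothetical sketch of the renaming workflow; endpoint, prompt wording,
# and example names are assumptions, not the authors' implementation.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # assumed local server

TG263_NAMES = ["Parotid_L", "Parotid_R", "SpinalCord", "Esophagus"]  # illustrative excerpt only
FEW_SHOT = "Ruggenmerg -> SpinalCord\nParotis links -> Parotid_L"     # assumed example outputs

def rename_structure(original_name: str) -> str:
    """Ask the locally deployed LLM for the TG-263 name of one structure."""
    prompt = (
        "Rename the following radiotherapy structure according to the AAPM TG-263 guideline.\n"
        f"Allowed standardized names: {', '.join(TG263_NAMES)}\n"
        f"Examples:\n{FEW_SHOT}\n"
        f"Original name: {original_name}\n"
        "Answer with the standardized name only."
    )
    response = client.chat.completions.create(
        model="DeepSeek-R1-Distill-Llama-70B",  # or Llama-3.1-70B-Instruct for the baseline
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

def accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Percentage of structures whose predicted name matches the physicist-assigned label."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return 100.0 * correct / len(ground_truth)
```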
Results: The reasoning LLM (DeepSeek-R1) achieved a renaming accuracy of 96.9%, compared with 92.4% for the non-reasoning LLM (Llama 3.1). On average, the reasoning LLM required 25.2 seconds per structure, whereas the non-reasoning LLM required only 1.2 seconds.
Conclusion: The reasoning LLM demonstrated very high accuracy in renaming non-standardized, multilingual radiotherapy structures according to the AAPM TG-263 guideline. Additionally, its chain-of-thought output improves the explainability of the assigned structure names, facilitating further refinement and potential prompt fine-tuning for future applications. However, because it generates significantly more output tokens during inference, the reasoning LLM is slower than the non-reasoning LLM.