Foundation Model-Augmented Learning for Automatic Delineation in Precision Radiotherapy

Authors: Xianjin Dai, PhD, Michael Gensheimer, Praveenbalaji Rajendran, Lei Xing, Yong Yang

Affiliation: Department of Radiation Oncology, Stanford University, Massachusetts General Hospital, Harvard Medical School

Abstract:

Purpose: Recent advances in the automatic delineation of radiotherapy treatment targets, which incorporate linguistic clinical data extracted by large language models (LLMs) into traditional visual-only segmentation, show great potential for improving target delineation accuracy. However, further enhancements in accuracy are required for clinical deployment. This study aims to address this challenge through fundamental innovations in model architecture based on general-purpose foundation models to improve target delineation accuracy.

Methods: We developed a novel model, named Segformer, featuring a transformer-based encoder and decoder built on the SWIN transformer framework to extract visual features from 3D CT scans. Linguistic features were extracted by LLMs from extensive patient clinical records and incorporated through a visual-language attention module. The proposed method was evaluated on a retrospective cohort of 2,985 cancer patients, including those with oropharyngeal, larynx, nasopharynx, hypopharynx, nasal cavity, and oral cavity cancers. Segformer’s performance was compared with that of Segformer without linguistic inputs and of state-of-the-art models, including Radformer and SWIN-UNetR, using the Dice Similarity Coefficient (DSC), Intersection over Union (IOU), and 95th-percentile Hausdorff Distance (HD95).
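As an illustration of the two overlap metrics named above, the following is a minimal sketch of how DSC and IOU could be computed for a pair of binary 3D masks. The function name `dice_and_iou`, the toy volumes, and the convention that two empty masks score 1.0 are assumptions made for this example and are not taken from the study. HD95 is omitted because it requires surface-distance computation.

```python
import numpy as np


def dice_and_iou(pred, truth):
    """Return (DSC, IOU) for two binary masks of identical shape.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    if pred.shape != truth.shape:
        raise ValueError("masks must have the same shape")

    intersection = int(np.logical_and(pred, truth).sum())
    total = int(pred.sum()) + int(truth.sum())
    union = total - intersection

    if total == 0:
        # Assumption: two empty masks are treated as perfect agreement.
        return 1.0, 1.0

    dsc = 2.0 * intersection / total   # 2|A ∩ B| / (|A| + |B|)
    iou = intersection / union         # |A ∩ B| / |A ∪ B|
    return dsc, iou


# Toy 4x4x4 volumes (made-up data): the prediction covers the
# 8-voxel reference target plus 4 extra voxels.
truth = np.zeros((4, 4, 4), dtype=bool)
truth[1:3, 1:3, 1:3] = True
pred = np.zeros((4, 4, 4), dtype=bool)
pred[1:3, 1:3, 1:4] = True

dsc, iou = dice_and_iou(pred, truth)
print(f"DSC = {dsc:.3f}, IOU = {iou:.3f}")
```

For a single pair of masks the two metrics are related by DSC = 2·IOU / (1 + IOU), so they rank individual segmentations identically. Their cohort averages can still differ in how they weight poor cases.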

Results: Segformer achieved a DSC of 0.78±0.10, an IOU of 0.70±0.09, and an HD95 of 6.42±5.9, outperforming Segformer without linguistic inputs (DSC: 0.73±0.12, IOU: 0.67±0.10, HD95: 12.35±11.62), Radformer with LAVE (DSC: 0.76±0.09, IOU: 0.69±0.08, HD95: 7.82±6.87), and SWIN-UNetR (DSC: 0.69±0.11, IOU: 0.64±0.09, HD95: 12.88±6.60). Paired t-tests confirmed that the improvements in segmentation performance were statistically significant (p < 0.05). Violin plots of the metric distributions and segmentation visualizations further supported Segformer’s ability to produce accurate and clinically relevant delineations.
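The paired t-test mentioned above compares two models on the same patients by testing whether the mean per-patient difference in a metric is zero. The following is a minimal sketch using only the Python standard library. The function name `paired_t_statistic` and the per-patient scores are hypothetical and are not data from the study.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples.

    t = mean(d) / (stdev(d) / sqrt(n)), where d holds the per-patient
    differences. Hypothetical helper for illustration only.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    if len(scores_a) < 2:
        raise ValueError("need at least two pairs")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("differences have zero variance")

    n = len(diffs)
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Made-up per-patient DSC values for two models on the same five patients.
with_text = [0.80, 0.75, 0.82, 0.70, 0.78]
without_text = [0.74, 0.73, 0.75, 0.69, 0.72]

t, dof = paired_t_statistic(with_text, without_text)
print(f"t = {t:.3f} with {dof} degrees of freedom")
```

The p-value follows from the Student t distribution with n - 1 degrees of freedom. In practice a statistics library routine such as `scipy.stats.ttest_rel` would typically be used to obtain it directly.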

Conclusion: The proposed Segformer network demonstrates a significant improvement in the delineation accuracy of radiotherapy treatment targets. By incorporating linguistic features alongside visual data, Segformer achieves superior performance compared to state-of-the-art methods. This approach paves the way for more precise and personalized radiotherapy workflows.
