Author: Hassan Bagher-Ebadian, Indrin J. Chetty, Mohamed Elshaikh, Ahmed I Ghanem, Mohammad M. Ghassemi, Reza Khanmohammadi, Benjamin Movsas, Shayan Siddiqui, Kundan S Thind, Jawad Turfa
Affiliation: Michigan State University; Department of Radiation Oncology, Cedars-Sinai Medical Center; Department of Radiation Oncology, Henry Ford Health-Cancer, Detroit, MI; Alexandria Department of Clinical Oncology, Faculty of Medicine, Alexandria University; Henry Ford Health
Purpose: Extracting late radiotherapy-induced toxicities from free-text clinical notes using natural language processing is complicated by the need to identify negated (absent) symptoms, by high computational demands, and by data privacy constraints. This study introduces a novel parameter-efficient fine-tuning method for compact language models, combining Low-Rank Adaptation (LoRA) with Chain-of-Thought prompting to improve accuracy and efficiency while preserving data privacy.
Methods: Two Llama-based models (3.2-3B and 3.1-8B) were fine-tuned to extract long-term toxicities from 5,848 expert-labeled clinical notes of 100 prostate cancer patients who received definitive radiation therapy (78-79.2 Gy in 39-44 fractions) between 2017 and 2021. LoRA with a rank of 128 was applied, targeting the attention and feed-forward layers for efficient parameter tuning and continual learning. Chain-of-Thought prompting was incorporated to improve reasoning during toxicity classification. Five-fold stratified cross-validation was performed, with each fold split into 4,675 training, 584 validation, and 589 testing samples. Models were evaluated on precision, recall, and F1 scores, with a focus on negative and positive toxicity symptoms; statistical significance was tested using the Wilcoxon signed-rank test.
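The LoRA adaptation and Chain-of-Thought prompting described above can be illustrated with a minimal sketch. The LoRA part shows the core weight update W' = W + (alpha/r)·B·A on a tiny matrix (the study used rank r = 128 on attention and feed-forward layers; the dimensions, alpha value, and prompt wording here are illustrative assumptions, not the study's actual configuration):

```python
# Minimal, illustrative sketch of the LoRA weight update (pure Python).
# LoRA freezes a pretrained weight matrix W (d_out x d_in) and learns a
# low-rank correction: W' = W + (alpha / r) * B @ A, where only
# A (r x d_in) and B (d_out x r) are trained. The study used r = 128;
# tiny dimensions are used here for clarity.

def matmul(X, Y):
    """Multiply two matrices given as lists of lists."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(X))]

def lora_adapted_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, leaving the frozen W unchanged."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Tiny example: d_out = d_in = 2, rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight
A = [[0.5, 0.5]]               # trainable, shape r x d_in
B = [[2.0], [0.0]]             # trainable, shape d_out x r
W_prime = lora_adapted_weight(W, A, B, alpha=1.0, r=1)
# W_prime == [[2.0, 1.0], [0.0, 1.0]]

# A Chain-of-Thought prompt template of the kind used for toxicity
# classification (hypothetical wording, not the study's exact prompt):
COT_PROMPT = (
    "Read the clinical note and reason step by step:\n"
    "1. List any radiotherapy-related symptoms mentioned.\n"
    "2. Decide whether each symptom is affirmed or explicitly negated.\n"
    "3. Output the final label: POSITIVE or NEGATIVE.\n"
    "Note: {note}\nReasoning:"
)
```

The explicit reasoning steps in the prompt are what target the negated-symptom problem: the model must first decide whether each symptom is affirmed or denied before committing to a label.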
Results: For the 3.1-8B model, precision, recall, and F1 scores for negative classifications improved from 0.52 [0.49-0.56], 0.90 [0.83-0.91], and 0.64 [0.60-0.70] to 0.98 [0.95-1.00], 0.94 [0.91-0.95], and 0.93 [0.91-0.96], respectively. For positive classifications, precision, recall, and F1 scores increased from 0.83 [0.80-0.85], 0.89 [0.87-0.91], and 0.85 [0.83-0.87] to 0.93 [0.90-0.97], 1.00 [0.95-1.00], and 0.95 [0.93-0.96], respectively. The 3.2-3B model showed similar improvements, with F1 scores rising from 0.48 [0.44-0.52] to 0.87 [0.81-0.91] for negative classifications and from 0.63 [0.60-0.68] to 0.83 [0.76-0.85] for positive classifications. All improvements were statistically significant (p<0.05, Wilcoxon signed-rank test).
Conclusion: This novel fine-tuning approach significantly improves compact language model performance in extracting radiotherapy-induced toxicities, particularly for negative toxicity symptoms. This efficient method provides a privacy-preserving solution for automated toxicity extraction and monitoring in radiation oncology.