Author: Junwen Liu, Mengzhen Wang, Ning Wen, Jifeng Xiao, Fuhua Yan, Yanzhao Yang, Xuekun Zhang, Zheyu Zhang
Affiliation: Department of Radiology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine; The SJTU-Ruijin-UIH Institute for Medical Imaging Technology, Shanghai Jiao Tong University School of Medicine
Purpose: This study aims to develop and evaluate a large language model (LLM) fine-tuned to generate consistent and accurate impressions from imaging findings. Additionally, the study investigates the LLM's impact on workflow efficiency and diagnostic accuracy, as well as computational optimizations for large-scale model fine-tuning.
Methods: We performed full-network Supervised Fine-Tuning (SFT) on an LLM using a diverse dataset of 603,943 abdominal CT reports collected over 12 years at our hospital. After rigorous preprocessing, 520,442 reports remained, 70% of which were used for SFT. Evaluation was conducted on 20,000 cases, focusing on the 1,375 with the largest discrepancies between the LLM-generated outputs and the ground truth, as measured by ROUGE-L F1 scores. Fine-tuning was run on multi-GPU servers, in part with memory-efficiency optimizations such as ZeRO-3 and ZeRO-Offload. Evaluation by 14 professional radiologists was used to assess the model's practical utility.
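For illustration, the discrepancy-based case selection described above could be implemented as in the following minimal sketch, using the open-source rouge-score package. The function name, data layout, and the assumption of whitespace-tokenizable text are ours, not the study's actual implementation.

```python
# Minimal sketch: rank cases by ROUGE-L F1 between the LLM-generated
# impression and the ground-truth impression, and keep the k cases with
# the largest discrepancy (lowest F1). Names and data layout are
# illustrative assumptions; rouge-score's default tokenizer assumes
# whitespace-separable text, so Chinese reports would need a tokenizer.
from rouge_score import rouge_scorer

def select_largest_discrepancies(cases, k=1375):
    """cases: list of dicts with 'ground_truth' and 'llm_impression' keys."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scored = []
    for case in cases:
        score = scorer.score(case["ground_truth"], case["llm_impression"])
        scored.append((score["rougeL"].fmeasure, case))
    scored.sort(key=lambda pair: pair[0])  # lowest F1 first = largest discrepancy
    return [case for _, case in scored[:k]]

# Usage: worst_cases = select_largest_discrepancies(evaluation_cases, k=1375)
```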
Results: Among the 1,375 cases evaluated, 17.1% showed complete semantic alignment between the LLM-generated answers and the ground truth, indicating baseline consistency between the two. However, the ground truth was selected as the best answer in only 39.5% of cases and rated as the worst in 22.9%. In contrast, the LLM-generated outputs were selected as the best answer in 78.1% of evaluations, reflecting their superior reliability and robustness. These findings highlight the model's ability to mitigate the limitations of ground-truth data and demonstrate the LLM's effectiveness in generating high-quality impressions.
Conclusion: This study shows that full-network SFT of LLMs on a comprehensive, carefully curated real-world clinical dataset can produce reliable impressions and enhance the quality of radiology reports. The optimizations implemented in training, including multi-GPU setups with ZeRO optimizations, offer valuable insights for balancing computational efficiency with resource availability when fine-tuning large-scale models for medical applications.
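For reference, a minimal sketch of a DeepSpeed configuration enabling ZeRO-3 with ZeRO-Offload, in the spirit of the memory-efficiency setup described in Methods; the batch sizes and precision settings shown are placeholder assumptions, not the study's actual values.

```python
# Illustrative DeepSpeed configuration: ZeRO stage 3 shards parameters,
# gradients, and optimizer states across GPUs, while ZeRO-Offload moves
# optimizer and parameter states to CPU memory. All numeric values are
# placeholder assumptions, not the study's settings.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},  # mixed precision further reduces memory
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
}

# The dict can be passed to Hugging Face's Trainer via
# TrainingArguments(deepspeed=ds_config, ...) or used directly with
# deepspeed.initialize(model=model, config=ds_config, ...).
```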