Evaluating the Performance of Using Large Language Models to Automate Summarization of CT Simulation Orders in Radiation Oncology

Author: Meiyun Cao, Edward L. Clouser, Xiaoning Ding, Jason Michael Holmes, Shaw Hu, Linda L. Lam, Wendy S. Lindholm, Wei Liu, Samir H. Patel, Diego Santos Toesca, Jason Sharp, Sujay A. Vora, Peilong Wang 👨‍🔬

Affiliation: Department of Radiation Oncology, Mayo Clinic, Mayo Clinic Arizona, George Washington University 🌍

Abstract:

Purpose: In current clinical workflow of radiation oncology departments, therapists manually summarize CT simulation orders into summaries before the CT simulation for execution. This process significantly increases workload, introduces variability in documentation quality, and is prone to human errors. To address these challenges, this study aims to use a large language model (LLM) to automate the generation of summaries from the CT simulation orders and evaluate its performance.
Methods: A total of 607 patients’ CT simulation orders were collected from the Aria database at our institution. A locally hosted Llama 3.1 405B model, accessed via the Application Programming Interface (API) service, was used to extract keywords from the CT simulation orders and generate summaries. The downloaded CT simulation orders were categorized into seven groups based on treatment modalities and disease sites. For each group, a customized instruction prompt was developed collaboratively with therapists to guide the Llama 3.1 405B model in generating summaries. The ground truth for the corresponding summaries was manually derived by carefully reviewing each CT simulation order and subsequently verified by therapists. The accuracy of the LLM-generated summaries was evaluated by therapists using the verified ground truth as a reference.
Results: Over 98% of the LLM-generated summaries matched the manually generated ground truth in terms of accuracy. Our evaluations showed an improved consistency in format and enhanced readability of the LLM-generated summaries compared to the corresponding therapists-generated summaries. This automated approach demonstrated a consistent performance across all groups, regardless of modality or disease site.
Conclusion: This study demonstrated high accuracy and consistency of the Llama 3.1 405B model in extracting keywords and summarizing CT simulation orders, suggesting that LLMs hold great potential in assisting with this task, reducing therapists’ workload, and improving workflow efficiency.

Evaluating the Performance of Using Large Language Models to Automate Summarization of CT Simulation Orders in Radiation Oncology 📝

Abstract: