Reasoning-Driven Prompts Improve EHR-Based Outcome Prediction and Clinical Interpretability in Large Language Models

Authors: Shreyas Anil, Jason Chan, Arushi Gulati, Yannet Interian, Hui Lin, Benedict Neo, Andrea Park, Bhumika Srinivas

Affiliations: Department of Otolaryngology-Head and Neck Surgery, University of California San Francisco; Department of Data Science, University of San Francisco; Department of Radiation Oncology, University of California San Francisco; University of San Francisco

Abstract:

Purpose: As Large Language Models (LLMs) continue to evolve, their ability to analyze Electronic Health Record (EHR) notes for clinical decision support expands. Chain-of-Thought (CoT) reasoning, an emergent property of LLMs, has shown potential to enhance complex reasoning tasks. This study evaluates the effectiveness of CoT reasoning for both information retrieval and survival outcome prediction in a head and neck cancer patient cohort, comparing a commercial LLM, Claude 3.5 Sonnet, with an open-source LLM, MegaBeam-Mistral-7B.

Methods: We aggregated unstructured EHR physician notes from diagnosis through 365 days post-diagnosis and predicted 5-year survival outcomes (alive/deceased) for a sample of 200 patients. We tested three prompting strategies: (1) zero-shot prompting; (2) self-generated CoT, in which models iteratively extract key clinical insights; and (3) structured template-based CoT, which guides models to extract tumor location, treatment plan, TNM staging, chemotherapy status, and readmission history, then uses these extracted factors as prompts for prediction. We compared information retrieval accuracy and classification performance across the three strategies.
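The structured template-based CoT strategy can be illustrated with a minimal sketch. The field list comes from the abstract; the prompt wording, function name, and two-step phrasing are assumptions for illustration, not the study's actual prompt.

```python
# Hypothetical sketch of the structured template-based CoT prompt (strategy 3).
# The clinical factors are those named in the abstract; the exact instructions
# used in the study are an assumption.

TEMPLATE_FIELDS = [
    "tumor location",
    "treatment plan",
    "TNM staging",
    "chemotherapy status",
    "readmission history",
]

def build_structured_cot_prompt(notes: str) -> str:
    """Assemble a two-step prompt: extract the templated clinical factors,
    then predict 5-year survival from those factors alone."""
    field_list = "\n".join(f"- {field}" for field in TEMPLATE_FIELDS)
    return (
        "You are reviewing aggregated physician notes for a head and neck "
        "cancer patient.\n"
        "Step 1: Extract the following factors from the notes:\n"
        f"{field_list}\n"
        "Step 2: Using only the extracted factors, reason step by step and "
        "predict whether the patient is alive or deceased at 5 years "
        "post-diagnosis.\n\n"
        f"Notes:\n{notes}"
    )
```

The same template doubles as the information-retrieval probe: the Step 1 extractions can be scored against chart review, while Step 2 yields the survival classification.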

Results: Claude 3.5 Sonnet consistently outperformed MegaBeam-Mistral-7B in both extraction accuracy and reasoning quality. The structured template approach yielded the highest classification performance (accuracy = 0.70, F1 = 0.76 for Claude 3.5 Sonnet), surpassing both zero-shot prompting (accuracy = 0.62, F1 = 0.71) and self-generated CoT (accuracy = 0.66, F1 = 0.74). Information retrieval accuracy also improved with structured templates: Claude 3.5 Sonnet achieved 100% accuracy for tumor location and surgery type, compared with 72% and 81% for MegaBeam-Mistral-7B.

Conclusion: This study demonstrates that structured CoT reasoning enhances both predictive accuracy and clinical factor extraction from EHR notes. By incorporating templated reasoning prompts, LLMs can produce more reliable and interpretable predictions.
