Author: Omar Awad, Alfredo Enrique Echeverria, Issam M. El Naqa, Daniel Allan Hamstra, Yiding Han, Ryan Lafratta, Abdallah Sherif Radwan Mohamed, Piyush Pathak, Zaid Ali Siddiqui, Baozhou Sun, Vincent Ugarte
Affiliation: H. Lee Moffitt Cancer Center, Harris Health, Baylor College of Medicine
Purpose:
Accurate detection and segmentation of brain metastases are critical for diagnosis, treatment planning, and follow-up imaging, but manual assessment is labor-intensive and subject to inter-observer variability. Deep learning models often lack sensitivity and precision for small lesions, as well as robustness across datasets. This study aims to enhance the robustness of AI-based segmentation models for brain metastases in pretreatment and follow-up MRI by optimizing loss functions and leveraging multi-dataset training.
Methods:
A DeepMedic-based network with a custom loss function, parameterized by a sensitivity/specificity trade-off factor α, was trained on T1 post-contrast MRI datasets from two institutions (371 patients, 3416 lesions). Gamma Knife patient data (105 patients, 397 lesions) from a third institution, annotated by two physicians, were used for testing. Nine models were developed using different datasets and α values to assess robustness. Performance metrics (sensitivity, precision, Dice similarity, and 95% Hausdorff distance) were evaluated using one physician's contours as the reference.
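The abstract does not give the exact form of the loss; one common formulation that exposes a sensitivity/specificity trade-off through a single factor α is the Tversky-style loss sketched below. This is a minimal PyTorch illustration assuming soft foreground probabilities and a binary mask; the function name and exact weighting are assumptions, not the study's implementation.

```python
import torch

def alpha_tradeoff_loss(pred, target, alpha=0.5, eps=1e-6):
    """Tversky-style loss with a sensitivity/specificity trade-off factor alpha.

    alpha weights false negatives and (1 - alpha) weights false positives:
    alpha -> 1 favors sensitivity (fewer missed lesions), alpha -> 0 favors
    specificity/precision (fewer false detections). Illustrative sketch only.
    """
    pred = pred.flatten()              # soft foreground probabilities
    target = target.flatten().float()  # binary ground-truth mask
    tp = (pred * target).sum()            # soft true positives
    fn = ((1.0 - pred) * target).sum()    # soft false negatives
    fp = (pred * (1.0 - target)).sum()    # soft false positives
    tversky = (tp + eps) / (tp + alpha * fn + (1.0 - alpha) * fp + eps)
    return 1.0 - tversky
```

Under this formulation, α=0.5 penalizes false negatives and false positives equally, while pushing α toward 1 trades precision for sensitivity, matching the behavior of the α=0.99 model on sub-0.1 cc lesions reported below.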
Results:
For pretreatment data, α=0.5 yielded the best F1-score (0.88) for lesions larger than 0.1 cc, with sensitivity of 0.88±0.04, precision of 0.87±0.04, Dice scores of 0.7±0.03, and 95% Hausdorff distances of 2.95±0.2 mm. These results were statistically indistinguishable from the second physician's performance. For lesions smaller than 0.1 cc, α=0.99 improved sensitivity (0.788±0.11), although precision remained low (~0.1±0.01). For follow-up data, the α=0.5 model exhibited slightly lower sensitivity and precision (at the edge of statistical significance) but achieved statistically better Dice scores (model vs. physician: 0.7 vs. 0.61, p=0.007) and 95% Hausdorff distances (3 vs. 4.36 mm, p=5×10⁻¹²) than the physician contours.
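For reference, the segmentation-quality metrics reported above are typically computed from binary masks as in the minimal NumPy/SciPy sketch below; the surface extraction via erosion and the voxel-spacing handling are standard choices assumed here, not taken from the study's evaluation code.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice(a, b):
    """Dice similarity coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def hd95(a, b, spacing=(1.0, 1.0, 1.0)):
    """95th-percentile symmetric surface distance (in mm, given mm spacing)."""
    a, b = a.astype(bool), b.astype(bool)
    sa = a & ~binary_erosion(a)  # surface voxels of mask a
    sb = b & ~binary_erosion(b)  # surface voxels of mask b
    # distance of each surface voxel of one mask to the other mask's surface
    d_ab = distance_transform_edt(~sb, sampling=spacing)[sa]
    d_ba = distance_transform_edt(~sa, sampling=spacing)[sb]
    return np.percentile(np.hstack([d_ab, d_ba]), 95)
```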
Conclusion:
Adjusting the detection loss and training data can enhance model accuracy and robustness. Our approach approximated expert-level performance for lesions >0.1 cc, and for follow-up data, it surpassed experts in segmentation quality. However, due to limited ground truth, further work is needed to detect very small lesions accurately.