Author: Laurence Edward Court, Raphael Douglas, David Fuentes, Anuja Jhingran, Barbara Marquez, Raymond Mumme, Christine Peterson, Julianne M. Pollard-Larkin, Surendra Prajapati, Dong Joo Rhee, Thomas J. Whitaker
Affiliation: MD Anderson Cancer Center, The University of Texas MD Anderson Cancer Center, MD Anderson, Department of Radiation Physics, The University of Texas MD Anderson Cancer Center
Purpose: Safe deployment of auto-contouring models requires automated quality assurance (QA). One approach is to use an independent auto-contouring model and compare the two sets of contours geometrically for acceptability. This approach can be ineffective because geometric differences may not correlate with clinically significant (dosimetric) errors. Here we investigate whether a two-contour QA system is improved by including dose in this comparison.
Methods: VMAT plans were generated for 91 head and neck (H&N) patients and 50 cervical cancer (GYN) patients, using clinically approved PTVs and auto-contoured organs-at-risk (OARs) from a primary auto-contouring model. Doses to the primary auto-OARs were compared with doses to manually drawn and approved OARs (“the truth”). Differences in Dmean or Dmax ≥ 2 Gy were identified as reporting errors (Derror). A second, independent auto-contouring model was then used to contour the OARs (verification). The primary and verification auto-contours were compared geometrically (DSC, sDSC, HD95, MSD) and dosimetrically (Dmean, Dmax). The ability of comparison metrics between the two auto-contours to flag actual dosimetric errors (i.e., primary model compared with the truth) was investigated. A logistic regression model was used to predict Derror ≥ 2 Gy. The data were divided into a 50/50 stratified train/test split; 10-fold cross-validation was employed during training to limit overfitting. H&N structures were divided into size-specific groups to improve model performance and generalizability.
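The Derror label described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `OarDose` record, the `dose_errors` function, and the example values are hypothetical. The 2 Gy threshold on Dmean and Dmax comes from the Methods; treating the difference as an absolute value is an assumption.

```python
from dataclasses import dataclass

THRESHOLD_GY = 2.0  # from the Methods: differences >= 2 Gy count as errors


@dataclass(frozen=True)
class OarDose:
    """Dose summary for one OAR contour (hypothetical record)."""
    d_mean: float  # mean dose in Gy
    d_max: float   # maximum dose in Gy


def dose_errors(auto: OarDose, truth: OarDose, threshold: float = THRESHOLD_GY):
    """Return (mean_error, max_error) flags for one structure.

    Assumption: the error is the absolute difference between the dose to the
    primary auto-contour and the dose to the manually approved contour.
    """
    mean_error = abs(auto.d_mean - truth.d_mean) >= threshold
    max_error = abs(auto.d_max - truth.d_max) >= threshold
    return mean_error, max_error


# Made-up values: Dmean differs by 1.0 Gy, Dmax by 3.5 Gy.
flags = dose_errors(OarDose(d_mean=20.0, d_max=45.0), OarDose(d_mean=21.0, d_max=48.5))
```

In this sketch, `flags` marks only the Dmax difference as an error; these per-structure flags would serve as the binary targets that the logistic regression model is trained to predict.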
Results: Including dose metrics in the logistic regression model to predict Derror,mean increased ROC-AUC and AU-PRC by 0.19±0.04 and 0.51±0.10, respectively, in the bootstrapped test set for H&N small structures. For Derror,max, ROC-AUC and AU-PRC increased by 0.34±0.02 and 0.50±0.04 (H&N small structures), 0.10±0.0 and 0.16±0.01 (H&N medium structures), and 0.09±0.01 and 0.07±0.02 (GYN structures), respectively.
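One way a bootstrapped ROC-AUC gain of this kind (mean ± standard deviation) could be computed is sketched below. The abstract does not describe the bootstrap procedure, so resampling test cases with replacement is an assumption, and `roc_auc`, `bootstrap_auc_gain`, and the toy scores are hypothetical. AU-PRC is not covered here.

```python
import random
from statistics import mean, stdev


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_gain(labels, scores_geom, scores_geom_dose, n_boot=1000, seed=0):
    """Mean and standard deviation of the ROC-AUC gain from adding dose
    metrics, estimated by resampling test cases with replacement."""
    if len(set(labels)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(labels)
    gains = []
    while len(gains) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # skip resamples that contain only one class
        with_dose = roc_auc(y, [scores_geom_dose[i] for i in idx])
        geom_only = roc_auc(y, [scores_geom[i] for i in idx])
        gains.append(with_dose - geom_only)
    return mean(gains), stdev(gains)


# Toy example with made-up predicted probabilities (not data from the study).
labels = [0, 0, 0, 1, 1, 1]
geom = [0.2, 0.6, 0.4, 0.5, 0.3, 0.7]
geom_dose = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
gain_mean, gain_sd = bootstrap_auc_gain(labels, geom, geom_dose, n_boot=200)
```

Pairing the two models on the same resampled cases, as done here, means the reported spread reflects the variability of the difference rather than of each model separately.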
Conclusion: Utilizing dose with geometric comparisons can improve the ability of a verification model to flag potential errors in a primary auto-contouring model.