A Clinical Evaluation of Two Commercially Available Deep-Learning Algorithms for Automated Organs at Risk Contouring

Author: Steven DiBiase, Gurtej S. Gill, Haohua Billy Huang, Nicholas J. Lavini, Luxshan Shanmugarajah, Salar Souri, Samantha Wong

Affiliation: Stony Brook University, Northwell Health, Cornell University, NewYork-Presbyterian, New York-Presbyterian

Abstract:

Purpose: Deep learning-based algorithms have entered clinical radiation oncology as organ-at-risk (OAR) auto-contouring programs. We evaluated the accuracy of two such algorithms (Radformation’s AutoContour and Therapanacea’s Annotate) against physician-approved contours to determine whether implementing them would increase efficiency and speed without sacrificing quality.
Methods: This retrospective study evaluated multiple OARs across four anatomical sites (A: head and neck, B: thorax, C: abdomen, and D: pelvis), comparing contours generated by Radformation’s AutoContour and Therapanacea’s Annotate with physician-approved contours. We calculated Dice similarity coefficients (DSC) to quantify agreement and used paired t-tests to assess the statistical significance of differences between the algorithms.
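As a rough illustration of the two quantities named in the Methods, the sketch below computes a Dice similarity coefficient, DSC = 2|A ∩ B| / (|A| + |B|), and a paired t statistic. It is not the study's software: the function names `dice` and `paired_t`, the representation of a contour as a set of voxel indices, and all numbers are assumptions made for this example.

```python
from math import sqrt
from statistics import mean, stdev


def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks.

    Each mask is assumed to be an iterable of voxel indices that lie
    inside the contour (an illustrative representation, not the study's).
    """
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))


def paired_t(x, y):
    """Return (t statistic, degrees of freedom) for paired samples x, y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [xi - yi for xi, yi in zip(x, y)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    return mean(diffs) / (sd / sqrt(n)), n - 1


# Toy 2D masks (illustrative values, not study data).
reference = {(0, 0), (0, 1), (1, 0), (1, 1)}
automatic = {(0, 1), (1, 1), (0, 2), (1, 2)}
print(dice(reference, automatic))

# Toy paired scores for two hypothetical algorithms on the same three cases.
t_stat, dof = paired_t([3, 5, 7], [2, 3, 4])
print(t_stat, dof)
```

Converting the t statistic to a p-value needs the Student t distribution with `dof` degrees of freedom; in practice a statistics library routine such as SciPy's `scipy.stats.ttest_rel` returns both the statistic and the p-value directly.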
Results: In Group A, the mean DSC for the left parotid was 0.812 ± 0.030 for Therapanacea and 0.793 ± 0.051 for Radformation (p = 0.079). For the right parotid, Therapanacea outperformed Radformation with a mean DSC of 0.800 ± 0.067 versus 0.777 ± 0.070 (p = 0.006). The mandible showed a statistically significant difference, with a mean DSC of 0.851 ± 0.050 for Therapanacea versus 0.888 ± 0.024 for Radformation (p = 0.009).
In Group B, Therapanacea’s lung contours achieved mean DSC values of 0.964 ± 0.018 and 0.973 ± 0.010, whereas Radformation’s lung contours had mean DSC values of 0.944 ± 0.019 and 0.959 ± 0.014, respectively (p = 2.86E-05 and p = 0.0013). The heart contours showed no statistically significant difference (p = 0.872).
No statistically significant differences were found for any structure in Group C (p = 0.798, p = 0.309, and p = 1) or Group D (p = 0.434, p = 0.213, and p = 0.901).
Conclusion: Both systems integrated easily into the existing workflow without disruption. Although isolated differences were found, neither algorithm showed a significant overall improvement over the other. Either algorithm could be successfully adopted while maintaining the high quality of work we take pride in.
