Demographic Attributes of the Train-Test Sets and Their Impact on AI Performance: Medical Imaging Applications

Authors: Maryellen L. Giger, Fahd Hatoum, Robert Tomek, Heather M. Whitney

Affiliation: The University of Chicago

Abstract:

Purpose: To assess the importance of applying stratified sampling across demographic attributes (including age, sex, race, and ethnicity) when constructing training and testing datasets for ML-based disease classification.
Methods: We used an ML classifier that diagnoses COVID-19 from patients' chest radiographs. The classifier was trained and tested on a dataset of N=50,000 chest radiographs, which was split into training (80% of cases) and testing (20% of cases) subsets using either stratified sampling or random sampling. The characteristics considered for stratification were age, race, ethnicity, sex, and COVID-19 status (positive or negative). To quantify the similarity between the two subsets produced by each method, we created a multidimensional version of the Jensen-Shannon distance (JSD) that combines the individual JSDs for each demographic attribute and for disease state into a single score. Each sampling method was repeated N=200 times to create train-test splits with differing degrees of similarity. For each train-test split, classifier performance was quantified by the area under the ROC curve (AUC). For each method, the JSD was then plotted against the AUC for all 200 trials.
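The split-and-score procedure described in the Methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the per-attribute JSDs are combined, so the root-mean-square used in `combined_jsd` is an assumption, as are all function names and the record format (one dict per case, with continuous attributes such as age already binned into categories).

```python
import math
import random
from collections import Counter, defaultdict


def jsd(p, q):
    """Jensen-Shannon distance (base 2, range 0..1) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


def attribute_jsd(train, test, attr):
    """JSD between the train and test distributions of one categorical attribute."""
    categories = sorted({r[attr] for r in train} | {r[attr] for r in test})
    n_train, n_test = Counter(r[attr] for r in train), Counter(r[attr] for r in test)
    p = [n_train[c] / len(train) for c in categories]
    q = [n_test[c] / len(test) for c in categories]
    return jsd(p, q)


def combined_jsd(train, test, attrs):
    """Single similarity score over all attributes.

    ASSUMPTION: root-mean-square of the per-attribute JSDs; the abstract
    does not specify the combination rule.
    """
    distances = [attribute_jsd(train, test, a) for a in attrs]
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def random_split(records, test_frac, rng):
    """Shuffle all cases and cut once, ignoring their attributes."""
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]


def stratified_split(records, attrs, test_frac, rng):
    """Split each joint stratum (one per combination of attribute values) separately."""
    strata = defaultdict(list)
    for r in records:
        strata[tuple(r[a] for a in attrs)].append(r)
    train, test = [], []
    for key in sorted(strata):
        group = strata[key]
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

Repeating either split with different seeds and recording `combined_jsd(train, test, attrs)` alongside the classifier's AUC on each test subset would give the JSD-versus-AUC points described above. Strata with very few cases may contribute nothing to the test subset after rounding, so a real pipeline would need a policy for rare attribute combinations.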
Results: Stratified sampling on demographic attributes yielded more consistent classifier performance across train-test splits. AUC values at the low JSD scores (stratified sampling) ranged from 0.70 to 0.73, whereas AUC values at the higher JSD scores (random sampling) ranged from 0.66 to 0.77.
Conclusion: These results indicate that stratified sampling across demographic attributes (in addition to disease state) is important when training and testing AI models for medical purposes. They also suggest that such sampling frameworks may help researchers develop generalizable and "fair" AI.
