Attention-Based Multiple Instance Learning of Head and Neck Cancer Grading on Digital Pathology Using Vision-Language Foundational Models

Authors: Kyle J. Lafata, Xiang Li, Megan K. Russ, Zion Sheng

Affiliations: Department of Radiation Oncology, Duke University; Clinical Imaging Physics Group, Department of Radiology, Duke University Health System

Abstract:

Purpose: To adapt vision-language foundational models (VLFMs) to perform tumor grading of head and neck squamous cell carcinoma (HNSCC) on hematoxylin and eosin (H&E) whole slide images (WSIs) via attention-based multiple instance learning (ABMIL).
Methods: We utilized 140 digital pathology H&E WSIs (20Γ— magnification) of HNSCC from the CPTAC-HNSCC dataset, distributed equally across four classes: tumor grades G1, G2, and G3, and non-cancerous (NC). ABMIL, a weakly supervised learning approach, was employed to fine-tune selected state-of-the-art (SOTA) pathology VLFMs, including CONCH and PLIP. Each input WSI was divided into a bag of patches, which were encoded by the VLFM visual encoder. A trainable attention module then identified regions of interest by assigning an attention score to each patch embedding. The attention-weighted embeddings were aggregated into a slide-level representation and mapped to a probability for each class. Zero-shot and fine-tuned performance were evaluated by computing tumor-grading accuracy and F1 scores on the test set.
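The abstract does not spell out the pooling arithmetic, so the following is a minimal PyTorch sketch of the trainable attention module and classifier described above, assuming the classic ABMIL formulation of Ilse et al. (2018) on top of a frozen VLFM visual encoder; the class name `ABMILHead` and the dimensions (512-d patch embeddings, four classes) are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class ABMILHead(nn.Module):
    """Attention-based MIL pooling head (Ilse et al., 2018) on top of
    frozen patch embeddings from a VLFM visual encoder. The dimensions
    (512-d embeddings, 4 classes: G1/G2/G3/NC) are illustrative."""

    def __init__(self, embed_dim: int = 512, hidden_dim: int = 256, n_classes: int = 4):
        super().__init__()
        # Attention scoring: a_k = softmax_k( w^T tanh(V h_k) )
        self.attn_V = nn.Linear(embed_dim, hidden_dim)
        self.attn_w = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(embed_dim, n_classes)

    def forward(self, patch_embeds: torch.Tensor):
        # patch_embeds: (num_patches, embed_dim) -- one bag per WSI
        scores = self.attn_w(torch.tanh(self.attn_V(patch_embeds)))  # (N, 1)
        attn = torch.softmax(scores, dim=0)                          # (N, 1)
        slide_embed = (attn * patch_embeds).sum(dim=0)               # (embed_dim,)
        logits = self.classifier(slide_embed)                        # (n_classes,)
        return logits, attn.squeeze(-1)  # weights also drive the heatmaps

# Usage with a hypothetical bag of pre-computed patch embeddings:
head = ABMILHead()
bag = torch.randn(1000, 512)  # stand-in for frozen encoder outputs
logits, attn = head(bag)
```

A common design choice with foundation-model backbones, consistent with the modest single-GPU training budget reported in the Results, is to freeze the encoder and train only this lightweight head.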
Results: Each model's fine-tuning procedure was completed in just 16 hours on a single GPU. Our top-performing candidate, a CONCH backbone fine-tuned with ABMIL, achieved an overall F1 score of 75% on the test set and exceeded 90% for certain grade groups. It outperformed the zero-shot approach by 35% and baseline methods by as much as 50%. The model's attention heatmaps indicated that it successfully identified and focused on subregions associated with each tumor grade. Moreover, the model demonstrated robustness to staining variation: its performance was unaffected by whether stain normalization was applied.
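The abstract does not describe how the attention heatmaps were rendered. One common recipe, sketched below under the assumption that each patch's top-left pixel coordinates were recorded during tiling, is to paint the per-patch attention weights back onto the slide grid; the function name and arguments are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

def render_attention_heatmap(attn, coords, patch_size, slide_hw):
    """Paint per-patch ABMIL attention weights back onto the slide grid.

    attn: (N,) attention weights from the ABMIL head
    coords: (N, 2) top-left (x, y) pixel coordinates of each patch
    patch_size: patch edge length in pixels at the working magnification
    slide_hw: (height, width) of the slide at that magnification
    """
    heat = np.zeros(slide_hw, dtype=np.float32)
    for a, (x, y) in zip(attn, coords):
        heat[y:y + patch_size, x:x + patch_size] = a
    heat /= heat.max() + 1e-8  # normalize to [0, 1] for display
    plt.imshow(heat, cmap="jet")
    plt.axis("off")
    plt.title("ABMIL attention heatmap")
    plt.show()
```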
Conclusion: The ABMIL-fine-tuned VLFMs showed promising efficacy, accuracy, and interpretability in grading HNSCC tumors while maintaining a cost-effective training process.
