Integrating Multiple Modalities with Pretrained Swin Foundation Model for Head and Neck Tumor Segmentation

Authors: Jue Jiang, Aneesh Rangnekar, Shiqin Tan, Harini Veeraraghavan

Affiliations: Department of Medical Physics, Memorial Sloan Kettering Cancer Center; Weill Cornell Graduate School of Medical Sciences

Abstract:

Purpose: Clinicians often combine information from FDG-PET and CT to interpret and delineate the primary gross tumor volume (GTVp) and nodal gross tumor volume (GTVn) for radiotherapy planning in head and neck (HN) cancer patients. Hence, in this study, we aimed to create a deep learning model (XLinker) that optimally integrates information from multiple modalities to segment these volumes of interest in HN cancer patients.

Methods: A total of 524 FDG-PET and CT scans of patients with HN cancers containing GTVp and GTVn, provided through the HEad and neCK TumOR (HECKTOR) 2022 challenge, were used in the analysis. Our model, XLinker, uses a hierarchical shifted-window (Swin) transformer encoder for each modality with a convolutional U-Net style decoder. The encoder, pretrained with 10,412 unlabeled CT scans of patients with diverse diseases, was used to extract features separately from the FDG-PET and CT scans. Dual cross attention was implemented at stage 3 to aggregate features from the two modalities. The data were split into 80% training and 20% testing, and five-fold cross-validation with a combined Dice and cross-entropy loss was applied to the training set. The best model from each fold was included in an ensemble to generate the final segmentation on the test set. Comparisons were performed against models using PET and CT as two input channels, cross attention at stage 3 and/or stage 4, a randomly initialized Swin encoder, and PET only.
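To make the fusion step concrete, the following is a minimal PyTorch sketch of bidirectional ("dual") cross attention between stage-3 feature maps produced by two separate Swin encoders, one for CT and one for PET. This is an illustrative assumption of how such a module could be wired, not the authors' released implementation; the module name DualCrossAttention, the embedding width, the head count, and the channel-wise concatenation passed to the decoder are all hypothetical choices.

```python
# Illustrative sketch (not the authors' code): dual cross attention fusing
# stage-3 features from separate PET and CT Swin encoders.
import torch
import torch.nn as nn


class DualCrossAttention(nn.Module):
    """CT tokens attend to PET tokens and vice versa; the two attended
    streams are concatenated channel-wise for a U-Net style decoder."""

    def __init__(self, embed_dim: int = 384, num_heads: int = 8):
        super().__init__()
        self.ct_from_pet = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.pet_from_ct = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm_ct = nn.LayerNorm(embed_dim)
        self.norm_pet = nn.LayerNorm(embed_dim)

    def forward(self, ct_feat: torch.Tensor, pet_feat: torch.Tensor) -> torch.Tensor:
        # ct_feat, pet_feat: (B, C, D, H, W) feature maps from the two encoders.
        b, c, d, h, w = ct_feat.shape
        ct_tok = ct_feat.flatten(2).transpose(1, 2)    # (B, N, C) token sequences
        pet_tok = pet_feat.flatten(2).transpose(1, 2)  # (B, N, C)

        # CT queries attend over PET keys/values, and the reverse; residual + norm.
        ct_attn, _ = self.ct_from_pet(ct_tok, pet_tok, pet_tok)
        pet_attn, _ = self.pet_from_ct(pet_tok, ct_tok, ct_tok)
        ct_tok = self.norm_ct(ct_tok + ct_attn)
        pet_tok = self.norm_pet(pet_tok + pet_attn)

        # Concatenate the two fused streams and restore the spatial layout.
        fused = torch.cat([ct_tok, pet_tok], dim=-1)            # (B, N, 2C)
        return fused.transpose(1, 2).reshape(b, 2 * c, d, h, w)


if __name__ == "__main__":
    # Toy check with a small stage-3-sized feature map.
    ct = torch.randn(1, 384, 4, 8, 8)
    pet = torch.randn(1, 384, 4, 8, 8)
    print(DualCrossAttention()(ct, pet).shape)  # torch.Size([1, 768, 4, 8, 8])
```

In this sketch, each modality keeps its own encoder stream and only exchanges information through attention, which is the property that distinguishes the approach from simply stacking PET and CT as two input channels.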

Results: XLinker produced the best aggregated Dice similarity coefficient (DSC) of 0.811 for GTVp and 0.762 for GTVn. In comparison, the PET-only model (GTVp 0.792, GTVn 0.719), the model with a randomly initialized (non-pretrained) encoder (GTVp 0.806, GTVn 0.721), and the two-channel model (GTVp 0.808, GTVn 0.752) were less accurate.

Conclusion: Our analysis indicates that XLinker is capable of producing reasonably accurate automated segmentation of HN cancers by combining FDG-PET and CT scans.
