A novel approach to interpretable and efficient medical image analysis using weakly supervised learning.
Below are examples of heatmaps generated by INSIGHT, which highlight diagnostically relevant regions in whole-slide images (WSIs). Our method achieves this using only WSI-level labels, making it both efficient and interpretable without requiring costly pixel-level annotations.
The rapid growth of medical imaging data has presented significant challenges for developing diagnostic systems that are both accurate and interpretable. Traditional methods often rely on fully supervised approaches that require dense annotations, which are labor-intensive and costly to obtain. Moreover, existing aggregators, such as those based on multiple-instance learning (MIL), struggle to achieve a balance between classification accuracy and spatial calibration. While they can identify regions of interest, they typically depend on post-hoc visualization methods like Grad-CAM to generate interpretable outputs. This reliance on external tools introduces additional complexity and fails to integrate interpretability as a core feature of the model.
INSIGHT (Integrated Network for Segmentation and Interpretation with Generalized Heatmap Transmission) is a novel framework designed to analyze large-scale medical images, such as whole-slide pathology images (WSIs) and volumetric CT scans, while maintaining interpretability for clinicians. It addresses the limitations of traditional methods by embedding interpretability directly into its architecture, eliminating the need for post-hoc visualization tools like Grad-CAM. INSIGHT combines fine-grained local feature detection with broader contextual awareness through two key modules: the Detection Module, which captures small, diagnostically critical details, and the Context Module, which suppresses irrelevant activations by incorporating global contextual information. This design enables INSIGHT to generate heatmaps that closely align with ground-truth diagnostic regions, offering both accuracy and transparency. By requiring only image-level labels, INSIGHT significantly reduces the annotation burden while delivering state-of-the-art classification and weakly supervised segmentation performance.
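The split between the two modules is easiest to see in code. The PyTorch sketch below is illustrative only, not the authors' implementation: the layer sizes, the mean-pooled global summary, and the sigmoid gating are assumptions standing in for the paper's actual Detection and Context modules.

```python
import torch
import torch.nn as nn

class InsightSketch(nn.Module):
    """Illustrative Detection + Context sketch (not the paper's exact model)."""

    def __init__(self, feat_dim: int = 768, n_classes: int = 1):
        super().__init__()
        # Detection module: scores each patch/slice feature independently,
        # aiming to capture small, diagnostically critical details.
        self.detect = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, n_classes)
        )
        # Context module: maps a global summary of the image to a gate that
        # suppresses local activations inconsistent with the overall context.
        self.context = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, n_classes)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_patches, feat_dim) features from a frozen encoder
        local_scores = self.detect(feats)                    # (N, C) fine-grained evidence
        global_summary = feats.mean(dim=0, keepdim=True)     # (1, D) assumed global summary
        gate = torch.sigmoid(self.context(global_summary))   # (1, C) context-based gate
        return torch.sigmoid(local_scores) * gate            # (N, C) patch-level heatmap
```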
INSIGHT's inputs and architecture: (a) Images are pre-processed to extract pre-trained features from each CT slice or WSI patch. (b) These features are processed through the Detection and Context modules to generate slice- or patch-level heatmaps by incorporating both fine-grained details and broader contextual information. (c) Heatmaps are aggregated across slices or patches to produce a binary prediction for each category along with interpretable heatmaps.
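Step (c) can be sketched as pooling the patch-level heatmap into a slide-level probability, so the model trains end to end from WSI-level labels alone. Building on the `InsightSketch` class above, the top-k pooling and training details below are assumptions for illustration, not the paper's exact procedure:

```python
import torch
import torch.nn as nn

def slide_prediction(heatmap: torch.Tensor, k: int = 32) -> torch.Tensor:
    # heatmap: (num_patches, n_classes) from InsightSketch above.
    # Average the k highest-scoring patches per class, so a few strongly
    # activated regions can drive the slide-level decision (assumed pooling).
    k = min(k, heatmap.shape[0])
    return heatmap.topk(k, dim=0).values.mean(dim=0)   # (n_classes,)

model = InsightSketch(feat_dim=768, n_classes=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

feats = torch.randn(5000, 768)   # stand-in for one WSI's pre-extracted features
label = torch.tensor([1.0])      # slide-level label only; no pixel annotations

prob = slide_prediction(model(feats))   # (1,) slide-level probability
loss = nn.functional.binary_cross_entropy(prob, label)
loss.backward()
optimizer.step()
```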
Below is a comparison of INSIGHT with other aggregation methods on the CAMELYON16 and BRACS datasets. The table reports classification AUC on both datasets, weakly supervised segmentation Dice on CAMELYON16, and per-class plus macro AUC on BRACS.
Aggregator | CAMELYON16 AUC | CAMELYON16 Dice (%) | BRACS ADH AUC | BRACS FEA AUC | BRACS DCIS AUC | BRACS Invasive AUC | BRACS Macro AUC
---|---|---|---|---|---|---|---
ABMIL | 0.975 | 55.8 ± 25.0 | 0.656 | 0.744 | 0.804 | 0.995 | 0.800
CLAM-SB | 0.966 | 64.7 ± 24.1 | 0.611 | 0.757 | 0.833 | 0.999 | 0.800
CLAM-MB | 0.973 | 67.7 ± 22.6 | 0.701 | 0.687 | 0.828 | 0.998 | 0.804
TransMIL | 0.982 | 12.4 ± 22.4 | 0.644 | 0.653 | 0.769 | 0.989 | 0.764
INSIGHT (Ours) | 0.990 | 74.6 ± 19.1 | 0.734 | 0.790 | 0.837 | 0.999 | 0.840
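For context on the Dice column: segmentation quality is scored by comparing a binarized heatmap against the ground-truth tumor mask. A minimal version of that computation (the 0.5 threshold is an assumption) looks like:

```python
import numpy as np

def dice_score(heatmap: np.ndarray, mask: np.ndarray, thresh: float = 0.5) -> float:
    """Dice overlap between a binarized heatmap and a ground-truth mask."""
    pred = heatmap >= thresh
    gt = mask.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # both empty: count as perfect agreement (convention)
    return 2.0 * np.logical_and(pred, gt).sum() / denom
```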
This work was supported in part by NSF award #2326491. The views and conclusions contained herein are those of the authors and should not be interpreted as the official policies or endorsements of any sponsor. We thank Jhair Gallardo and Shikhar Srivastava for their comments on early drafts.