A novel approach to interpretable and efficient medical image analysis using weakly supervised learning.
Below are examples of heatmaps generated by INSIGHT, which highlight diagnostically relevant regions in whole-slide images (WSIs). Our method achieves this using only WSI-level labels, making it both efficient and interpretable without requiring costly pixel-level annotations.
Example WSI heatmap
The rapid growth of medical imaging data has presented significant challenges for developing diagnostic systems that are both accurate and interpretable. Traditional methods often rely on fully supervised approaches that require dense annotations, which are labor-intensive and costly to obtain. Moreover, existing aggregators, such as those based on multiple-instance learning (MIL), struggle to achieve a balance between classification accuracy and spatial calibration. While they can identify regions of interest, they typically depend on post-hoc visualization methods like Grad-CAM to generate interpretable outputs. This reliance on external tools introduces additional complexity and fails to integrate interpretability as a core feature of the model.
INSIGHT (Integrated Network for Segmentation and Interpretation with Generalized Heatmap Transmission) is a novel framework designed to analyze large-scale medical images, such as whole-slide pathology images (WSIs) and volumetric CT scans, while maintaining interpretability for clinicians. It addresses the limitations of traditional methods by embedding interpretability directly into its architecture, eliminating the need for post-hoc visualization tools like Grad-CAM. INSIGHT combines fine-grained local feature detection with broader contextual awareness through two key modules: the Detection Module, which captures small, diagnostically critical details, and the Context Module, which suppresses irrelevant activations by incorporating global contextual information. This design enables INSIGHT to generate heatmaps that closely align with ground-truth diagnostic regions, offering both accuracy and transparency. By requiring only image-level labels, INSIGHT significantly reduces the annotation burden while delivering state-of-the-art classification and weakly supervised segmentation performance.
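As a rough illustration of how the two modules could interact, the sketch below gates per-patch detection scores with a context-derived suppression term. Note that `patch_heatmap`, `w_det`, and `w_ctx` are hypothetical stand-ins for INSIGHT's learned detection and context networks; the specific linear/sigmoid forms here are simplifying assumptions, not the actual architecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def patch_heatmap(embeddings, w_det, w_ctx):
    """Toy detection + context-suppression pass over patch embeddings.

    embeddings: (n_patches, d) array of pretrained-encoder features.
    w_det, w_ctx: (d,) vectors standing in for the learned modules.
    Returns per-patch scores in [0, 1], i.e. a 1-D heatmap.
    """
    det = sigmoid(embeddings @ w_det)      # fine-grained, per-patch evidence
    global_ctx = embeddings.mean(axis=0)   # crude slide-level context summary
    gate = sigmoid(global_ctx @ w_ctx)     # scalar suppression gate in (0, 1)
    return det * gate                      # context-suppressed heatmap
```

The key property this toy version preserves is that the output keeps one score per patch (spatial resolution is never pooled away at this stage), so the heatmap itself is the model's intermediate representation rather than a post-hoc visualization.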
Overview of the INSIGHT framework: (a) Input images (WSIs or CT volumes) are preprocessed and transformed into spatial embeddings using a pretrained encoder. (b) Each slice or patch embedding is processed by INSIGHT, which consists of a detection module that captures fine-grained signals and a context suppression module that reduces false positives. This produces patch-level heatmaps for each category, which are then aggregated into whole-slide or volume-level heatmaps. Final predictions for each category are obtained via SmoothMax pooling over the heatmaps. Spatial resolution is preserved throughout, encouraging the model to produce reliable and interpretable heatmaps.
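The SmoothMax pooling step above can be sketched with a common smooth-maximum formulation, temperature-scaled log-sum-exp. This is an assumption about the operator's general shape, not necessarily INSIGHT's exact definition:

```python
import numpy as np

def smooth_max(scores, temperature=0.1):
    """Temperature-scaled log-sum-exp: a differentiable soft maximum.

    As temperature -> 0 this approaches max(scores); larger temperatures
    blend in contributions from non-maximal patches, so every patch in
    the heatmap receives gradient signal during training.
    """
    x = np.asarray(scores, dtype=float)
    m = x.max()  # subtract the max for numerical stability
    return m + temperature * np.log(np.exp((x - m) / temperature).sum())
```

Applied to a per-category heatmap, this yields a single slide- or volume-level score per category while keeping the pooling step differentiable with respect to all patch scores.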
Below is a comparison of INSIGHT with other aggregators, reporting classification AUC and weakly supervised segmentation Dice (mean ± std). The table covers the CAMELYON16 and BRACS datasets; INSIGHT is also evaluated on the MosMed CT dataset.
| Aggregator | AUC (CAMELYON16) | Dice (CAMELYON16) | ADH (BRACS) | FEA (BRACS) | DCIS (BRACS) | Invasive (BRACS) |
|---|---|---|---|---|---|---|
| ABMIL | 0.975 | 55.8 ± 25.0 | 0.656 | 0.744 | 0.804 | 0.995 |
| CLAM-SB | 0.966 | 64.7 ± 24.1 | 0.611 | 0.757 | 0.833 | 0.999 |
| CLAM-MB | 0.973 | 67.7 ± 22.6 | 0.701 | 0.687 | 0.828 | 0.998 |
| TransMIL | 0.982 | 12.4 ± 22.4 | 0.644 | 0.653 | 0.769 | 0.989 |
| WiKG | 0.967 | 66.1 ± 24.2 | 0.454 | 0.653 | 0.771 | 0.990 |
| INSIGHT (Ours) | 0.990 | 74.6 ± 19.1 | 0.734 | 0.790 | 0.837 | 0.999 |
This work was supported in part by NSF award #2326491. The views and conclusions contained herein are those of the authors and should not be interpreted as the official policies or endorsements of any sponsor. We thank Jhair Gallardo and Shikhar Srivastava for their comments on early drafts.