INSIGHT: Explainable Weakly-Supervised Medical Image Analysis

A novel approach to interpretable and efficient medical image analysis using weakly supervised learning.

Wenbo Zhang

University of Rochester

Junyu Chen

University of Rochester

Christopher Kanan

University of Rochester

Visualization

Below are examples of heatmaps generated by INSIGHT, which highlight diagnostically relevant regions in whole-slide images (WSIs). Our method achieves this using only WSI-level labels, making it both efficient and interpretable without requiring costly pixel-level annotations.

Motivation

The rapid growth of medical imaging data has presented significant challenges for developing diagnostic systems that are both accurate and interpretable. Traditional methods often rely on fully supervised approaches that require dense annotations, which are labor-intensive and costly to obtain. Moreover, existing aggregators, such as those based on multiple-instance learning (MIL), struggle to achieve a balance between classification accuracy and spatial calibration. While they can identify regions of interest, they typically depend on post-hoc visualization methods like Grad-CAM to generate interpretable outputs. This reliance on external tools introduces additional complexity and fails to integrate interpretability as a core feature of the model.

About INSIGHT

INSIGHT (Integrated Network for Segmentation and Interpretation with Generalized Heatmap Transmission) is a novel framework designed to analyze large-scale medical images, such as whole-slide pathology images (WSIs) and volumetric CT scans, while maintaining interpretability for clinicians. It addresses the limitations of traditional methods by embedding interpretability directly into its architecture, eliminating the need for post-hoc visualization tools like Grad-CAM. INSIGHT combines fine-grained local feature detection with broader contextual awareness through two key modules: the Detection Module, which captures small, diagnostically critical details, and the Context Module, which suppresses irrelevant activations by incorporating global contextual information. This design enables INSIGHT to generate heatmaps that closely align with ground-truth diagnostic regions, offering both accuracy and transparency. By requiring only image-level labels, INSIGHT significantly reduces the annotation burden while delivering state-of-the-art classification and weakly supervised segmentation performance.
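The detect-then-suppress idea described above can be sketched in a few lines. This is only an illustrative composition, not INSIGHT's actual layers: `detection` stands for per-patch class scores from the Detection Module, and `context_logits` for a hypothetical global-context gate produced by the Context Module.

```python
import numpy as np

def combine_heatmaps(detection, context_logits):
    """Illustrative detect-then-suppress composition (not the paper's
    exact architecture): patch scores from a detection branch are
    multiplied by a sigmoid gate derived from global context, so
    activations the context deems irrelevant are pushed toward zero."""
    gate = 1.0 / (1.0 + np.exp(-np.asarray(context_logits, dtype=np.float64)))
    return np.asarray(detection, dtype=np.float64) * gate
```

With a strongly negative context logit the gate is near 0 and the detection score is suppressed; with a strongly positive logit the score passes through almost unchanged.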

INSIGHT Architecture

Overview of the INSIGHT framework: (a) Input images (WSIs or CT volumes) are preprocessed and transformed into spatial embeddings using a pretrained encoder. (b) Each slice or patch embedding is processed by INSIGHT, which consists of a detection module to capture fine-grained signals and a context suppression module to reduce false positives. This produces patch-level heatmaps for each category, which are then aggregated to generate whole-slide or volume-level heatmaps. Final predictions for each category are obtained via SmoothMax pooling over the heatmaps. Throughout this process, spatial resolution is preserved, encouraging the model to produce reliable and interpretable heatmaps.
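The SmoothMax pooling step can be sketched as follows. The overview above does not spell out the exact formula, so this uses one common "smooth max": a softmax-weighted average that interpolates between mean pooling (small `alpha`) and max pooling (large `alpha`); INSIGHT's precise pooling may differ.

```python
import numpy as np

def smoothmax(heatmap, alpha=8.0):
    """Softmax-weighted average of heatmap scores. As alpha grows the
    result approaches max(heatmap); as alpha -> 0 it approaches the
    mean. One common 'SmoothMax' formulation, shown for illustration."""
    x = np.asarray(heatmap, dtype=np.float64).ravel()
    w = np.exp(alpha * (x - x.max()))  # shift by max for numerical stability
    return float((w * x).sum() / w.sum())
```

Applied per category to the slide- or volume-level heatmap, this yields one scalar prediction per class while letting gradients flow to every patch, unlike a hard max.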

Quantitative Performance

Below we compare INSIGHT with prior models on the CAMELYON16, BRACS, and MosMed datasets, reporting classification AUC and segmentation Dice.

CAMELYON16 & BRACS

                 CAMELYON16              BRACS (per-subtype AUC)
Aggregator       AUC    Dice (%)         ADH    FEA    DCIS   Invasive
ABMIL            0.975  55.8 ± 25.0      0.656  0.744  0.804  0.995
CLAM-SB          0.966  64.7 ± 24.1      0.611  0.757  0.833  0.999
CLAM-MB          0.973  67.7 ± 22.6      0.701  0.687  0.828  0.998
TransMIL         0.982  12.4 ± 22.4      0.644  0.653  0.769  0.989
WiKG             0.967  66.1 ± 24.2      0.454  0.653  0.771  0.990
INSIGHT (Ours)   0.990  74.6 ± 19.1      0.734  0.790  0.837  0.999

MosMed

Task            Model       Setting        AUC / Dice (%)
Classification  PR-3D-CNN   Five-fold CV   AUC 0.914 ± 0.049
                INSIGHT     Five-fold CV   AUC 0.962 ± 0.012
Segmentation    3D U-Net    Voxel-level    Dice 40.5 ± 21.3
                3D GAN      Volume-level   Dice 41.2 ± 14.7
                INSIGHT     Volume-level   Dice 42.7 ± 15.3

Acknowledgments

This work was supported in part by NSF award #2326491. The views and conclusions contained herein are those of the authors and should not be interpreted as the official policies or endorsements of any sponsor. We thank Jhair Gallardo and Shikhar Srivastava for their comments on early drafts.
