Image & Video

New FOCI Method Reveals Compact Rationales for WSI-MIL Models

For TransMIL, FOCI needs 32-56% fewer tiles than CLS-proxy ranking to recover full-slide predictions.

Deep Dive

Whole-slide image (WSI) multiple instance learning (MIL) models often achieve strong slide-level AUC but remain opaque. Attention scores are commonly used as explanations, but high attention can reflect aggregation preferences rather than a truly compact model rationale. Researchers from Seoul National University and Korea University introduce FOCI (Finding Optimal Contextual Instances), a lightweight readout layer that can be plugged into any frozen MIL backbone. FOCI is trained with sufficiency and exclusion objectives over keep/drop tile subsets, and evaluated using an insertion-style Sequential Reveal Protocol (SRP) adapted to WSI-MIL. The key metric is the Selection Headroom Index (SHI), which measures how much tile reduction is possible while preserving prediction accuracy.
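The sufficiency and exclusion idea can be illustrated with a small Python sketch. This is one plausible reading, not the authors' formulation: the toy `frozen_model`, the top-k keep/drop split, and the exact loss terms are all assumptions made for illustration, and a real readout would need a differentiable relaxation of the split to be trained by gradient descent.

```python
import math

def frozen_model(tile_evidence):
    """Toy stand-in for a frozen MIL model (assumption): sigmoid of summed tile evidence plus a bias."""
    return 1.0 / (1.0 + math.exp(-(sum(tile_evidence) - 2.0)))

def keep_drop_split(scores, k):
    """Split tile indices into the k highest-scoring tiles (keep) and the rest (drop)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order[:k], order[k:]

def sufficiency_exclusion_loss(tiles, scores, k):
    """Hypothetical objective: low when the keep set alone reproduces the
    full-slide prediction and the drop set alone does not."""
    full = frozen_model(tiles)
    keep, drop = keep_drop_split(scores, k)
    p_keep = frozen_model([tiles[i] for i in keep])
    p_drop = frozen_model([tiles[i] for i in drop])
    sufficiency = abs(p_keep - full)      # kept tiles should recover the prediction
    exclusion = 1.0 - abs(p_drop - full)  # dropped tiles should fail to recover it
    return sufficiency + exclusion

# Illustrative per-tile evidence values (made up, not from the paper).
tiles = [3.0, 0.1, 2.5, 0.0, -0.1, 0.2, 0.05, 0.0]
good_scores = tiles                # readout that ranks the informative tiles first
bad_scores = [-t for t in tiles]   # readout that ranks them last

good_loss = sufficiency_exclusion_loss(tiles, good_scores, k=2)
bad_loss = sufficiency_exclusion_loss(tiles, bad_scores, k=2)
print(good_loss < bad_loss)
```

In this toy setting, scores that place the two high-evidence tiles in the keep set yield a lower loss than scores that push them into the drop set, which is the behavior the two objectives are meant to reward.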

Across three WSI benchmarks (e.g., Camelyon16, TCGA) and seven MIL backbones (including TransMIL, ACMIL, and attention-pooling variants), FOCI shows that whether a compact rationale exists depends strongly on the aggregator architecture. For TransMIL, FOCI reduces the Minimum Sufficient K (MSK) tile count by 32-56% compared to standard CLS-proxy ranking. ACMIL with FOCI achieves the highest average SHI (+0.465). Transformer and multi-branch attention aggregators readily admit compact rationales, while hard-selection and minimal attention-pooling backbones hit a saturation regime. The authors stress that the identified tiles are candidate rationales for model-level audit, not clinical diagnostic claims. They position FOCI as an interpretability tool for verifying when a frozen MIL prediction can be localized to a small, output-consistent subset of tiles.
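To make the MSK comparison concrete, here is a minimal Python sketch of an insertion-style reveal and the resulting Minimum Sufficient K under two rankings. The toy `frozen_model`, the tolerance `tol`, the tile values, and the definition of "sufficient" as matching the full-slide probability within `tol` are assumptions for illustration; the paper's exact criterion may differ.

```python
import math

def frozen_model(tile_evidence):
    """Toy stand-in for a frozen MIL model (assumption): sigmoid of summed tile evidence plus a bias."""
    return 1.0 / (1.0 + math.exp(-(sum(tile_evidence) - 2.0)))

def reveal_curve(tiles, ranking):
    """Insertion-style reveal: add tiles in ranked order and record the prediction after each step."""
    revealed, curve = [], []
    for idx in ranking:
        revealed.append(tiles[idx])
        curve.append(frozen_model(revealed))
    return curve

def minimum_sufficient_k(tiles, ranking, tol=0.02):
    """Smallest k whose top-k prediction is within tol of the full-slide prediction."""
    full = frozen_model(tiles)
    for k, p in enumerate(reveal_curve(tiles, ranking), start=1):
        if abs(p - full) <= tol:
            return k
    return len(tiles)  # fallback: the whole slide is needed

# Illustrative per-tile evidence values (made up, not from the paper).
tiles = [3.0, 0.1, 2.5, 0.0, -0.1, 0.2, 0.05, 0.0]

# Hypothetical rankings: a learned readout that orders tiles by evidence,
# and a weaker proxy ranking that interleaves uninformative tiles.
readout_ranking = sorted(range(len(tiles)), key=lambda i: -tiles[i])
proxy_ranking = [1, 0, 5, 2, 6, 3, 7, 4]

msk_readout = minimum_sufficient_k(tiles, readout_ranking)
msk_proxy = minimum_sufficient_k(tiles, proxy_ranking)
reduction = 1.0 - msk_readout / msk_proxy
print(msk_readout, msk_proxy, f"{reduction:.0%}")  # prints: 2 4 50%
```

The 50% figure is an artifact of the toy data. It only shows how a relative MSK reduction such as the reported 32-56% would be computed from two reveal orders over the same frozen model.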

Key Points
  • For TransMIL, FOCI reduces the Minimum Sufficient K (MSK) tile count by 32-56% relative to CLS-proxy ranking across WSI benchmarks, enabling localized rationale extraction.
  • ACMIL + FOCI achieves the highest average Selection Headroom Index (SHI), +0.465, indicating the most headroom for tile reduction.
  • Transformer and multi-branch attention backbones allow compact rationales; hard-selection models conflict with external readout.

Why It Matters

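FOCI makes frozen WSI-MIL pathology models easier to audit by testing whether a prediction can be localized to a small, output-consistent subset of tiles. The selected tiles are candidate rationales for model-level review, not clinical diagnostic claims.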