SoftBlobGIN makes ESM-2 protein predictions transparent with graph partitioning

A lightweight GNN layer reveals hidden catalytic sites without retraining the language model

Deep Dive

Protein language models like Meta's ESM-2 have become powerful tools for predicting protein function, but their dense latent representations mix structural and evolutionary signals in ways that are notoriously hard to interpret. A new paper from Siddhant Dutta and collaborators (submitted to NeurIPS 2026) introduces SoftBlobGIN, a lightweight Graph Isomorphism Network that acts as a structural companion to ESM-2. The framework projects ESM-2 features onto protein contact graphs and applies differentiable Gumbel-softmax substructure pooling to learn coarse functional substructures. This allows the model to perform structure-aware message passing while producing directly auditable explanations via tools like GNNExplainer.
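The pooling step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the 8 Å contact cutoff, the DiffPool-style coarsening (S^T X and S^T A S), and every function and variable name here are assumptions made for illustration, and random arrays stand in for real ESM-2 embeddings and learned assignment logits.

```python
import numpy as np

def contact_graph(coords, cutoff=8.0):
    """Binary residue contact map from 3D coordinates (cutoff in angstroms is an assumption)."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    adj = (dist < cutoff).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-contacts
    return adj

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Relaxed (soft) categorical sample per row; lower tau gives sharper assignments."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(u))
    y = (logits + gumbel_noise) / tau
    y = y - y.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

def soft_blob_pool(node_feats, adj, assign_logits, tau=1.0, rng=None):
    """Coarsen N residues into K soft 'blobs' (hypothetical DiffPool-style operator)."""
    S = gumbel_softmax(assign_logits, tau, rng)  # (N, K) soft residue-to-blob assignment
    blob_feats = S.T @ node_feats                # (K, D) pooled blob features
    blob_adj = S.T @ adj @ S                     # (K, K) blob-level connectivity
    return S, blob_feats, blob_adj

# Toy usage: 6 residues, 4-dim stand-in features, 2 blobs.
rng = np.random.default_rng(0)
coords = rng.normal(size=(6, 3)) * 5.0
feats = rng.normal(size=(6, 4))    # stand-in for projected ESM-2 features
logits = rng.normal(size=(6, 2))   # stand-in for learned assignment logits
S, blob_feats, blob_adj = soft_blob_pool(feats, contact_graph(coords), logits, tau=0.5, rng=rng)
```

Because each row of S is a probability distribution over blobs, the assignment stays differentiable with respect to the logits in an autodiff framework, which is what would allow the partition to be learned end to end.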

On enzyme classification, SoftBlobGIN achieves 92.8% accuracy with a macro-F1 of 0.898. The larger gain is in interpretability: for binding-site detection, residue-level AUROC rises from 0.885 (with a standard ESM-2 linear probe) to 0.983, indicating that structural explanations are not recoverable from language-model features alone. The learned blob partitions group residues into functional clusters without any active-site supervision, and blobs that contain active-site residues show 1.85x higher importance than the rest (ρ=0.339, p=0.009). The framework also generalizes across ProteinShake tasks, reaching an F_max of 0.733 on Gene Ontology prediction and an AUROC of 0.969 on binding-site detection. With only ~1.1M added parameters and no retraining of the base language model, SoftBlobGIN positions itself as an interpretable structural companion that makes protein AI predictions more transparent and auditable.
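For readers unfamiliar with the metrics, the sketch below shows how a residue-level AUROC and a blob importance ratio of this kind could be computed. The AUROC uses the standard pairwise (Mann-Whitney) formulation. The ratio of mean importances is only one plausible reading of the "1.85x" figure, since the article does not define it. The function names and toy numbers are illustrative and are not data from the paper.

```python
import numpy as np

def auroc(scores, labels):
    """Probability that a random positive residue outscores a random negative one (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative residue")
    diff = pos[:, None] - neg[None, :]  # all positive-vs-negative score pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (pos.size * neg.size)

def importance_ratio(blob_importance, has_active_site):
    """Mean importance of active-site blobs over mean importance of the rest (assumed definition)."""
    imp = np.asarray(blob_importance, dtype=float)
    mask = np.asarray(has_active_site, dtype=bool)
    return imp[mask].mean() / imp[~mask].mean()

# Toy example: per-residue importance scores against binding-site labels.
residue_scores = [0.9, 0.8, 0.1, 0.2]
is_binding_site = [1, 1, 0, 0]
print(auroc(residue_scores, is_binding_site))  # 1.0: every binding residue outranks every other residue

# Toy example: four blobs, the first two contain active-site residues.
print(importance_ratio([2.0, 4.0, 1.0, 2.0], [True, True, False, False]))  # 2.0
```

The pairwise AUROC is quadratic in the number of residues, which is fine for single proteins. A rank-based implementation would be the usual choice for large evaluation sets.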

Key Points
  • SoftBlobGIN boosts binding-site detection AUROC from 0.885 (ESM-2 linear probe) to 0.983
  • Achieves 92.8% accuracy on enzyme classification with only ~1.1M additional parameters
  • Learned blobs containing active-site residues show 1.85x higher importance without any active-site supervision

Why It Matters

SoftBlobGIN makes black-box protein AI predictions auditable, which is crucial for transparency in drug discovery and synthetic biology.