AI Safety

SAEBER: Sparse Autoencoders for Biological Entity Risk

First interpretable biosecurity screen for protein design models explains why a protein is flagged.

Deep Dive

Researchers from the AI safety community have introduced SAEBER (Sparse Autoencoders for Biological Entity Risk), a biosecurity screening approach for protein design models. The work, published on LessWrong, is presented as the first application of mechanistic interpretability techniques to RFDiffusion3 (RFD3) and RoseTTAFold3 (RF3). The team trained Matryoshka BatchTopK Sparse Autoencoders (SAEs) on intermediate activations from these models, then fit logistic regression probes on the resulting features to classify protein designs as virulent or benign. Their best probe, trained on SAE features from block 12 of RFDiffusion3, achieved an AUROC of 0.817 ± 0.10 under homology-clustered splits, outperforming the corresponding raw-activation probe by +0.054 AUROC. They also identified individual SAE features whose activations alone separate virulent from benign designs at up to ~0.8 AUROC, with feature quality increasing with layer depth.
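The reported AUROC is measured under homology-clustered splits, meaning related proteins are kept on the same side of the train/test boundary so the probe cannot score well by recognizing near-duplicates. The sketch below illustrates that idea only; it is not the authors' code. The cluster labels, the `cluster_split` helper, and the test fraction are all assumptions made for illustration, and in practice the labels would come from some sequence or structure clustering step.

```python
import random
from collections import defaultdict


def cluster_split(cluster_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that no cluster spans both train and test.

    cluster_ids[i] is a hypothetical homology-cluster label for sample i.
    """
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    clusters = sorted(members)  # fixed order before shuffling, for reproducibility
    random.Random(seed).shuffle(clusters)

    target = test_fraction * len(cluster_ids)
    train, test = [], []
    for cid in clusters:
        # Whole clusters go to one side; fill test until it reaches its target size.
        (test if len(test) < target else train).extend(members[cid])
    return sorted(train), sorted(test)


# Toy cluster labels, invented for this example.
clusters = ["c1", "c1", "c2", "c3", "c3", "c3", "c4", "c5", "c5", "c6"]
train_idx, test_idx = cluster_split(clusters, test_fraction=0.3)
```

Because assignment happens per cluster, the test set can end up slightly larger than the requested fraction; that trade-off is usually preferable to splitting a cluster across both sides.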

SAEBER's classifier does not match the current state of the art (DTVF, 0.92 AUROC); its contribution is interpretability rather than raw accuracy. Existing biosecurity screens like DTVF operate on sequence similarity or structural homology and offer no insight into why a protein is flagged as hazardous. SAEBER is presented as the first method to provide structural, feature-level explanations for virulence predictions in a protein design model. That could enable runtime monitoring, steering, and interpretable guardrails during generation. The work extends previous efforts like FoldSAE, which discovered simple features predicting secondary structure, to all-atom diffusion models. SAEBER also builds on work by Goodfire on Evo 2 and MatterGen, and on InterProt and InterPLM for protein language models, but differs in targeting diffusion-based protein design models for biosecurity applications.
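One way a linear probe over sparse features can yield a feature-level explanation is by decomposing its logit into per-feature terms. The following is a minimal sketch of that general idea, not a description of SAEBER's implementation; the feature names, weights, bias, and the `explain_flag` helper are hypothetical.

```python
import math


def explain_flag(activations, weights, bias, top_n=3):
    """Return the probe's probability and the features that raised it most.

    For a logistic regression probe the logit is bias + sum_i w_i * a_i,
    so each term w_i * a_i is that feature's additive contribution.
    """
    contributions = {
        name: weights[name] * value for name, value in activations.items()
    }
    logit = bias + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    return probability, ranked[:top_n]


# Hypothetical SAE feature names and values, for illustration only.
weights = {"feature_17": 1.5, "feature_204": 0.8, "feature_9": -0.6}
activations = {"feature_17": 2.0, "feature_204": 0.0, "feature_9": 1.0}
probability, top = explain_flag(activations, weights, bias=-1.0, top_n=2)
```

Because SAE activations are sparse, most contributions are exactly zero, which keeps the list of features behind any single flag short enough to inspect by hand.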

Key Points
  • First application of mechanistic interpretability (SAEs) to RFDiffusion3 and RoseTTAFold3 for biosecurity
  • Best probe (SAE features from RFDiffusion3 block 12) achieved 0.817 AUROC under homology-clustered splits, +0.054 over the raw-activation probe; individual features reach up to ~0.8 AUROC
  • Provides feature-level explanations for virulence predictions, unlike existing black-box methods like DTVF (0.92 AUROC)
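
The per-feature AUROC figures above treat a single feature's activation as the classification score. As a rough illustration of what that number means, the sketch below computes AUROC as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The activations and labels are toy values invented for the example, not data from the work.

```python
def auroc(scores, labels):
    """AUROC as the chance that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy activations of one hypothetical SAE feature; 1 = virulent, 0 = benign.
feature_activation = [0.9, 0.0, 0.4, 0.0, 0.7, 0.1]
is_virulent = [1, 0, 1, 0, 0, 1]
score = auroc(feature_activation, is_virulent)
```

A value of 0.5 corresponds to a feature that carries no information about the label, and 1.0 to a feature that ranks every virulent example above every benign one.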

Why It Matters

SAEBER points toward interpretable biosecurity guardrails for protein design models, where a screen can show which learned features triggered a flag instead of returning only a score.