Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety
Massive 62,808-observation study shows safety scores swing wildly based on how you test AI models.
A landmark study titled 'Safety Under Scaffolding' by researcher David Gringras exposes critical flaws in how AI safety is currently measured. The research, spanning 62,808 primary observations across six frontier models and four deployment configurations, targets a basic mismatch: safety benchmarks evaluate models in artificial isolation using multiple-choice formats, while real-world deployments wrap them in agentic scaffolds with reasoning traces and delegation pipelines. The study's most striking finding is that simply switching identical test items from multiple-choice to open-ended format shifts safety scores by 5-20 percentage points, an effect larger than any architectural change.
Further analysis shows that model-by-scaffold interactions vary dramatically: on the same sycophancy benchmark, one model degraded 16.8 percentage points under map-reduce scaffolding while another improved 18.8 points, ruling out universal claims about scaffold safety. Most damningly, the generalizability analysis yielded a coefficient of G = 0.000, meaning model safety rankings reverse so thoroughly across benchmarks that no composite safety index achieves any statistical reliability. The study concludes that current safety ratings are essentially meaningless for predicting real-world behavior and calls for per-model, per-configuration testing as a minimum standard. All code, data, and prompts are released as the ScaffoldSafety project.
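To make the G = 0.000 result concrete: in generalizability theory, G is roughly the share of score variance attributable to stable model differences rather than model-by-benchmark noise. The study's exact procedure is not reproduced here; the sketch below uses made-up scores (all numbers illustrative, not from the paper) and a standard two-way variance decomposition to show how G collapses to zero when model rankings reverse across benchmarks.

```python
import numpy as np

# Illustrative scores only: rows = models, cols = benchmarks.
# Model 0 leads on benchmarks 0 and 2 but trails on benchmark 1,
# so the ranking reverses across benchmarks.
scores = np.array([
    [0.82, 0.41, 0.77],
    [0.65, 0.74, 0.52],
    [0.71, 0.58, 0.69],
])

n_models, n_benchmarks = scores.shape
grand = scores.mean()
model_means = scores.mean(axis=1)
bench_means = scores.mean(axis=0)

# Mean squares from a two-way (model x benchmark) ANOVA without replication
ms_model = n_benchmarks * ((model_means - grand) ** 2).sum() / (n_models - 1)
resid = scores - model_means[:, None] - bench_means[None, :] + grand
ms_resid = (resid ** 2).sum() / ((n_models - 1) * (n_benchmarks - 1))

# Model variance component, truncated at zero when interaction noise
# swamps the between-model signal
var_model = max((ms_model - ms_resid) / n_benchmarks, 0.0)

# Generalizability coefficient: stable model variance over
# stable variance plus relative error
g = var_model / (var_model + ms_resid / n_benchmarks)
print(f"G = {g:.3f}")  # rankings reverse, so G collapses to 0.000 here
```

With these toy numbers the model-by-benchmark interaction dwarfs the between-model differences, so the truncated model variance component is zero and G = 0.000, the same qualitative pattern the study reports for its composite safety index.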
Key Findings
- Switching from multiple-choice to open-ended testing formats changes safety scores by 5-20 percentage points
- Model safety rankings have zero reliability across benchmarks (G = 0.000), making composite safety scores meaningless
- One model degraded 16.8 points on sycophancy while another improved 18.8 points under identical map-reduce scaffolding
Why It Matters
Current AI safety ratings are unreliable for real-world deployment, forcing companies to test each model configuration individually.