Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety
Massive 62,808-observation study shows safety scores swing wildly based on how you test AI models.
A landmark study titled 'Safety Under Scaffolding' by researcher David Gringras exposes critical flaws in how AI safety is currently measured. The research, spanning 62,808 primary observations across six frontier models and four deployment configurations, targets a basic mismatch: safety benchmarks evaluate models in artificial isolation using multiple-choice formats, while real-world deployments wrap them in agentic scaffolds with reasoning traces and delegation pipelines. The study's most striking finding is that simply switching identical test items from multiple-choice to open-ended format shifts safety scores by 5-20 percentage points, an effect larger than any architectural change.
Further analysis shows that model-by-scaffold interactions vary dramatically: on the same sycophancy benchmark, one model degraded 16.8 percentage points under map-reduce scaffolding while another improved 18.8 points, ruling out universal claims about scaffold safety. Most damningly, the generalizability analysis yielded a coefficient of G = 0.000, meaning model safety rankings reverse so thoroughly across benchmarks that no composite safety index achieves any statistical reliability. The study concludes that current safety ratings are essentially meaningless for predicting real-world behavior and calls for per-model, per-configuration testing as a minimum standard. All code, data, and prompts are released as the ScaffoldSafety project.
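To make the G = 0.000 result concrete: in generalizability theory, G is roughly the share of score variance attributable to stable model differences rather than model-by-benchmark noise. The study's exact procedure is not reproduced here; the sketch below uses made-up scores (all numbers illustrative, not from the paper) and a standard two-way variance decomposition to show how G collapses to zero when model rankings reverse across benchmarks.

```python
import numpy as np

# Illustrative scores only: rows = models, cols = benchmarks.
# Model 0 leads on benchmarks 0 and 2 but trails on benchmark 1,
# so the ranking reverses across benchmarks.
scores = np.array([
    [0.82, 0.41, 0.77],
    [0.65, 0.74, 0.52],
    [0.71, 0.58, 0.69],
])

n_models, n_benchmarks = scores.shape
grand = scores.mean()
model_means = scores.mean(axis=1)
bench_means = scores.mean(axis=0)

# Mean squares from a two-way (model x benchmark) ANOVA without replication
ms_model = n_benchmarks * ((model_means - grand) ** 2).sum() / (n_models - 1)
resid = scores - model_means[:, None] - bench_means[None, :] + grand
ms_resid = (resid ** 2).sum() / ((n_models - 1) * (n_benchmarks - 1))

# Model variance component, truncated at zero when interaction noise
# swamps the between-model signal
var_model = max((ms_model - ms_resid) / n_benchmarks, 0.0)

# Generalizability coefficient: stable model variance over
# stable variance plus relative error
g = var_model / (var_model + ms_resid / n_benchmarks)
print(f"G = {g:.3f}")  # rankings reverse, so G collapses to 0.000 here
```

With these toy numbers the model-by-benchmark interaction dwarfs the between-model differences, so the truncated model variance component is zero and G = 0.000, the same qualitative pattern the study reports for its composite safety index.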
Key Findings
- Switching from multiple-choice to open-ended testing formats changes safety scores by 5-20 percentage points
- Model safety rankings have zero reliability across benchmarks (G = 0.000), making composite safety scores meaningless
- One model degraded 16.8 points on sycophancy while another improved 18.8 points under identical map-reduce scaffolding
Why It Matters
Current AI safety ratings are unreliable for real-world deployment, forcing companies to test each model configuration individually.