AI Safety

New eval method tests AI security across multiple dimensions

LessWrong AI June 28, 2026

⚡ Vary the sandbox, not just the model: a new approach to cyberhardening evals

Deep Dive

Quinn on LessWrong argues that current AI evals are too narrow: they are typically 1 × n or k × n, where k is the number of models and n the number of samples per model. The post proposes k × n × m evals that add a third axis of m variants, such as sandbox runtimes, proof stacks, or code implementations, while holding the red-teaming LLM constant. This flips the evaluation: the LLM becomes the oracle, and the infrastructure or component becomes the variable under test. The box-arena example from Apart's AI Control hackathon measured container invariant violations across base models (Anthropic, OpenAI) and OCI runtimes (gVisor, runc). The preliminary result is that gVisor provides stronger isolation than runc. The method scales because tokens are cheap and new runtimes (e.g., OCI runtimes written in Lean) are expected to appear.
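The grid described above can be sketched in a few lines of Python. This is a minimal illustration, not the box-arena code: the model identifiers, the `attempt_escape` stub, and the function names are all hypothetical, and the stub uses one uniform placeholder probability for every cell so that no result is implied.

```python
import itertools
import random
from collections import defaultdict

MODELS = ["model-a", "model-b"]   # hypothetical red-team model identifiers (k)
RUNTIMES = ["runc", "gvisor"]     # infrastructure variants under test (m)
N_SAMPLES = 50                    # attempts per (model, runtime) cell (n)


def attempt_escape(model, runtime, rng):
    """Stub for one red-team episode.

    A real harness would start a container under `runtime`, let `model`
    attack it, and report whether a container invariant was violated.
    Here every cell uses the same placeholder probability (an assumption).
    """
    return rng.random() < 0.1


def run_grid(models, runtimes, n_samples, attempt=attempt_escape, seed=0):
    """Return the violation rate for every (model, runtime) cell."""
    rng = random.Random(seed)
    violations = defaultdict(int)
    cells = list(itertools.product(models, runtimes))
    for model, runtime in cells:
        for _ in range(n_samples):
            if attempt(model, runtime, rng):
                violations[(model, runtime)] += 1
    return {cell: violations[cell] / n_samples for cell in cells}


def rate_by_runtime(grid, runtimes):
    """Average the per-cell rates over models, giving one rate per runtime."""
    result = {}
    for runtime in runtimes:
        rates = [rate for (_, r), rate in grid.items() if r == runtime]
        result[runtime] = sum(rates) / len(rates)
    return result


if __name__ == "__main__":
    grid = run_grid(MODELS, RUNTIMES, N_SAMPLES)
    print(rate_by_runtime(grid, RUNTIMES))
```

Averaging over the model axis is one way to read the runtime as the variable under test; keeping the full per-cell grid also shows whether a runtime's ranking depends on which model is attacking.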

Extending the idea, Quinn envisions leaderboards for critical libraries like OpenSSL. Developers could soon be choosing between legacy forks, codebases partially retrofitted with a proof stack (e.g., RefinedC), greenfield Rust rewrites, and Lean implementations. Rather than relying on performance benchmarks alone, such a leaderboard would use LLMs as black-hat oracles to measure relative vulnerability rates. Similar leaderboards could cover curl, Docker, Postgres, or Linux. The goal is empirical, multidimensional security comparison that goes beyond trusting a single proof. This is an ambitious but measurable path to cyberhardening, and one that is deployable now with existing LLM capabilities.
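A leaderboard of this kind would rank implementations by how often the LLM oracle finds a working exploit. The sketch below shows one possible way to do that; it is not taken from the post. The function names and the choice of a Wilson score interval to express sampling uncertainty are assumptions made for illustration.

```python
import math


def wilson_interval(successes, attempts, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    p = successes / attempts
    denom = 1 + z * z / attempts
    center = (p + z * z / (2 * attempts)) / denom
    half = z * math.sqrt(p * (1 - p) / attempts
                         + z * z / (4 * attempts * attempts)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def leaderboard(results):
    """Rank implementations from lowest to highest observed exploit rate.

    `results` maps an implementation name to (successful_exploits, attempts).
    Each row is (name, rate, lower_bound, upper_bound).
    """
    rows = []
    for name, (successes, attempts) in results.items():
        low, high = wilson_interval(successes, attempts)
        rows.append((name, successes / attempts, low, high))
    return sorted(rows, key=lambda row: row[1])
```

Reporting an interval next to each rate makes it visible when two implementations are too close to separate with the number of attempts run so far.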

Key Points

Multidimensional evals vary runtime environments (e.g., gVisor vs runc) alongside base models
Box-arena example: tested gVisor vs runc across Anthropic/OpenAI models; found gVisor safer
Aspirational: OpenSSL leaderboard comparing legacy, proof-stack-retrofitted, and greenfield implementations on attack surface

Why It Matters

Enables empirical security comparisons across software implementations and infrastructure choices, not just across models.
