New eval method uses AI red-teamers to test infrastructure security across multiple dimensions
Vary the sandbox, not just the model — a new approach to cyberhardening evals
Quinn on LessWrong argues that current AI evals are too narrow: typically 1 x n (one model, n samples) or k x n (k models, n samples each). Instead, we need k x n x m evals that add a third axis of m variants, such as sandbox runtimes, proof stacks, or code implementations, with the red-teaming LLM serving as a fixed attacker within each comparison. This flips the evaluation: the LLM becomes the oracle, and the infrastructure component becomes the variable under test. The box-arena example from Apart's AI Control hackathon tested container invariant violations across models from Anthropic and OpenAI and across OCI runtimes (gVisor, runc). The preliminary result: gVisor provides stronger isolation than runc. The method scales because tokens are cheap and new runtimes (e.g., Lean-written OCI runtimes) are expected to appear.
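To make the k x n x m shape concrete, here is a minimal Python sketch of the grid, assuming each red-team episode can be reduced to a single pass/fail "invariant violated" signal. It is not box-arena's actual harness: the names `run_grid`, `rate_by_variant`, `synthetic_attempt`, and the labels `model-a` and `runtime-x` are all hypothetical, and the stub attacker returns synthetic values purely so the example runs.

```python
import itertools
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# One red-team episode: (model, variant, sample index) -> was an invariant violated?
Attempt = Callable[[str, str, int], bool]


def run_grid(models: List[str], variants: List[str], n_samples: int,
             attempt: Attempt) -> Dict[Tuple[str, str], float]:
    """Run n_samples episodes for every (model, variant) cell; return violation rates."""
    rates = {}
    for model, variant in itertools.product(models, variants):
        hits = sum(attempt(model, variant, i) for i in range(n_samples))
        rates[(model, variant)] = hits / n_samples
    return rates


def rate_by_variant(rates: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Average each variant's violation rate over all attacking models."""
    grouped = defaultdict(list)
    for (_model, variant), rate in rates.items():
        grouped[variant].append(rate)
    return {variant: sum(r) / len(r) for variant, r in grouped.items()}


def synthetic_attempt(model: str, variant: str, sample_idx: int) -> bool:
    """Stand-in for a real LLM attack episode. Synthetic, NOT a measured result."""
    return variant == "runtime-x" and sample_idx % 4 == 0


# k = 2 models, m = 2 variants, n = 8 samples per cell (all placeholder values).
rates = run_grid(["model-a", "model-b"], ["runtime-x", "runtime-y"], 8, synthetic_attempt)
per_variant = rate_by_variant(rates)
```

In a real harness, `synthetic_attempt` would be replaced by a function that launches the sandbox variant, lets the model attack it, and checks the container invariants afterwards. The grid loop itself stays the same.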
Extending the idea, Quinn envisions leaderboards for critical libraries like OpenSSL. Soon developers could choose between legacy forks, partially retrofitted proof stacks (RefinedC), greenfield Rust, or Lean implementations. Instead of relying on performance benchmarks alone, we can use LLMs as blackhat oracles to measure relative vulnerability rates across those options. Similar leaderboards could apply to curl, Docker, Postgres, or Linux. The goal: empirical, multidimensional security comparisons that go beyond trusting a single proof. This is an ambitious but measurable path to cyberhardening, and one that is deployable now with existing LLM capabilities.
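A leaderboard of this kind could be as simple as ranking implementations by observed exploit rate. The Python sketch below is one illustrative way to do that, not a method described in the post. The `Entry` type, the `impl-a`/`impl-b`/`impl-c` labels, and all counts are hypothetical placeholders. The Wilson score interval is an addition here, included because raw rates from a small number of attack attempts can be misleading when comparing implementations.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entry:
    implementation: str   # e.g. a legacy fork, a retrofitted proof stack, a rewrite
    attempts: int         # LLM attack episodes run against it
    exploits: int         # episodes that produced a working exploit

    @property
    def rate(self) -> float:
        return self.exploits / self.attempts if self.attempts else 0.0


def wilson_interval(exploits: int, attempts: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%)."""
    if attempts == 0:
        return (0.0, 1.0)
    p = exploits / attempts
    denom = 1 + z * z / attempts
    center = (p + z * z / (2 * attempts)) / denom
    half = z * math.sqrt(p * (1 - p) / attempts + z * z / (4 * attempts * attempts)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def leaderboard(entries: List[Entry]) -> List[Entry]:
    """Lowest observed exploit rate first; ties broken by name for stable output."""
    return sorted(entries, key=lambda e: (e.rate, e.implementation))


# Placeholder counts for illustration only; these are not measurements.
entries = [Entry("impl-a", 200, 12), Entry("impl-b", 200, 3), Entry("impl-c", 200, 0)]
board = leaderboard(entries)
for e in board:
    low, high = wilson_interval(e.exploits, e.attempts)
    print(f"{e.implementation}: rate={e.rate:.3f} interval=({low:.3f}, {high:.3f})")
```

Overlapping intervals between two implementations suggest that more attack samples are needed before ranking one above the other.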
- Multidimensional evals vary runtime environments (e.g., gVisor vs runc) alongside base models
- Box-arena example: tested gVisor vs runc using Anthropic and OpenAI models as attackers; preliminary result favors gVisor
- Aspirational: OpenSSL leaderboard comparing legacy, proof-stack-retrofitted, and greenfield implementations on relative vulnerability rates
Why It Matters
Enables empirical security comparisons across different software implementations and infrastructure choices, with the LLM acting as the attacker rather than the thing being scored.