AI Safety

ARC launches $100K White-Box Estimation Challenge for MLPs

Improve mechanistic interpretability algorithms and win cash prizes. Warm-up starts this week.

Deep Dive

The Alignment Research Center (ARC) has teamed up with AIcrowd to launch the ARC White-Box Estimation Challenge, a contest designed to spur innovation in estimation algorithms for random multi-layer perceptrons (MLPs). The warm-up round begins this week, and later rounds carry a total prize pool of at least $100,000. Participants must design an algorithm that takes the weights of an MLP with a fixed width and number of hidden layers and produces an estimate of its expected output. Algorithms are evaluated on MLPs with randomly sampled Gaussian weights, and the goal is the lowest mean squared error. Compute is measured with a FLOP-counting scheme, which levels the playing field by minimizing the advantage of optimized numerical kernels.
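To make the task concrete, here is a minimal sketch of a black-box baseline: sample random inputs, run them through the network, and average the outputs. Several details are assumptions made for illustration and are not taken from the contest rules: square weight matrices, a ReLU after every layer, no biases, standard Gaussian inputs, and a weight scale of sqrt(2 / width). The function names are hypothetical, and the sketch does not attempt any FLOP counting.

```python
import numpy as np


def sample_mlp(width, depth, rng):
    """Draw `depth` square weight matrices with i.i.d. Gaussian entries.

    The sqrt(2 / width) scale is an assumption, not the contest's spec.
    """
    scale = np.sqrt(2.0 / width)
    return [rng.normal(0.0, scale, size=(width, width)) for _ in range(depth)]


def forward(weights, x):
    """Run a batch x of shape (n_samples, width) through the MLP.

    Assumes a ReLU after every layer and no bias terms.
    """
    h = x
    for W in weights:
        h = np.maximum(h @ W.T, 0.0)
    return h


def monte_carlo_estimate(weights, n_samples, rng):
    """Black-box estimate: average the outputs over random Gaussian inputs."""
    width = weights[0].shape[1]
    x = rng.standard_normal((n_samples, width))
    return forward(weights, x).mean(axis=0)


def mean_squared_error(estimate, reference):
    """Mean squared error between an estimate and a reference vector."""
    return float(np.mean((estimate - reference) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = sample_mlp(width=16, depth=3, rng=rng)
    # A large-sample run stands in for the true expectation here.
    reference = monte_carlo_estimate(weights, 200_000, rng)
    for n in (100, 10_000):
        est = monte_carlo_estimate(weights, n, rng)
        print(n, mean_squared_error(est, reference))
```

In this sketch the error of the sampling estimate shrinks only as more inputs are drawn, and each extra sample costs a full forward pass. That trade-off between accuracy and compute is what a FLOP-limited score would penalize.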

ARC's long-term goal is to develop white-box methods that can answer safety-relevant questions about highly intelligent AI systems, such as whether they might undermine human control in unusual situations. Black-box sampling can miss such behavior if the system recognizes and avoids honeypots, so ARC sees white-box approaches that use model internals as crucial. The challenge builds on a recent ARC paper showing that white-box methods outperform black-box sampling for wide MLPs but break down as depth increases. Contestants may use any method, including black-box sampling, but ARC expects the best solutions to be mechanistic. The contest also explores the use of LLMs in algorithm design. There will be two prize categories: one for top performance and one for the favorite algorithmic contribution described in a technical report.
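As a rough illustration of what "using model internals" can mean, the sketch below estimates the expected output directly from the weights, without sampling any inputs. It is not ARC's method. It is a generic moment-propagation approximation that treats every pre-activation as an independent Gaussian and pushes means and variances through each layer. The setup is assumed for illustration: ReLU after every layer, no biases, standard Gaussian inputs. All function names are hypothetical.

```python
import math

import numpy as np

# Element-wise error function built from the standard library.
_erf = np.vectorize(math.erf)


def relu_gaussian_moments(mu, var):
    """Mean and variance of ReLU(z) for z ~ N(mu, var), element-wise."""
    s = np.sqrt(np.maximum(var, 1e-12))  # guard against zero variance
    t = mu / s
    pdf = np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)
    cdf = 0.5 * (1.0 + _erf(t / np.sqrt(2.0)))
    m1 = mu * cdf + s * pdf                      # E[ReLU(z)]
    m2 = (mu**2 + s**2) * cdf + mu * s * pdf     # E[ReLU(z)^2]
    return m1, np.maximum(m2 - m1**2, 0.0)


def moment_propagation_estimate(weights):
    """White-box estimate of the expected output, using the weights only.

    Assumes standard Gaussian inputs and ignores correlations between
    hidden units, so it is exact for one layer and approximate beyond.
    """
    width = weights[0].shape[1]
    mean = np.zeros(width)
    var = np.ones(width)
    for W in weights:
        pre_mean = W @ mean
        pre_var = (W**2) @ var  # diagonal (independence) approximation
        mean, var = relu_gaussian_moments(pre_mean, pre_var)
    return mean


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = [rng.normal(0.0, np.sqrt(2.0 / 16), size=(16, 16)) for _ in range(3)]
    print(moment_propagation_estimate(weights))
```

The cost here is a couple of matrix-vector products per layer rather than one forward pass per sample. The independence assumption is the weak point: after the first layer the hidden units are correlated and no longer Gaussian, so the approximation error can compound with depth.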

Key Points
  • Prize pool of at least $100,000 for later rounds; warm-up round starts this week.
  • Challenge: design an algorithm that estimates the expected output of random MLPs (fixed width and number of hidden layers) with minimal mean squared error.
  • The contest explores the use of LLMs in algorithm design; two prize categories: top performance and favorite algorithmic contribution.

Why It Matters

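If the challenge produces white-box estimation methods that stay accurate as MLPs get deeper, it would move ARC closer to answering safety-relevant questions about advanced AI systems from their internals rather than from sampled behavior alone.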