AI Safety

Mechanistic estimation for wide random MLPs

No sampling needed: ARC's method predicts MLP outputs mechanistically, beating Monte Carlo.

Deep Dive

ARC's latest paper introduces a family of algorithms for mechanistic estimation of wide random multilayer perceptrons (MLPs). Instead of sampling many inputs and averaging the outputs (Monte Carlo), these algorithms compute the expected output analytically by propagating cumulants of the activation distribution through the network layer by layer. For ReLU MLPs with 4 hidden layers and width 256, the best algorithm matches Monte Carlo's accuracy using as little as one-tenth of the FLOPs, and outperforms it across seven orders of magnitude of compute budget. The method also shows dramatic gains in tail estimation, achieving under 30% relative error for probabilities 100 times smaller than the smallest Monte Carlo can resolve. The estimates are also differentiable, enabling a 'mechanistic distillation' proof of concept in which a student network is trained on the estimated outputs. While this does not yet surpass standard training, it opens the door to using mechanistic approaches for both model analysis and training.
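For intuition about what computing the expectation mechanistically can look like, here is a deliberately simplified sketch. This is our own toy, not ARC's algorithm: it pushes only per-unit means and variances through a random ReLU MLP, treats each preactivation as an independent Gaussian, and uses the closed-form Gaussian-ReLU moments. ARC's method propagates higher cumulants as well, and the helper names below are ours.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def relu_gaussian_moments(mu, var):
    """Exact mean and variance of ReLU(z) for z ~ N(mu, var), per unit."""
    sigma = np.sqrt(var)
    a = mu / sigma
    m = mu * norm.cdf(a) + sigma * norm.pdf(a)                    # E[ReLU(z)]
    s2 = (mu**2 + var) * norm.cdf(a) + mu * sigma * norm.pdf(a)   # E[ReLU(z)^2]
    return m, np.maximum(s2 - m**2, 0.0)

def moment_propagation(weights, biases, mu, var):
    """Push (mean, per-unit variance) through the MLP analytically.
    Approximation: cross-unit covariances are dropped at every layer."""
    for W, b in zip(weights, biases):
        mu, var = W @ mu + b, (W**2) @ var
        mu, var = relu_gaussian_moments(mu, var)
    return mu, var

def monte_carlo(weights, biases, n_samples, dim):
    """Baseline: sample Gaussian inputs and average the network's outputs."""
    x = rng.standard_normal((n_samples, dim))
    for W, b in zip(weights, biases):
        x = np.maximum(x @ W.T + b, 0.0)
    return x.mean(axis=0)

# A random 4-hidden-layer, width-256 ReLU MLP, matching the paper's setting.
dim, depth = 256, 4
weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(depth)]
biases = [np.zeros(dim) for _ in range(depth)]

mech, _ = moment_propagation(weights, biases, np.zeros(dim), np.ones(dim))
mc = monte_carlo(weights, biases, 10_000, dim)
print("relative disagreement:", np.linalg.norm(mech - mc) / np.linalg.norm(mc))
```

Note that the mechanistic estimate is an analytic, and hence differentiable, function of the weights; that is the property the distillation experiment exploits, and a cumulant-based method refines the idea beyond the first two moments.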

Key Points
  • Mechanistic estimation of random MLPs matches Monte Carlo sampling's mean squared error using up to 10x fewer FLOPs.
  • For low-probability events (<1e-4), the best algorithm achieves under 30% relative error at probabilities 100x smaller than the Monte Carlo sampling limit (see the back-of-envelope sketch after this list).
  • Proof-of-concept 'mechanistic distillation' trains a student network on differentiable mechanistic estimates, showcasing potential for broader use.
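Why sampling hits a wall in the tails: a plain i.i.d. Monte Carlo estimate of a probability p from N samples has relative standard error about sqrt((1 - p) / (p * N)), so resolving p to modest accuracy takes on the order of 1/p samples. A quick back-of-envelope check (our arithmetic, not a figure from the paper):

```python
import numpy as np

# Relative standard error of the Monte Carlo estimate of P(event) = p from
# N i.i.d. samples: sqrt((1 - p) / (p * N)) ~ 1 / sqrt(p * N) for small p.
def mc_relative_error(p, n_samples):
    return np.sqrt((1 - p) / (p * n_samples))

n = 10**6  # sampling budget
for p in [1e-4, 1e-5, 1e-6, 1e-7]:
    print(f"p = {p:.0e}: rel. std. error ~ {mc_relative_error(p, n):.0%}")
# p = 1e-4 gives ~10%, p = 1e-6 gives ~100%, p = 1e-7 gives ~316%: effectively
# unresolvable, which is why <30% error at 100x smaller p is a striking gain.
```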

Why It Matters

ARC's approach could transform model analysis: mechanistic estimates that are faster, more precise, and differentiable than sampling enable new forms of interpretability and training.