Architecture Determines Observability in Transformers
Some models hide their mistakes; others reveal them through internal signals...
Thomas Carmichael's new paper, "Architecture Determines Observability in Transformers," reveals a critical flaw in autoregressive transformers: they make confident errors that internal signals can catch, but only if the architecture preserves observability. Observability is defined as how well per-token decision quality can be read out linearly from frozen mid-layer activations, after controlling for max-softmax confidence and activation norm. That controlling step matters: confidence controls absorb 57.7% of the raw probe signal on average across 13 models in 6 families, including Pythia, Qwen 2.5, Llama, and Mistral.
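Here is a minimal sketch of that metric as described: fit a linear probe on frozen activations, then compute the partial correlation between probe output and decision quality after regressing out the two controls. The function names, the ridge probe, and the split scheme are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

def residualize(x, controls):
    """Return x minus its least-squares projection onto the controls."""
    reg = LinearRegression().fit(controls, x)
    return x - reg.predict(controls)

def observability(acts, quality, confidence, train_frac=0.5, alpha=1.0):
    """Partial correlation between a linear probe's readout and quality.

    acts:       (n_tokens, d) frozen mid-layer activations
    quality:    (n_tokens,) per-token decision quality (e.g. correctness)
    confidence: (n_tokens,) max-softmax probability of the chosen token
    """
    n = len(quality)
    split = int(n * train_frac)
    # Fit the probe on one split, evaluate on the held-out split.
    probe = Ridge(alpha=alpha).fit(acts[:split], quality[:split])
    score = probe.predict(acts[split:])
    # Controls: max-softmax confidence and activation norm.
    controls = np.column_stack([
        confidence[split:],
        np.linalg.norm(acts[split:], axis=1),
    ])
    # Partial correlation: correlate the residuals of probe score and
    # quality after regressing both on the controls.
    r_score = residualize(score, controls)
    r_quality = residualize(quality[split:], controls)
    return np.corrcoef(r_score, r_quality)[0, 1]
```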
The study shows observability is not generic. In Pythia's controlled suite, every 24-layer, 16-head configuration collapses to a partial correlation of ~0.10 across a 3.5x parameter gap and two dataset variants, while six other configurations maintain a healthy band from 0.21 to 0.38. The collapse emerges during training: the signal is erased in the (24L, 16H) class even as predictive loss improves. Across independent training recipes, the phenomenon persists: Qwen 2.5 and Llama differ by 2.9x at matched 3B scale, and Mistral 7B preserves observability where Llama 3.1 8B collapses. Crucially, a WikiText-trained observer transfers to downstream QA tasks, catching 10.9-13.4% of all errors that confidence misses at a 20% flag rate, showing that architecture selection is also a monitoring decision.
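The transfer result can be read as a fixed-budget flagging comparison. The sketch below is one plausible operationalization, not the paper's confirmed protocol: at a 20% flag budget, count what fraction of the errors a pure-confidence flagger misses are caught by the observer. The function name and tie-breaking behavior are assumptions.

```python
import numpy as np

def extra_errors_caught(observer_score, confidence, is_error, flag_rate=0.2):
    """Fraction of errors missed by confidence that the observer flags.

    observer_score: per-token probe output (higher = more error-like)
    confidence:     max-softmax probability (lower = flagged first)
    is_error:       boolean per-token error indicator
    """
    n = len(is_error)
    k = int(n * flag_rate)
    # Confidence baseline flags its k least-confident tokens.
    conf_flags = np.zeros(n, dtype=bool)
    conf_flags[np.argsort(confidence)[:k]] = True
    # The observer flags the k tokens it scores as most error-like.
    obs_flags = np.zeros(n, dtype=bool)
    obs_flags[np.argsort(-observer_score)[:k]] = True
    # Of the errors the confidence flagger misses, how many does the
    # observer catch at the same budget?
    missed_by_conf = is_error & ~conf_flags
    if missed_by_conf.sum() == 0:
        return 0.0
    return (missed_by_conf & obs_flags).sum() / missed_by_conf.sum()
```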
- Confidence controls absorb 57.7% of raw probe signal across 13 models in 6 families
- Pythia's 24-layer, 16-head configuration collapses to a partial correlation of ~0.10, while other configurations reach 0.21-0.38
- A WikiText-trained observer catches 10.9-13.4% of errors missed by confidence in 7 of 9 model-task cells
Why It Matters
Architecture choice directly determines whether a model's errors can be caught from its internal signals, making observability a selection criterion for developers building reliable AI systems.