Research & Papers

Disposition Distillation at Small Scale: A Three-Arc Negative Result

A rigorous study tested five small language models and found that every attempt to instill self-verification failed.

Deep Dive

Tinman Lab researcher Hari Sadasivan has published a significant negative result in AI alignment, detailed in the paper 'Disposition Distillation at Small Scale: A Three-Arc Negative Result.' The research set out with an ambitious goal: to instill specific behavioral 'dispositions'—such as self-verification, uncertainty acknowledgment, and feedback integration—into small, efficient language models ranging from 0.6B to 2.3B parameters. The models tested included Qwen3-0.6B, Qwen3-1.7B, Qwen3.5-0.8B, Gemma 4 E2B, and SmolLM2-1.7B-Instruct. The team employed a sophisticated four-stage all-MIT (Model-In-the-Token) distillation pipeline, aiming to create more reliable and honest AI assistants that could be deployed on edge devices.
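The pipeline itself is not released as code, but the basic mechanics of disposition distillation are easy to sketch. Below is a minimal, hypothetical Python example of the kind of LoRA-adapter supervised fine-tune the paper's first experimental arc relies on: the student is trained on teacher demonstrations that exhibit the target disposition (here, self-verification). The model name is one of the students listed above, but the data, hyperparameters, and target modules are illustrative assumptions, not the authors' actual four-stage setup.

```python
# Minimal sketch of LoRA SFT for disposition distillation.
# Hypothetical: this is NOT the paper's four-stage pipeline, just the
# general shape of fine-tuning on disposition demonstrations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen3-0.6B"  # one of the five students tested
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# LoRA: train small low-rank adapters while the base weights stay frozen.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Assumed data format: (prompt, teacher response) pairs where the teacher
# demonstrates the disposition, e.g. explicit self-verification.
pairs = [
    ("What is 17 * 24?",
     "17 * 24 = 408. Checking: 17 * 20 = 340 and 17 * 4 = 68, "
     "and 340 + 68 = 408, so the answer stands."),
]

opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
model.train()
for prompt, target in pairs:
    batch = tok(prompt + "\n" + target + tok.eos_token, return_tensors="pt")
    # Plain causal-LM loss over the whole sequence; a real run would
    # mask the prompt tokens out of the labels.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```

The study's finding is precisely that training of this kind moves surface style more readily than the underlying disposition, which is why the evaluation design matters as much as the training recipe.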

Despite initial promising signals that were later falsified, the study's core finding is a consistent failure across three exhaustive experimental arcs. The first arc involved fine-tuning methods such as SFT and DPO with LoRA adapters. The second tested inference-time interventions on the model's attention heads. The third explored a training-free 'frozen-base sidecar' that read the model's final hidden state. None of these techniques shifted judge-measured disposition scores without a critical trade-off: either the model's factual content quality degraded, or the model collapsed into superficial stylistic mimicry of honesty without genuine understanding. A key insight was that a probe trained to predict honesty on one set of prompts failed completely on fresh, out-of-distribution prompts, with its AUC dropping from 0.683 to a near-chance 0.516.
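That probe result is the most quantified claim in the summary, and it reduces to a simple generalization check. Here is a hedged sketch, assuming a logistic-regression probe over final hidden states and binary honesty labels from a judge; the arrays are random stand-ins for real features and labels, and only the AUC figures in the comments come from the paper.

```python
# Sketch: does an honesty probe trained on one prompt distribution
# transfer to fresh, out-of-distribution prompts? (Synthetic data.)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d = 1024  # hidden-state width (assumed)

# X_*: final hidden state per response; y_*: judge's honesty label (0/1).
X_train, y_train = rng.normal(size=(500, d)), rng.integers(0, 2, 500)
X_indist, y_indist = rng.normal(size=(200, d)), rng.integers(0, 2, 200)
X_ood, y_ood = rng.normal(size=(200, d)), rng.integers(0, 2, 200)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

auc_in = roc_auc_score(y_indist, probe.predict_proba(X_indist)[:, 1])
auc_ood = roc_auc_score(y_ood, probe.predict_proba(X_ood)[:, 1])
print(f"in-distribution AUC:     {auc_in:.3f}")   # paper reports 0.683
print(f"out-of-distribution AUC: {auc_ood:.3f}")  # paper reports 0.516 (chance ~ 0.5)
```

The gap between the two numbers is the whole diagnostic: an in-distribution AUC of 0.683 against a near-chance 0.516 on fresh prompts means the probe learned quirks of the prompt set, not honesty.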

As an independent and concerning finding, the study noted that the Gemma 4 E2B model exhibited a near-total 'confidence-correctness decoupling' in a specific domain, confidently asserting its answers 91% of the time regardless of whether they were right or wrong. Beyond the negative result, the paper's major contribution is methodological: it provides a detailed 'honest falsification pipeline' designed to convert the kind of false positives the authors themselves initially produced into rigorously documented, publishable negative findings. The pipeline is offered as a tool for the community to avoid over-optimistic claims in the challenging field of AI alignment.
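The decoupling finding likewise reduces to a simple measurement: compare the model's assertion rate on answers it got right against answers it got wrong. The sketch below uses synthetic data; how an 'assertion' was actually extracted from Gemma 4 E2B's outputs is an assumption on our part, not something the summary specifies.

```python
# Sketch: quantify confidence-correctness decoupling (synthetic data).
# If the assertion rate is the same for right and wrong answers, the
# model's expressed confidence carries no information about correctness.
import numpy as np

rng = np.random.default_rng(1)
correct = rng.integers(0, 2, size=1000).astype(bool)  # was the answer right?
asserted = rng.random(1000) < 0.91                    # did the model assert confidently?

rate_when_right = asserted[correct].mean()
rate_when_wrong = asserted[~correct].mean()
print(f"assertion rate | correct:   {rate_when_right:.2f}")
print(f"assertion rate | incorrect: {rate_when_wrong:.2f}")
```

For a calibrated model the two rates should differ sharply; the reported pattern, roughly 91% in both buckets, means expressed confidence was essentially uninformative about correctness in that domain.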

Key Points
  • Tested five small models (0.6B-2.3B params) including Qwen3 and Gemma 4, finding no reliable method to instill self-verification or honesty.
  • Three experimental arcs—fine-tuning, inference-time interventions, and a frozen-base sidecar—all failed to improve dispositions without damaging core model capabilities.
  • Contributes an 'honest falsification pipeline' to catch false positives; separately, Gemma 4 E2B showed a 91% assertion rate independent of answer correctness.

Why It Matters

Highlights the extreme difficulty of making small, deployable AI models genuinely honest and reliable, a major hurdle for safe real-world use.