Ambitious Mech Interp w/ Tensor-transformers on toy languages [Project Proposal]
Tensor-transformers make compositionality clear as day from weights alone, no data required.
A new research proposal from Pivotal (application deadline May 3rd) aims to crack open the black box of large language models by training tensor-transformers on toy languages built from known computational primitives. The key insight: while current mechanistic interpretability relies on post-hoc analysis of real LLMs to find patterns like induction heads and skip-trigrams, this project flips the script by constructing a controlled data-generating process in which those primitives are explicitly embedded. Because tensor-transformers expose direct relationships between model components through their weight structure (unlike standard neural networks, which can only be probed by running data through them), researchers can study fundamental problems such as suppression, error correction, and compositional reuse with ground-truth verification.
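To make "a data-generating process with primitives explicitly embedded" concrete, here is a minimal sketch of what such a generator could look like. The proposal doesn't spell out its generator, so the `BIGRAMS` table, probabilities, and function names below are all hypothetical: one primitive plants fixed bigram statistics, and another plants induction-style repetition so the ground-truth circuit a model must learn is known in advance.

```python
import random

# Hypothetical toy-language generator; the primitives and probabilities
# below are illustrative assumptions, not taken from the proposal.

# Bigram primitive: each subject has fixed next-token statistics.
BIGRAMS = {
    "alice": [("sees", 0.7), ("helps", 0.2), ("finds", 0.1)],
    "bob":   [("runs", 0.6), ("sees", 0.4)],
}

def sample_bigram(subject: str) -> str:
    """Sample a verb according to the subject's planted bigram statistics."""
    verbs, probs = zip(*BIGRAMS[subject])
    return random.choices(verbs, weights=probs, k=1)[0]

def sample_sentence() -> list[str]:
    subject = random.choice(list(BIGRAMS))
    return [subject, sample_bigram(subject)]

def sample_induction_sequence(n_repeats: int = 2) -> list[str]:
    """Induction primitive: repeating a pattern guarantees that [A][B] ... [A]
    is always followed by [B], so the required circuit is known up front."""
    pattern = sample_sentence()
    return pattern * n_repeats

if __name__ == "__main__":
    random.seed(0)
    print(sample_induction_sequence())  # e.g. ['alice', 'sees', 'alice', 'sees']
```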
Early results are promising: a 2-layer attention-only tensor-transformer trained on a simple toy language shows clean bigram statistics in its embed→unembed mapping; for example, 'alice' predicts 'sees' 70% of the time, 'helps' 20%, and 'finds' 10%. The proposal outlines specific research directions: enriching the data-generating process with nested structures and long-range dependencies, studying learning-order dependencies during training (must primitive X be learned before primitive Y?), building new interpretability tools, and exploiting the unique properties of tensor networks. If successful, this closed-loop approach could eventually let LLMs automate ambitious interpretability by verifying that simple mechanistic descriptions replicate model behavior. The project is seeking mentees; applications are due May 3rd.
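To show what reading bigram statistics straight from weights might look like, here is a hedged numpy sketch. `W_E` and `W_U` are assumed names for the embedding and unembedding matrices (the random stand-ins below are not the project's trained weights); the point is that the direct-path logit table `W_E @ W_U` can be computed and softmaxed from weights alone, with no forward passes over data.

```python
import numpy as np

vocab = ["alice", "sees", "helps", "finds"]
d_model = 8
rng = np.random.default_rng(0)

# Stand-in weights for illustration; in practice these would be read
# out of the trained tensor-transformer.
W_E = rng.normal(size=(len(vocab), d_model))   # embedding:   token -> d_model
W_U = rng.normal(size=(d_model, len(vocab)))   # unembedding: d_model -> token

# Direct embed->unembed path: a (d_vocab, d_vocab) next-token logit table.
direct_path = W_E @ W_U

# Softmax each row to turn logits into next-token probabilities.
probs = np.exp(direct_path - direct_path.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

# With trained weights, this row should recover the planted 70/20/10 stats.
for token, p in zip(vocab, probs[vocab.index("alice")]):
    print(f"alice -> {token}: {p:.2f}")
```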
- Toy languages built from known primitives (induction heads, skip-trigrams) allow ground-truth verification of model internals.
- Tensor-transformers expose compositionality directly in their weights, unlike standard neural networks, which must be probed by running data through them.
- Early 2-layer model already shows clear bigram statistics; project aims to scale to nested structures and long-range dependencies.
Why It Matters
Could create a self-improving cycle where LLMs automate the discovery of their own internal mechanisms.