Research & Papers

Applying Karpathy's autoresearch to a 33M-token public transit dataset (14% improvement, replication notes) [P]

An ML enthusiast tests Karpathy's LLM-driven research loop on tiny, specialized data — and it works.

Deep Dive

A transit industry professional forked Andrej Karpathy's autoresearch framework and applied it to a small 33M-token transit dataset (traffic analysis, train plans, regulatory Q&A) on a single RTX 5080. The agent ran 5-minute from-scratch training experiments to test whether the autonomous research loop holds up when the corpus is six orders of magnitude smaller than the framework's design target and the data is highly specialized. The output was intended as a methodology validation, not a deployable model.

Key Points
  • Achieved 14% perplexity improvement on a 33M-token transit dataset using Karpathy's autoresearch framework.
  • Used only a single RTX 5080 (16GB) with 5-minute from-scratch training experiments.
  • The framework's single-scalar ratchet and git-as-tracker pattern worked despite the corpus being 6 orders of magnitude smaller than its design target.
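The single-scalar ratchet boils down to a simple control loop: run an experiment, measure one number (here, validation perplexity), and keep the change only if it beats the best value seen so far, recording each accepted step as a git commit. A minimal sketch of that pattern, assuming a hypothetical `run_experiment` callable standing in for a 5-minute training run (the real autoresearch internals are not shown in the article):

```python
def ratchet_loop(run_experiment, n_iters, baseline_ppl):
    """Single-scalar ratchet: accept an experiment only if it strictly
    improves the best validation perplexity seen so far.

    Hypothetical sketch of the pattern described above, not the
    framework's actual implementation.
    """
    best = baseline_ppl
    accepted = []
    for i in range(n_iters):
        ppl = run_experiment(i)   # stands in for a short training run
        if ppl < best:            # strictly better -> ratchet forward
            best = ppl
            accepted.append(i)
            # git-as-tracker: each accepted improvement would become a
            # commit, e.g. via
            # subprocess.run(["git", "commit", "-am", f"exp {i}: ppl {ppl:.3f}"])
    return best, accepted


# Example with canned perplexity values: only runs 0, 1, and 3 improve
# on the running best, so only those are kept.
best, accepted = ratchet_loop(lambda i: [10.0, 9.5, 9.8, 8.6][i], 4, 10.5)
```

The appeal of the pattern for a resource-constrained setup is that the agent never needs to reason about whether a regression is "worth it": the scalar is the sole gate, and the git history doubles as the experiment log.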

Why It Matters

Shows that LLM-driven research automation works on small, specialized datasets — democratizing AI research for niche industries.