Research & Papers

[D] Why does it seem like open source materials on ML are incomplete? this is not enough...

Viral post criticizes incomplete code, missing training details, and superficial documentation in ML repos.

Deep Dive

A post on Reddit's r/MachineLearning titled "Why does it seem like open source materials on ML are incomplete?" has struck a nerve. The author, Kalli_animation, details a common frustration: attempts to deeply understand or reproduce ML work often run into repositories missing critical code, training details such as hyperparameters and random seeds, and any record of the authors' reasoning and failed attempts. They contrast this with the exemplary, educational work of figures like Andrej Karpathy, whose projects nanoGPT and llm.c are noted for their clarity and depth.

The post has sparked a broad community debate over root causes. Commenters and the original poster point to several: the field moves so fast that proper documentation never gets written; competitive pressure in both academia (chasing citations) and industry (protecting IP) disincentivizes full transparency; and producing clean, reproducible code with thorough reasoning is significant, largely unrewarded extra effort. The complaint extends beyond code to the missing "narrative" behind models: the trade-offs considered and the dead ends hit during development, which are rarely published.
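As a concrete illustration of the training details the post says are usually missing, here is a minimal sketch of what a reproducible release might record: seed every random number generator used, and publish the exact configuration alongside the weights. All names (`train_stub`, the config keys) are hypothetical stand-ins, not from any repo discussed in the thread.

```python
import json
import random

def train_stub(config: dict) -> dict:
    """Hypothetical training run: seeds its RNG and returns 'weights'
    alongside the exact config that produced them."""
    random.seed(config["seed"])  # deterministic given the recorded seed
    # Stand-in for real training: the 'weights' depend only on the config.
    weights = [random.gauss(0.0, config["init_std"])
               for _ in range(config["n_params"])]
    return {"weights": weights, "config": config}

config = {"seed": 1337, "lr": 3e-4, "init_std": 0.02, "n_params": 4}
run_a = train_stub(config)
# Even a config round-tripped through JSON reproduces the run exactly,
# which is what a released config file should allow.
run_b = train_stub(json.loads(json.dumps(config)))
assert run_a["weights"] == run_b["weights"]
print(json.dumps(config))  # ship this next to the weights
```

In a real PyTorch project this would also mean seeding `torch` and NumPy and logging the full hyperparameter set, but the principle is the same: anyone holding the config and seed can regenerate the run.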

Key Points
  • User criticizes norm of "weights + basic inference code" instead of full reproducible pipelines and training details.
  • Andrej Karpathy's projects (nanoGPT, llm.c) highlighted as positive exceptions with clean, educational code.
  • Community debate cites speed, competition, and lack of incentive as root causes for incomplete open-source ML.

Why It Matters

This hampers scientific progress, slows developer education, and creates a barrier to entry for building on state-of-the-art research.