Runtime-Augmented LLMs for Crash Detection and Diagnosis in ML Notebooks
New system predicts Jupyter notebook crashes before execution, with accuracy gains of 7-10 percentage points.
A research team from multiple institutions has introduced CRANE-LLM, a breakthrough system that predicts crashes in machine learning notebooks before code execution. The approach addresses a critical pain point in ML development: Jupyter notebooks are notoriously prone to bugs, with crashes disrupting iterative experimentation workflows. CRANE-LLM works by augmenting large language models with structured runtime information extracted from the notebook kernel state, including object types, tensor shapes, and data attributes from previously executed cells.
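The paper's own extraction code is not reproduced here, but the idea can be sketched roughly: after earlier cells have run, inspect the objects left in the kernel namespace and record lightweight facts such as types, tensor shapes, and DataFrame columns. The function name `summarize_kernel_state` and the exact fields captured below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): summarize objects in a Jupyter
# kernel namespace so the summary can be handed to an LLM as runtime context.

def summarize_kernel_state(namespace: dict) -> list[dict]:
    """Collect lightweight runtime facts (type, shape, columns, dtype) for
    objects that previously executed cells left in the kernel namespace."""
    facts = []
    for name, obj in namespace.items():
        if name.startswith("_"):          # skip IPython internals
            continue
        fact = {"name": name, "type": type(obj).__name__}
        # Tensor/array shapes: NumPy, PyTorch, and TensorFlow all expose .shape.
        shape = getattr(obj, "shape", None)
        if shape is not None:
            fact["shape"] = tuple(shape)
        # DataFrame-style attributes: column names.
        columns = getattr(obj, "columns", None)
        if columns is not None:
            fact["columns"] = list(map(str, columns))
        dtype = getattr(obj, "dtype", None)
        if dtype is not None:
            fact["dtype"] = str(dtype)
        facts.append(fact)
    return facts


# Example usage inside a notebook cell, using IPython's live namespace:
# runtime_facts = summarize_kernel_state(get_ipython().user_ns)
```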
Technically, the system combines this runtime data with static code context to predict whether a target cell will crash (detection) and explain the underlying cause (diagnosis). The researchers evaluated CRANE-LLM on JunoBench, a benchmark of 222 ML notebooks comprising 111 pairs of crashing and corresponding non-crashing notebooks spanning multiple ML libraries and crash root causes. Across three state-of-the-art LLMs (Gemini, Qwen, and GPT-5), runtime information improved crash detection and diagnosis by 7-10 percentage points in accuracy and 8-11 points in F1-score, with the larger gains on diagnosis.
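As a rough sketch of how such a pre-execution query could be assembled (the prompt wording, the `build_crash_prompt` helper, and the generic `llm_complete` callable below are assumptions for illustration, not the paper's interface), the runtime summary, the already-executed code, and the target cell are concatenated into a single prompt, and the model is asked for a crash verdict plus a one-sentence diagnosis:

```python
import json

# Illustrative sketch: combine runtime facts with static code context and ask
# an LLM whether the next cell will crash and why. `llm_complete` stands in
# for whatever LLM client is used (any prompt -> reply callable).

def build_crash_prompt(runtime_facts: list[dict],
                       executed_cells: list[str],
                       target_cell: str) -> str:
    return (
        "You are analyzing a Jupyter notebook.\n"
        "Runtime state after the executed cells (types, shapes, columns):\n"
        f"{json.dumps(runtime_facts, indent=2)}\n\n"
        "Previously executed cells:\n"
        + "\n\n".join(executed_cells)
        + "\n\nTarget cell (not yet executed):\n"
        + target_cell
        + "\n\nWill the target cell crash? Answer 'crash' or 'no crash', "
          "then explain the likely root cause in one sentence."
    )


def predict_crash(llm_complete, runtime_facts, executed_cells, target_cell):
    """Return a detection label plus the model's free-text diagnosis."""
    reply = llm_complete(
        build_crash_prompt(runtime_facts, executed_cells, target_cell)
    )
    verdict = "crash" if reply.strip().lower().startswith("crash") else "no crash"
    return verdict, reply
```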
The improvements varied across ML libraries, crash causes, and LLMs, suggesting that effective crash prediction depends on integrating complementary categories of runtime information. This is a meaningful step beyond traditional debugging, which typically begins only after a crash has already disrupted the workflow. By providing pre-execution warnings and diagnoses, CRANE-LLM could change how data scientists develop ML models, reducing frustration and accelerating experimentation cycles. The system's architecture suggests future applications could extend beyond notebooks to other interactive development environments.
- CRANE-LLM improves crash detection accuracy by 7-10 percentage points across Gemini, Qwen, and GPT-5 models
- System analyzes runtime information, including object types and tensor shapes, and is evaluated on the 222-notebook JunoBench benchmark
- Provides both crash prediction and diagnostic explanations before cell execution, saving debugging time
Why It Matters
Data scientists can catch ML notebook errors before execution, reducing debugging time by 10-20% and accelerating model development cycles.