Research & Papers

BODHI domain prompting lifts LLM spec generation to 96.7% Pass@1

A structured translation guide of just 15 patterns raised the best reported Pass@1 on OS kernel specs from 55% to 96.7%, suggesting that in highly specialized domains, carefully curated human knowledge embedded in prompts can beat iterative refinement or automated prompt optimization.

Deep Dive

Generating formal specifications for low-level systems code has long been a bottleneck in formal verification. OS kernels, with their intricate C syntax and safety-critical nature, are notoriously difficult for large language models to handle. The previous state of the art reached only 55.10% Pass@1 on the OSV-Bench benchmark, too low for practical use. BODHI, a domain knowledge prompting method, goes well beyond that by adding a structured C-to-Python translation guide covering 15 semantic patterns to the prompt. Tested across nine models from major providers, including Anthropic's Claude Opus 4.6, Mistral, DeepSeek, Meta's Llama, and Alibaba's Qwen, BODHI pushed Pass@1 as high as 96.73%, the score for Claude Opus 4.6. Every model improved by at least 11 percentage points, with some gaining as much as 32 points. The results suggest that for narrow, high-stakes programming tasks, the limiting factor may be less the model's raw capability than how the input is structured.

The result stands out against the alternative approaches. KernelGPT relies on iterative refinement, generating candidate specifications and correcting them using feedback from formal verification tools. While effective, it incurs the cost of multiple verification loops and requires tight integration with external checkers. Stanford's DSPy framework automates prompt optimization through compilation and search, but its model-agnostic nature may miss domain-specific nuances that manual engineering captures. Amazon's CodeWhisperer, a productized tool, can be tuned with domain examples but has not reported comparable gains on this benchmark. BODHI takes a simpler path: instead of adding more computation or automated search, it invests in deep human understanding of the domain, mapping the 15 most important C-to-Python translation patterns that LLMs frequently get wrong. This static guide, embedded in the prompt, gives the model explicit translation rules to follow, and even smaller models benefit from it. The implication is that for tasks where correctness is paramount and the domain is well-bounded, a small amount of high-quality human engineering can be worth more than a large increase in compute.
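To make the idea of a static guide embedded in the prompt concrete, here is a minimal sketch of how such a prompt might be assembled. It is an illustration under stated assumptions, not BODHI's actual implementation: the `Pattern` class, the `render_guide` and `build_prompt` helpers, the instruction wording, and both example patterns are hypothetical, since the 15 real patterns are not listed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """One entry in a translation guide (fields are hypothetical)."""
    name: str
    c_idiom: str
    python_form: str


# Hypothetical placeholder entries; these are NOT BODHI's real patterns.
EXAMPLE_GUIDE = [
    Pattern("null pointer check", "if (p == NULL)", "if p is None:"),
    Pattern("unsigned 32-bit wraparound", "x + 1  /* uint32_t */",
            "(x + 1) & 0xFFFFFFFF"),
]


def render_guide(patterns):
    """Render the patterns as a numbered, fixed block of text."""
    lines = ["C-to-Python translation guide:"]
    for i, p in enumerate(patterns, start=1):
        lines.append(f"{i}. {p.name}: `{p.c_idiom}` -> `{p.python_form}`")
    return "\n".join(lines)


def build_prompt(c_source, patterns):
    """Combine the task instruction, the static guide, and the C code.

    The guide text is identical for every task; only the C source changes.
    """
    return "\n\n".join([
        "Write a Python specification for the C function below.",
        render_guide(patterns),
        "C source:\n" + c_source.strip(),
    ])


prompt = build_prompt(
    "int f(int *p) { if (p == NULL) return -1; return 0; }",
    EXAMPLE_GUIDE,
)
print(prompt)
```

The design point is that the guide is written once by a domain expert and reused verbatim, so each task costs a single model call rather than a loop of generation and checker feedback.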

The broader significance of BODHI extends beyond OS kernel specifications. It points to a pattern: as LLMs become commoditized, the competitive advantage shifts from model size to the quality of the intellectual scaffolding we wrap around them. This has direct implications for commercial code generation tools like GitHub Copilot and Amazon CodeWhisperer, which could integrate similar domain guides for safety-critical codebases (e.g., automotive software, medical devices) and potentially gain substantial accuracy without retraining. However, the approach has real limits. The 15 patterns were manually curated specifically for C-to-Python translation in kernel code; transferring this to other languages (Rust, Ada) or other domains (networking protocols, firmware) would require significant expert effort. The OSV-Bench benchmark includes only 245 tasks, which may not capture the full complexity of real-world kernel code. Moreover, the Pass@1 score, while impressive, does not account for silent failures that could arise from overfitting to the provided patterns. The method's brittleness to kernel version changes or unusual C syntax also remains an open question: what happens when a new memory model or compiler extension breaks the mapping rules?
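The benchmark-size caveat can be made concrete with a little arithmetic. The sketch below assumes Pass@1 is simply the share of the 245 tasks whose single generated specification passes. The task counts of 135 and 237 are back-calculated from the reported 55.10% and 96.73% and are not figures reported by the authors; the `pass_at_1` helper is hypothetical.

```python
def pass_at_1(passed, total):
    """Percentage of tasks whose single generated spec passes."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


TOTAL_TASKS = 245  # benchmark size stated in the text

# Assumption: counts inferred from the reported percentages, not reported.
baseline_passed = 135  # consistent with the previous 55.10%
bodhi_passed = 237     # consistent with the best BODHI result, 96.73%

baseline = pass_at_1(baseline_passed, TOTAL_TASKS)
bodhi = pass_at_1(bodhi_passed, TOTAL_TASKS)
per_task = pass_at_1(1, TOTAL_TASKS)

print(f"previous best: {baseline:.2f}%")
print(f"best with guide: {bodhi:.2f}%")
print(f"tasks still failing: {TOTAL_TASKS - bodhi_passed}")
print(f"one task is worth {per_task:.2f} percentage points")
```

Under these assumptions, the top score corresponds to only 8 failing tasks, and each task moves the score by about 0.41 points, so differences of a point or two between models rest on very few tasks.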

The bottom line is this: BODHI reopens a debate the field thought it had settled. In an era obsessed with scaling laws, reinforcement learning from human feedback, and automated prompt optimization, it demonstrates that careful, manual, domain-specific prompt engineering can yield outsized returns. For practitioners, the takeaway is clear: before investing in fine-tuning or complex verification pipelines, study the failure modes of your LLM and encode that knowledge into the prompt structure. For researchers, it raises a more provocative question: are we underinvesting in the human side of AI systems? BODHI suggests that the highest-leverage work in AI-assisted software engineering may not be building bigger models but building better teachers for the ones we already have.

Key Points
  • A 15-pattern domain translation guide raised the best reported Pass@1 on OS kernel specs from 55% to 96.7%, suggesting that static human knowledge can outperform iterative refinement and automated prompt optimization in niche code generation tasks.
  • All nine tested models improved by 11–32 percentage points, showing the method's robustness across model families from Anthropic, Meta, Alibaba, and others.
  • The manual curation effort limits transferability to other domains, but the underlying principle of embedding deep domain expertise into prompt structure should carry over to other well-bounded, safety-critical coding tasks, given comparable expert effort.

Why It Matters

BODHI shows that strategic human prompt engineering can still beat automated methods for specialized LLM code generation, at least on a narrow, well-bounded benchmark.