Research & Papers

BODHI domain prompting lifts LLM spec generation to 96.7% Pass@1

arXiv cs.AI May 26, 2026

⚡ A structured translation guide covering just 15 pattern categories lifted the best LLM Pass@1 on OS kernel specification tasks from 55% to 96.7%, suggesting that in highly specialized domains, carefully curated human knowledge embedded in prompts can beat iterative refinement or automated prompt optimization.

Deep Dive

Formal verification of operating system kernels requires precise specifications for system calls, but writing them by hand demands deep expertise. Researchers Zhiming Chang and Ziyang Li introduced BODHI (domain knowledge prompting), which augments standard few-shot prompts with a structured C-to-Python translation guide organized into 15 categories of domain-specific patterns. Inspired by Structured Chain-of-Thought, the guide is organized by concern, treating pre-condition extraction and post-condition generation separately. The method was evaluated on nine LLMs from six providers, spanning dense, mixture-of-experts, and reasoning architectures.
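The prompt structure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Pattern` type, the two guide entries, the section titles, and the `build_prompt` helper are all hypothetical, and the real guide's 15 categories are not listed in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    category: str  # hypothetical category name, not taken from the paper
    concern: str   # "pre" (pre-condition) or "post" (post-condition)
    c_idiom: str   # C fragment the pattern matches
    py_idiom: str  # Python spec fragment it should become


# Two invented entries standing in for the real 15-category guide.
GUIDE = [
    Pattern("bounds check", "pre",
            "if (fd < 0 || fd >= NR_FDS) return -EBADF;",
            "valid = (fd >= 0) and (fd < NR_FDS)"),
    Pattern("field update", "post",
            "proc->state = PROC_RUNNABLE;",
            "new_state.procs[pid].state = PROC_RUNNABLE"),
]


def render_section(title, patterns):
    """Render one concern of the guide as a numbered plain-text section."""
    lines = [f"## {title}"]
    for i, p in enumerate(patterns, start=1):
        lines.append(f"{i}. [{p.category}] C: {p.c_idiom} -> Python: {p.py_idiom}")
    return "\n".join(lines)


def build_prompt(few_shot_examples, guide, task_source):
    """Assemble guide (split by concern), few-shot examples, then the task."""
    pre = [p for p in guide if p.concern == "pre"]
    post = [p for p in guide if p.concern == "post"]
    parts = [
        "Translate the C system call below into a Python specification.",
        render_section("Pre-condition patterns", pre),
        render_section("Post-condition patterns", post),
        "## Examples",
        "\n\n".join(few_shot_examples),
        "## Task",
        task_source,
    ]
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = build_prompt(
        ["C: <example syscall> / Python: <example spec>"],
        GUIDE,
        "int sys_close(int fd) { /* body omitted */ }",
    )
    print(demo)
```

The point of the sketch is that the guide is static text placed ahead of the usual few-shot examples, so the same prompt can be sent unchanged to any model.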

Results show BODHI improves every tested model, with gains ranging from 11 to 32 percentage points on the OSV-Bench benchmark of 245 specification tasks. The best configuration, Claude Opus 4.6 with BODHI, reached 96.73% Pass@1, compared to the previous best of 55.10%. The technique reduces both syntax and semantic errors, especially on models with sufficient instruction-following capability. These results suggest that domain knowledge injection is a largely model-agnostic technique that substantially narrows the gap between general-purpose code generation and formal specification synthesis.
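To make the headline figures concrete, the short calculation below shows how Pass@1 maps to task counts on a 245-task benchmark. The counts 237 and 135 are not reported in this summary; they are inferred here from the stated percentages, assuming one generated specification per task and that both figures refer to the same 245 tasks.

```python
def pass_at_1(passed, total):
    """Pass@1 as a percentage: share of tasks solved by the single sampled output."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


TOTAL_TASKS = 245          # benchmark size stated in the article
BEST_WITH_GUIDE = 237      # assumption: inferred from 96.73%
PREVIOUS_BEST = 135        # assumption: inferred from 55.10%

best = pass_at_1(BEST_WITH_GUIDE, TOTAL_TASKS)
previous = pass_at_1(PREVIOUS_BEST, TOTAL_TASKS)
gain_pp = best - previous  # difference in percentage points, not percent

if __name__ == "__main__":
    print(f"best: {best:.2f}%  previous: {previous:.2f}%  gap: {gain_pp:.2f} pp")
```

Under these assumptions the gap between the best configuration and the previous best is about 41.6 percentage points. That is a different quantity from the 11 to 32 point range, which describes how much each individual model improves when the guide is added.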

Key Points

A translation guide covering 15 pattern categories boosted the best LLM Pass@1 on OS kernel specs from 55% to 96.7%, suggesting that static human knowledge can outperform iterative refinement and automated prompt optimization in niche code generation tasks.
All nine tested models improved by 11–32 percentage points, showing the method's robustness across model families from Anthropic, Meta, Alibaba, and others.
The manual curation effort limits transferability to other domains, but the underlying principle of embedding deep domain expertise into prompt structure could be replicated for other well-bounded, safety-critical coding tasks.

Why It Matters

BODHI shows that strategic human prompt engineering still beats automated methods for specialized LLM code generation.
