SPEAR uses Python to auto-optimize prompts, beating GEPA and TextGrad baselines
An agent that writes its own Python analysis to fix prompt errors
A new paper from researchers at multiple institutions (Mengyin Lu, Cong Feng, Huimin Han, et al.) introduces SPEAR (Sandboxed Prompt Engineer with Active Roll-back), a free-form agentic optimizer that treats prompt engineering as a code-as-action loop. Unlike prior automatic prompt engineering (APE) methods that rely on fixed optimization pipelines, SPEAR autonomously decides when and how to use four tools: evaluate, python, set_prompt, and finish. Its standout feature is a Python sandbox in which the agent writes and executes arbitrary code against the current evaluation DataFrame, performing structural error analysis such as confusion matrices, error clustering, and per-group metrics; the agent authors this analysis itself. Two guardrails enforce monotone improvement: automatic rollback whenever the metric regresses, and an optional floor on a guard metric.
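The two guardrails can be illustrated with a small sketch. This is not the paper's implementation: the names `PromptState` and `try_set_prompt`, the stand-in `evaluate` function, the metric names, and all scores below are hypothetical. The sketch also assumes a candidate is rejected only on a strict drop in the primary metric.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class PromptState:
    prompt: str
    metrics: Dict[str, float]


def try_set_prompt(
    current: PromptState,
    candidate: str,
    evaluate: Callable[[str], Dict[str, float]],
    primary: str,
    guard: Optional[str] = None,
    guard_floor: Optional[float] = None,
) -> Tuple[PromptState, bool]:
    """Accept the candidate only if no guardrail fires; otherwise keep the current state."""
    new_metrics = evaluate(candidate)
    # Guardrail 1 (assumed rule): roll back if the primary metric drops.
    regressed = new_metrics[primary] < current.metrics[primary]
    # Guardrail 2: optional floor on a secondary "guard" metric.
    below_floor = (
        guard is not None
        and guard_floor is not None
        and new_metrics[guard] < guard_floor
    )
    if regressed or below_floor:
        return current, False
    return PromptState(candidate, new_metrics), True


# Hypothetical scores standing in for a real evaluation run.
FAKE_SCORES = {
    "v0": {"kappa": 0.40, "recall": 0.90},
    "v1": {"kappa": 0.35, "recall": 0.95},  # primary metric regresses
    "v2": {"kappa": 0.55, "recall": 0.70},  # primary improves, guard falls below floor
    "v3": {"kappa": 0.60, "recall": 0.88},  # passes both guardrails
}

state = PromptState("v0", FAKE_SCORES["v0"])
accepted_log = []
for cand in ["v1", "v2", "v3"]:
    state, accepted = try_set_prompt(
        state, cand, FAKE_SCORES.__getitem__,
        primary="kappa", guard="recall", guard_floor=0.85,
    )
    accepted_log.append(accepted)

print(state.prompt, accepted_log)
```

With these made-up scores, the first two candidates are rolled back and only the third replaces the starting prompt, so the accepted prompt's primary metric never decreases.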
SPEAR is evaluated on three industrial LLM-as-judge suites (13 judge tasks across recruiter-intake, conversational-memory, and query-refinement systems) plus seven BBH tasks and GSM8K. It wins every industrial task on the primary metric: κ=0.857 vs 0.359 on tool-selection, F1-macro 0.815 vs 0.763 on filter-relevance, and κ=0.254 vs 0.218 on the hardest extraction dimension. On BBH-7, SPEAR averages 0.938 accuracy vs GEPA's 0.628 and TextGrad's 0.484. Ablations show the Python tool is the largest single lever: removing it drops κ by ~0.79 on the 5-class tool-selection judge and by ~0.35 on the hardest extraction dimension. The key insight is that long-context LLMs cannot reliably extract class-pair confusion from raw evaluation DataFrames, but code can.
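To make the class-pair confusion point concrete, here is a minimal sketch of the kind of analysis an agent might write in the sandbox. It is an illustration, not code from the paper: it uses plain Python lists instead of a DataFrame so that it stays self-contained, and the function names, class labels, and data are all hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def cohen_kappa(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Cohen's kappa: agreement between gold and predicted labels, corrected for chance."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, pred)) / n
    gold_freq, pred_freq = Counter(gold), Counter(pred)
    # Chance agreement from the product of the marginal label frequencies.
    expected = sum(gold_freq[c] * pred_freq[c] for c in gold_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: a single class on both sides
    return (observed - expected) / (1.0 - expected)


def top_confusions(
    gold: Sequence[str], pred: Sequence[str], k: int = 3
) -> List[Tuple[Tuple[str, str], int]]:
    """Return the k most frequent (gold, predicted) pairs among the errors."""
    errors = Counter((g, p) for g, p in zip(gold, pred) if g != p)
    return errors.most_common(k)


# Hypothetical judge outputs for a small tool-selection evaluation set.
gold = ["search", "search", "search", "filter", "filter", "none"]
pred = ["search", "filter", "filter", "filter", "filter", "none"]

print(round(cohen_kappa(gold, pred), 3))
print(top_confusions(gold, pred))
```

In this toy data the only error pattern is "search" being judged as "filter". A summary like this gives the agent a specific confusion to address in the next prompt revision, instead of having to read every evaluation row.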
- SPEAR uses a Python sandbox tool to write and execute custom error analysis code (confusion matrices, clustering) autonomously
- Achieves κ=0.857 on industrial tool-selection judge tasks vs 0.359 for baselines; 0.938 accuracy on BBH-7 vs 0.628 for GEPA
- Ablation shows that removing the Python tool lowers κ by about 0.79 on the 5-class tool-selection judge, making code-based analysis the largest single lever
Why It Matters
SPEAR lets an agent debug its own prompts with analysis code it writes itself, and the reported gains on industrial LLM-as-judge tasks suggest this can make production prompt pipelines more reliable.