New RPT framework boosts LLM reasoning by up to 12.9 points with automated prompt tuning
Researchers automate prompt optimization using LLM function calling and diagnostic memory.
Large language models (LLMs) have grown adept at following instructions, but prompt engineering remains a tedious, error-prone bottleneck. Existing automated prompt optimizers either brute-force search over candidates or use fixed critique-refine loops that miss systematic error patterns. The paper's authors propose Reflective Prompt Tuning (RPT), which turns prompt optimization into an iterative, function-calling workflow. An LLM optimizer calls a diagnostic function that evaluates the target model over the entire optimization set, summarizes recurring failure modes, and returns a structured report. The optimizer then uses that report, plus an accumulated memory of prior reports, to revise the prompt for the next iteration. This memory mechanism allows RPT to learn from past mistakes and make targeted edits grounded in failure history, not just local examples.
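The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `DiagnosticReport`, `diagnose`, and `optimize_prompt` are hypothetical, the target model and the optimizer are passed in as plain callables in place of real LLM calls, and the failure summary is a simple count where the described framework would return a summary of recurring failure modes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, expected answer)


@dataclass
class DiagnosticReport:
    """Hypothetical structured report returned by the diagnostic function."""
    iteration: int
    prompt: str
    accuracy: float
    failure_summary: str


def diagnose(prompt: str,
             target_model: Callable[[str, str], str],
             dataset: List[Example],
             iteration: int) -> DiagnosticReport:
    """Evaluate the prompt over the whole optimization set and summarize failures."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    failures = []
    for question, expected in dataset:
        answer = target_model(prompt, question)
        if answer.strip() != expected.strip():
            failures.append((question, expected, answer))
    accuracy = 1 - len(failures) / len(dataset)
    # Assumption: a real system would ask an LLM to describe recurring
    # failure modes here; this placeholder only counts them.
    summary = f"{len(failures)} of {len(dataset)} examples failed"
    return DiagnosticReport(iteration, prompt, accuracy, summary)


def optimize_prompt(initial_prompt: str,
                    target_model: Callable[[str, str], str],
                    revise: Callable[[str, List[DiagnosticReport]], str],
                    dataset: List[Example],
                    iterations: int = 3) -> Tuple[str, List[DiagnosticReport]]:
    """Iteratively diagnose and revise, keeping every report in memory."""
    memory: List[DiagnosticReport] = []
    prompt = initial_prompt
    for i in range(iterations):
        memory.append(diagnose(prompt, target_model, dataset, i))
        # The optimizer sees the full report history, not just the latest one.
        prompt = revise(prompt, memory)
    # Score the last revision too, then return the best prompt seen.
    memory.append(diagnose(prompt, target_model, dataset, iterations))
    best = max(memory, key=lambda report: report.accuracy)
    return best.prompt, memory
```

In practice, `revise` would wrap an LLM call whose context includes the serialized reports, and `diagnose` would be exposed to that LLM as a callable tool instead of being invoked directly by the loop.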
Across three reasoning benchmarks, RPT improved over initial prompts by up to 12.9 percentage points and remained competitive with state-of-the-art methods. It also introduces confidence-aware optimization: the framework uses calibration signals during both diagnosis and final prompt selection, resulting in better-calibrated outputs. The paper highlights that RPT is particularly effective for multi-hop and mathematical reasoning tasks, where prompt phrasing can make or break chain-of-thought performance. The method requires no parameter updates or external databases, only access to the target model and an LLM optimizer with function-calling capabilities. As LLMs become ubiquitous across enterprise workflows, RPT promises to reduce the manual effort of prompt tuning while improving reliability and interpretability.
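To make the idea of confidence-aware prompt selection concrete, here is one way it could work, sketched under assumptions. The summary does not say which calibration signal RPT uses, so this example swaps in expected calibration error (ECE), a standard metric, and combines it with accuracy through a hypothetical `penalty` weight. The function names and the scoring rule are illustrative only.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted average gap between stated confidence and accuracy per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that confidence 1.0 falls in the last bin.
        idx = max(0, min(int(conf * n_bins), n_bins - 1))
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - acc)
    return ece


def select_prompt(candidates: List[Tuple[str, Sequence[float], Sequence[bool]]],
                  penalty: float = 0.5) -> str:
    """Pick the prompt with the best accuracy after a calibration penalty.

    Assumption: accuracy minus penalty * ECE is an illustrative scoring
    rule, not the one used in the paper.
    """
    def score(candidate: Tuple[str, Sequence[float], Sequence[bool]]) -> float:
        _, confs, correct = candidate
        accuracy = sum(1 for ok in correct if ok) / len(correct)
        return accuracy - penalty * expected_calibration_error(confs, correct)

    return max(candidates, key=score)[0]
```

With a rule like this, two prompts that reach the same accuracy are separated by how closely the model's stated confidence tracks its actual hit rate, so the overconfident prompt loses the tie.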
- RPT uses an LLM optimizer that calls a diagnostic function to analyze failure modes across the entire dataset, not just individual examples.
- It accumulates diagnostic memory across iterations, enabling targeted prompt revisions that improve on initial prompts by up to 12.9 percentage points.
- The framework supports confidence-aware optimization, improving calibration on top of task accuracy.
Why It Matters
RPT automates prompt engineering for complex reasoning tasks, cutting manual tweaking while improving accuracy and calibration.