Research & Papers

Meta's ReElicit tunes AI prompts using only aggregate scores, beats baselines

No per-example labels needed: ReElicit optimizes system prompts on a budget of just 30 evaluations, tested across 10 tasks.

Deep Dive

Tuning system prompts for large language models is notoriously difficult when feedback is only available as aggregate metrics, such as overall user satisfaction or click-through rates, rather than per-example labels, errors, or critiques. Meta researchers (Zhiyuan Jerry Lin, Benjamin Letham, Samuel Dooley, Maximilian Balandat, Eytan Bakshy) tackle this sample-constrained black-box optimization problem with ReElicit, a Bayesian optimization framework built on a novel technique called embedding by elicitation. Given a task description, a handful of previously tested prompts, and their scalar scores, ReElicit asks an LLM to propose a compact, interpretable feature space (e.g., dimensions like 'specificity' or 'politeness') and then maps each prompt into that space. A Gaussian process surrogate models the relationship between feature vectors and scores, and an acquisition function selects promising new feature targets. The LLM then generates and refines an actual system prompt that matches those target features.
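To make the surrogate-and-acquisition step concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the RBF kernel, the expected-improvement acquisition, the fixed candidate grid, and every name and number below (`propose_target`, the two feature dimensions, the scores) are assumptions chosen for illustration. It assumes each prompt has already been mapped to a feature vector in [0, 1].

```python
import math

def rbf(a, b, length_scale=0.5):
    """Squared-exponential kernel between two feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def gp_posterior(train_x, train_y, query, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at `query`."""
    gram = [[rbf(a, b) + (noise if i == j else 0.0)
             for j, b in enumerate(train_x)] for i, a in enumerate(train_x)]
    k_star = [rbf(a, query) for a in train_x]
    alpha = solve(gram, train_y)
    v = solve(gram, k_star)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(query, query) - sum(k * w for k, w in zip(k_star, v))
    return mean, math.sqrt(max(var, 1e-12))

def expected_improvement(mean, std, best):
    """Expected improvement over `best` when maximizing the score."""
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (mean - best) * cdf + std * pdf

def propose_target(features, scores, candidates):
    """Pick the candidate feature vector with the highest expected improvement."""
    offset = sum(scores) / len(scores)        # center scores for the zero-mean GP
    centered = [s - offset for s in scores]
    best = max(centered)

    def ei(candidate):
        mean, std = gp_posterior(features, centered, candidate)
        return expected_improvement(mean, std, best)

    return max(candidates, key=ei)

# Hypothetical data: (specificity, politeness) vectors and aggregate scores.
features = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.3]]
scores = [0.55, 0.62, 0.71]
candidates = [[i / 4, j / 4] for i in range(5) for j in range(5)]
target = propose_target(features, scores, candidates)
print("next feature target:", target)
```

The returned vector is a target in feature space, not a prompt. In the workflow described above, an LLM would then be asked to write a system prompt whose features match that target.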

What makes ReElicit particularly clever is its dynamism: as new prompt-score pairs arrive, the feature space is re-elicited from the LLM, allowing the representation to adapt to the observed history. This means the optimizer can discover new semantic dimensions that turn out to matter for the specific task. In controlled experiments using offline benchmark accuracy as a proxy for aggregate feedback (one scalar per prompt, no per-example labels), ReElicit was tested on ten system prompt optimization tasks with a tight budget of just 30 total evaluations. It achieved the strongest aggregate performance profile among the representative aggregate-only prompt optimization baselines it was compared against. The work suggests that LLMs can serve not just as prompt generators but as adaptive semantic representation builders for Bayesian optimization over natural-language artifacts.
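The re-elicitation loop can be sketched as follows. This is an illustrative skeleton under stated assumptions, not the paper's code: `optimize_prompt` and all of its callable parameters are hypothetical names, and the `toy_*` functions are deterministic stand-ins for the LLM calls, the surrogate model, and the aggregate metric, so that the loop runs without any external service.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]  # (prompt, aggregate score) pairs

def optimize_prompt(
    task: str,
    seed_prompts: List[str],
    evaluate: Callable[[str], float],
    elicit_features: Callable[[str, History], List[str]],
    embed: Callable[[str, List[str]], List[float]],
    propose_target: Callable[[List[List[float]], List[float]], List[float]],
    realize: Callable[[List[float], List[str], History], str],
    budget: int = 30,
) -> Tuple[str, float]:
    """Run a re-eliciting optimization loop; return the best (prompt, score)."""
    history: History = [(p, evaluate(p)) for p in seed_prompts[:budget]]
    while len(history) < budget:
        # Re-elicit the feature space from the full history on every round,
        # so the representation can change as evidence accumulates.
        names = elicit_features(task, history)
        vectors = [embed(prompt, names) for prompt, _ in history]
        scores = [score for _, score in history]
        target = propose_target(vectors, scores)
        candidate = realize(target, names, history)
        history.append((candidate, evaluate(candidate)))
    return max(history, key=lambda pair: pair[1])

# --- Toy stand-ins (all hypothetical) so the loop runs without an LLM ---
calls: List[str] = []  # records every prompt that consumed one evaluation

def toy_evaluate(prompt: str) -> float:
    calls.append(prompt)
    return min(len(prompt) / 100, 1.0) + 0.5 * ("please" in prompt.lower())

def toy_elicit(task: str, history: History) -> List[str]:
    # A real system would query an LLM; here a second dimension
    # is "discovered" once five observations exist.
    return ["length"] if len(history) < 5 else ["length", "politeness"]

def toy_embed(prompt: str, names: List[str]) -> List[float]:
    values = {"length": min(len(prompt) / 100, 1.0),
              "politeness": float("please" in prompt.lower())}
    return [values[name] for name in names]

def toy_propose(vectors: List[List[float]], scores: List[float]) -> List[float]:
    # Stand-in for surrogate + acquisition: nudge the best vector upward.
    best = vectors[scores.index(max(scores))]
    return [min(v + 0.1, 1.0) for v in best]

def toy_realize(target: List[float], names: List[str], history: History) -> str:
    goal = dict(zip(names, target))
    opener = "Please " if goal.get("politeness", 0.0) > 0.5 else ""
    length = max(1, round(goal["length"] * 100))
    return (opener + "answer. " * 20)[:length]

best_prompt, best_score = optimize_prompt(
    task="toy task",
    seed_prompts=["Answer.", "Please answer the question briefly."],
    evaluate=toy_evaluate,
    elicit_features=toy_elicit,
    embed=toy_embed,
    propose_target=toy_propose,
    realize=toy_realize,
)
print(len(calls), "evaluations; best score:", best_score)
```

The design point is that `elicit_features` sits inside the loop rather than before it: all earlier prompts are re-embedded under the current feature names each round, so a dimension that only becomes visible after several observations can still shape later proposals.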

Key Points
  • ReElicit uses an LLM to elicit a compact, interpretable feature space from the task description and previously evaluated prompts, then maps prompts into that space so Bayesian optimization can search it.
  • The feature space is dynamically re-elicited as new evaluations arrive, allowing the representation to adapt to observed prompt-score history.
  • With a budget of only 30 total evaluations, ReElicit achieved the strongest aggregate performance profile among representative aggregate-only prompt optimization baselines on 10 tasks in controlled benchmarks.

Why It Matters

Practical for production systems where only aggregate metrics are available: ReElicit enables efficient prompt tuning without costly per-example labeling.