Research & Papers

MO-CAPO optimizes prompts for both performance and cost

New algorithm finds diverse Pareto-optimal prompts while slashing inference costs

Deep Dive

Large language models are highly sensitive to prompt design, yet existing automatic prompt optimization methods focus almost exclusively on performance and ignore real-world constraints such as inference cost and latency. In a new arXiv preprint, researchers from the University of Tübingen and collaborators present MO-CAPO (Multi-Objective Cost-Aware Prompt Optimization), an algorithm that jointly optimizes performance and computational cost. The method uses budget allocation to explore the search space efficiently and introduces a deployment-oriented cost objective that captures the full computational profile of LLM inference, including token usage, latency, and model size.

In experiments across four diverse tasks (classification, reasoning, generation, and summarization) and three LLMs (Llama 3, GPT-4o, and Claude 3.5), MO-CAPO consistently finds strong, robust, and diverse Pareto front approximations. It outperforms an NSGA-II-based multi-objective baseline in 8 of the 12 task-model cases under the noisy R2 metric, often achieving competitive results at considerably lower optimization budgets. MO-CAPO also discovers solution sets spanning performance-cost trade-offs that single-objective optimizers miss entirely, while its top-performing prompts still match or exceed those found by performance-only methods. The paper additionally provides the first rigorous evaluation of multi-objective machine learning experiments that accounts for generalization and robustness via noisy R2 and approximation gap metrics, offering a more realistic assessment of solution quality in applied settings.
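To make the idea of a Pareto front over prompts concrete, here is a minimal sketch of non-dominated filtering on two objectives (accuracy, higher is better; cost, lower is better). It is not MO-CAPO itself, only the standard dominance check that any multi-objective optimizer relies on. The `Candidate` type, the prompt labels, and all numbers are hypothetical illustrations and do not come from the paper.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A prompt with its two objective values (hypothetical example data)."""
    prompt: str
    accuracy: float  # higher is better
    cost: float      # lower is better, e.g. tokens per query


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both objectives and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates, sorted by cost."""
    front = [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]
    return sorted(front, key=lambda c: c.cost)


# Made-up values for illustration only.
candidates = [
    Candidate("short instruction", 0.78, 40.0),
    Candidate("instruction + 2 examples", 0.84, 120.0),
    Candidate("instruction + 5 examples", 0.86, 260.0),
    Candidate("verbose instruction", 0.77, 90.0),
    Candidate("verbose + 5 examples", 0.85, 300.0),
]

front = pareto_front(candidates)
for c in front:
    print(f"{c.prompt}: accuracy={c.accuracy}, cost={c.cost}")
```

In this toy data the two verbose prompts are dropped because a cheaper prompt is at least as accurate. The three remaining prompts form the trade-off curve a practitioner would choose from, whereas a performance-only optimizer would return just the most accurate (and most expensive) of them.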

Key Points
  • MO-CAPO jointly optimizes LLM prompt performance and inference cost using budget-aware search
  • Outperforms an NSGA-II-based baseline in 8 of 12 cases (4 tasks × 3 LLMs: Llama 3, GPT-4o, Claude 3.5) under the noisy R2 metric
  • Provides diverse Pareto front solutions with competitive top performance, enabling practitioners to select cost–accuracy trade-offs

Why It Matters

MO-CAPO makes LLM deployment more practical by letting teams balance prompt accuracy against real-world inference budgets.