Amazon's Promptimus tops 16 of 20 benchmarks for automatic LLM prompt optimization
Four-step iteration loop improves well-crafted prompts without manual engineering
Amazon has unveiled Promptimus, a fully automated framework for optimizing LLM prompts that are already well developed. Unlike methods that generate prompts from scratch, Promptimus targets existing prompts that encode complex business logic, regulatory requirements, and domain expertise. It runs a four-step iteration loop: (1) evaluation of the current prompt's performance; (2) feedback generation, in which a metric-analyzer AI agent identifies failure points; (3) strategy and edit generation, in which a debugging helper agent pinpoints root causes and proposes changes; and (4) evaluation of the resulting candidate prompts. For large, carefully structured prompts, Promptimus offers an *edit mode* that makes surgical modifications rather than rewriting the entire prompt, preserving what works while fixing only what is broken.
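The four-step loop described above can be sketched in a few lines of Python. This is an illustration only, not Promptimus's actual API: the function `optimize_prompt` and its callable parameters (`run_model`, `metric`, `analyze_failures`, `propose_candidates`) are hypothetical names standing in for the LLM call, the scoring function, the metric-analyzer agent, and the debugging helper agent. It also assumes a non-empty example set and a metric where 1.0 means a fully correct output.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]            # (input text, expected output)
Failure = Tuple[str, str, str]       # (input text, expected output, actual output)


def optimize_prompt(
    prompt: str,
    examples: List[Example],
    run_model: Callable[[str, str], str],                 # hypothetical: (prompt, input) -> output
    metric: Callable[[str, str], float],                  # hypothetical: (output, expected) -> score in [0, 1]
    analyze_failures: Callable[[str, List[Failure]], str],  # stand-in for the metric-analyzer agent
    propose_candidates: Callable[[str, str], List[str]],    # stand-in for the debugging helper agent
    max_iterations: int = 5,
) -> Tuple[str, float]:
    """Return the best prompt found and its mean score on `examples`."""

    def mean_score(candidate: str) -> float:
        return sum(metric(run_model(candidate, x), y) for x, y in examples) / len(examples)

    best_prompt, best_score = prompt, mean_score(prompt)
    for _ in range(max_iterations):
        # Step 1: evaluate the current best prompt and collect failing cases.
        failures: List[Failure] = []
        for x, y in examples:
            out = run_model(best_prompt, x)
            if metric(out, y) < 1.0:
                failures.append((x, y, out))
        if not failures:
            break  # nothing left to fix on this example set

        # Step 2: turn the failures into textual feedback.
        feedback = analyze_failures(best_prompt, failures)

        # Step 3: generate candidate prompts that address the feedback.
        candidates = propose_candidates(best_prompt, feedback)

        # Step 4: evaluate candidates and keep one only if it improves the score.
        for candidate in candidates:
            score = mean_score(candidate)
            if score > best_score:
                best_prompt, best_score = candidate, score
    return best_prompt, best_score
```

Keeping a candidate only when it strictly improves the score is a design choice made for this sketch; the source does not say how Promptimus selects among candidates.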
Promptimus achieved the top result on 16 of 20 enterprise benchmarks in a comparison against six leading automatic prompt optimization methods. It is reported to be sample-efficient and model-agnostic, generalizing across various LLMs, including Amazon Nova. The framework supports textual and multimodal tasks such as classification, extraction, summarization, code generation, and tool use. Users define performance criteria as Python metric functions, and a code sanitization AI agent automatically generates debugging checkpoints. This approach is especially valuable in regulated industries such as healthcare and finance, where domain requirements like HIPAA or risk tolerance rules must be preserved while model performance is continuously improved.
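As an illustration of what a Python metric function for an extraction task might look like, here is a minimal sketch. The source does not specify the signature Promptimus expects, so the name `field_accuracy` and the `(prediction, reference) -> float` shape are assumptions made for this example.

```python
import json


def field_accuracy(prediction: str, reference: dict) -> float:
    """Hypothetical metric: fraction of reference fields the model extracted correctly.

    `prediction` is the raw model output, assumed to be a JSON object;
    `reference` maps field names to their expected values.
    """
    try:
        parsed = json.loads(prediction)
    except json.JSONDecodeError:
        return 0.0  # unparseable output counts as a complete failure
    if not isinstance(parsed, dict) or not reference:
        return 0.0
    correct = sum(1 for key, value in reference.items() if parsed.get(key) == value)
    return correct / len(reference)
```

A graded score like this, rather than a pass/fail flag, gives an optimizer more signal about which fields a prompt change helped or hurt.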
- Promptimus uses a metric-analyzer AI agent to identify failure points and a debugging helper agent to suggest targeted fixes, instead of random exploration.
- Promptimus achieved the top result on 16 of 20 benchmarks against six leading automatic prompt optimization methods, with model-agnostic generalizability across LLMs.
- Edit mode makes surgical modifications to complex, structured prompts without rewriting them, preserving existing business logic and compliance rules.
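The idea behind edit mode can be sketched as applying small find-and-replace edits while refusing any edit that would remove protected text. This is a minimal sketch under stated assumptions, not Promptimus's implementation: the `Edit` class, the `apply_edits` function, and the notion of a `protected` list of compliance clauses are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Edit:
    old: str  # exact text to find in the prompt
    new: str  # replacement text


def apply_edits(prompt: str, edits: List[Edit], protected: List[str]) -> str:
    """Apply targeted edits one by one, leaving the rest of the prompt untouched.

    An edit is skipped if its target does not occur exactly once (ambiguous or
    stale), or if applying it would remove any protected clause.
    """
    result = prompt
    for edit in edits:
        if result.count(edit.old) != 1:
            continue
        candidate = result.replace(edit.old, edit.new)
        if all(clause in candidate for clause in protected):
            result = candidate
    return result


# Usage: the first edit is applied, the second is rejected because it would
# delete a protected compliance rule.
prompt = (
    "Never reveal patient identifiers.\n"
    "Classify the note as urgent or routine."
)
edits = [
    Edit("urgent or routine", "urgent, routine, or unclear"),
    Edit("Never reveal patient identifiers.", ""),
]
updated = apply_edits(prompt, edits, protected=["Never reveal patient identifiers."])
```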
Why It Matters
Promptimus could save enterprises weeks of manual prompt engineering while preserving critical regulatory and domain-specific logic.