SCMAPR: Self-Correcting Multi-Agent Prompt Refinement for Complex-Scenario Text-to-Video Generation
A new framework improves video generation quality by up to 3.28% on standard benchmarks, using specialized agents to rewrite prompts.
A research team led by Chengyi Yang has introduced SCMAPR, a novel framework designed to tackle the persistent challenge of generating accurate videos from complex text prompts. Current text-to-video (T2V) models like Sora, Runway, and Pika often struggle with ambiguous or underspecified descriptions, leading to poor alignment between the prompt and the final video. SCMAPR addresses this by formulating prompt refinement as a multi-stage process coordinated by specialized AI agents.
The framework operates in three key stages. First, a routing agent analyzes the input prompt and classifies it into a specific scenario taxonomy to select the appropriate refinement strategy. Next, a synthesis agent creates scenario-aware rewriting policies and executes a policy-conditioned refinement of the original prompt. Finally, a verification agent performs structured semantic checks and triggers conditional revisions if any logical or descriptive violations are detected, creating a self-correcting loop.
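The three-stage loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the scenario taxonomy, the rewriting policies, and the revision cap are all hypothetical stand-ins for the paper's routing, synthesis, and verification agents.

```python
# Hypothetical sketch of SCMAPR's route -> synthesize -> verify loop.
# All names, policies, and heuristics here are illustrative assumptions;
# the paper's actual agents are LLM-based, not rule-based stubs.
from dataclasses import dataclass, field

@dataclass
class Refinement:
    prompt: str
    scenario: str
    violations: list = field(default_factory=list)

def route(prompt: str) -> str:
    """Routing agent (stub): classify the prompt into a scenario taxonomy."""
    return "multi-event" if ("," in prompt or "then" in prompt) else "single-scene"

def synthesize(prompt: str, scenario: str) -> str:
    """Synthesis agent (stub): apply a scenario-aware rewriting policy."""
    policy = {"multi-event": "order events explicitly",
              "single-scene": "add missing visual detail"}[scenario]
    return f"{prompt} [refined per policy: {policy}]"

def verify(refined: str) -> list:
    """Verification agent (stub): structured semantic checks."""
    return [] if "[refined per policy" in refined else ["no refinement applied"]

def scmapr_refine(prompt: str, max_revisions: int = 3) -> Refinement:
    """Self-correcting loop: revise until checks pass or the budget runs out."""
    scenario = route(prompt)
    refined = synthesize(prompt, scenario)
    for _ in range(max_revisions):
        violations = verify(refined)
        if not violations:
            break
        refined = synthesize(refined, scenario)  # conditional revision
    return Refinement(refined, scenario, verify(refined))

result = scmapr_refine("a cat jumps, then chases a ball")
print(result.scenario)  # "multi-event"
```

In the real system each stub would be an LLM call, and the verification step would return structured violation reports that condition the next revision rather than a simple pass/fail list.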
To rigorously test their system, the researchers also introduced T2V-Complexity, a new benchmark consisting exclusively of challenging, complex-scenario prompts. In extensive experiments, SCMAPR consistently outperformed three state-of-the-art baselines, with gains of up to 2.67% on VBench, up to 3.28% on EvalCrafter, and an absolute improvement of 0.028 on T2V-CompBench. This represents a significant step toward more reliable and controllable generative video AI.
- Uses a multi-agent system with specialized roles for routing, synthesis, and verification to refine prompts.
- Introduced the T2V-Complexity benchmark for evaluating text-to-video models exclusively on complex scenarios.
- Achieved performance gains of up to 3.28% on standard benchmarks over existing state-of-the-art methods.
Why It Matters
Enables more reliable video generation from complex descriptions, reducing the need for expert prompt engineering and trial-and-error.