PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation
A 7B-parameter model beats GPT-4o and 100x larger models by refining prompts for physical realism.
A research team led by Shang Wu has introduced PhyPrompt, a novel reinforcement learning framework that addresses a critical flaw in current text-to-video (T2V) generators: their tendency to violate basic physical laws despite producing high-quality visuals. The researchers discovered that the problem stems not from model limitations but from insufficient physical constraints in user prompts. Their solution is a two-stage approach: they first fine-tune a large language model on physics-focused Chain-of-Thought datasets so it internalizes principles like object motion and force interactions, then apply Group Relative Policy Optimization (GRPO) with a dynamic reward curriculum that initially prioritizes semantic fidelity before shifting toward physical commonsense.
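The two training ingredients above, a reward curriculum that shifts from semantic fidelity toward physical commonsense and GRPO's group-relative baselining, can be sketched as follows. This is a minimal illustration under our own assumptions: the linear weight schedule, the floor value, and all function names are hypothetical, not the authors' implementation.

```python
# Hypothetical sketch of a dynamic reward curriculum plus GRPO-style
# group-relative advantages. Schedule shape, floor, and names are
# illustrative assumptions, not PhyPrompt's actual code.

def curriculum_weight(step, total_steps, floor=0.2):
    """Share of the reward given to semantic fidelity.

    Decays linearly with training progress so physical commonsense
    dominates later, but never drops below `floor`.
    """
    progress = step / total_steps
    return max(floor, 1.0 - progress)

def blended_reward(semantic_score, physics_score, step, total_steps):
    """Combine the two reward signals under the current curriculum weight."""
    w = curriculum_weight(step, total_steps)
    return w * semantic_score + (1.0 - w) * physics_score

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO's key idea: score each sampled prompt refinement against the
    mean/std of its own sampling group instead of a learned value model."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four refinements of one prompt, scored mid-training.
rewards = [blended_reward(s, p, step=500, total_steps=1000)
           for s, p in [(0.9, 0.3), (0.7, 0.8), (0.5, 0.9), (0.8, 0.6)]]
advantages = group_relative_advantages(rewards)
```

Because the advantages are normalized within each group, they sum to (approximately) zero, so above-group-average refinements are reinforced and below-average ones are suppressed without training a separate critic.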
The technical breakthrough lies in PhyPrompt's compositional curriculum, which achieves synergistic optimization rather than the conventional trade-off between objectives. The 7B-parameter model reaches 40.8% joint success on the VideoPhy2 benchmark (an 8.6 percentage point gain), improving physical commonsense by 11 points while simultaneously increasing semantic adherence by 4.4 points. Remarkably, this specialized approach outperforms both GPT-4o (+3.8%) and the massive DeepSeek-V3 (+2.2%) despite being 100 times smaller. The framework transfers zero-shot across diverse T2V architectures including LaVie, VideoCrafter2, and CogVideoX-5B, demonstrating up to 16.8% improvement and establishing that domain-specialized reinforcement learning with smart curricula surpasses brute-force scaling for physics-aware generation.
- PhyPrompt uses Group Relative Policy Optimization with a dynamic curriculum to refine prompts, achieving 40.8% joint success on the VideoPhy2 benchmark
- The 7B-parameter model beats GPT-4o by 3.8% and the 100x larger DeepSeek-V3 by 2.2% on physics-aware generation
- Zero-shot transfer works across multiple T2V architectures (LaVie, VideoCrafter2, CogVideoX-5B) with up to 16.8% improvement
Why It Matters
Enables physically realistic AI video generation without expert prompt engineering, making professional-grade content creation accessible.