StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives
Researchers' new system creates action-rich image sequences from text prompts without fine-tuning, running on a single RTX 4090.
A research team led by Jinghao Hu has introduced StoryTailor, a zero-shot pipeline that generates action-rich, multi-subject visual narratives from text prompts without model fine-tuning. Accepted to CVPR 2026, the system targets the tension between three goals that earlier approaches struggled to balance: action faithfulness, subject identity fidelity, and cross-frame background continuity. StoryTailor takes a long narrative prompt, per-subject reference images, and grounding boxes as input and produces a temporally coherent image sequence, a step toward more controllable visual storytelling.
The architecture combines three modules. Gaussian-Centered Attention (GCA) dynamically focuses on subject cores to handle overlapping grounding boxes. Action-Boost Singular Value Reweighting (AB-SVR) amplifies action-related directions in the text embeddings. Selective Forgetting Cache (SFC) retains transferable background cues while forgetting nonessential history. StoryTailor runs on a single RTX 4090 with 24GB of VRAM and reports CLIP-T gains of up to 10-15% over baselines. It is also faster at inference than FluxKontext at matched resolution and step count, and it produces expressive character interactions and scenes that evolve while staying visually coherent across the narrative.
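To give a feel for how a Gaussian centered on each subject can resolve overlapping grounding boxes, here is a minimal sketch. It is not the authors' implementation: the function names, the `sigma_scale` parameter, and the "largest weight wins" rule are all assumptions made for illustration. Each box gets a weight map that peaks at its center, and a cell covered by two boxes goes to the subject whose center is closer in box-relative terms.

```python
import math

def gaussian_box_weights(box, width, height, sigma_scale=0.25):
    """Return a height x width grid of weights peaking at the box center.

    box is (x0, y0, x1, y1) in cell coordinates; cells outside the box get 0.
    sigma_scale is a hypothetical knob: Gaussian spread relative to box size.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max((x1 - x0) * sigma_scale, 1e-6)
    sy = max((y1 - y0) * sigma_scale, 1e-6)
    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            if x0 <= x < x1 and y0 <= y < y1:
                # Distance of the cell center from the box center, in sigmas.
                dx, dy = (x + 0.5 - cx) / sx, (y + 0.5 - cy) / sy
                row.append(math.exp(-0.5 * (dx * dx + dy * dy)))
            else:
                row.append(0.0)
        grid.append(row)
    return grid

def assign_overlap(boxes, width, height):
    """For each cell, return the index of the subject with the largest
    Gaussian weight, or -1 if no box covers the cell (illustrative rule)."""
    maps = [gaussian_box_weights(b, width, height) for b in boxes]
    owner = []
    for y in range(height):
        row = []
        for x in range(width):
            best, best_w = -1, 0.0
            for i, m in enumerate(maps):
                if m[y][x] > best_w:
                    best, best_w = i, m[y][x]
            row.append(best)
        owner.append(row)
    return owner

# Two subjects whose boxes overlap on columns 4 and 5.
owner = assign_overlap([(0, 0, 6, 4), (4, 0, 10, 4)], width=10, height=5)
print(owner[2])
```

In a real diffusion pipeline, weights like these would more plausibly modulate attention scores softly rather than assign each cell to one subject; the hard assignment here only keeps the overlap behavior easy to inspect.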
- Zero-shot pipeline runs on a single RTX 4090 (24GB) without fine-tuning
- Improves CLIP-T scores by up to 10-15% over baseline methods
- Three novel modules: GCA for subject focus, AB-SVR for action amplification, SFC for background continuity
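The action-amplification idea behind AB-SVR can be illustrated with a small singular value reweighting sketch. This is a guess at the general technique, not the published method: the function name, the `gain` parameter, and the rule for deciding which singular directions count as action-related are all assumptions. Here a direction is boosted in proportion to how much of its mass falls on the rows of the text-embedding matrix that belong to action tokens.

```python
import numpy as np

def boost_action_directions(token_emb, action_idx, gain=1.5):
    """Reweight the singular values of a (tokens x dim) embedding matrix.

    token_emb:  2D array, one row per prompt token.
    action_idx: list of row indices for action tokens (hypothetical input,
                e.g. the positions of verbs in the prompt).
    gain:       scale applied to a direction that lies entirely on action
                tokens; other directions are scaled between 1 and gain.
    """
    U, S, Vt = np.linalg.svd(token_emb, full_matrices=False)
    # Columns of U have unit norm, so this is the fraction of each singular
    # direction's mass carried by the action-token rows (a value in [0, 1]).
    action_share = (U[action_idx] ** 2).sum(axis=0)
    scale = 1.0 + (gain - 1.0) * action_share
    # Rebuild the embedding matrix with the reweighted singular values.
    return (U * (S * scale)) @ Vt

# Toy example: three tokens with orthogonal embeddings, token 1 is the action.
emb = np.array([[1.0, 0.0, 0.0],
                [0.0, 2.0, 0.0],
                [0.0, 0.0, 3.0]])
boosted = boost_action_directions(emb, action_idx=[1], gain=1.5)
print(np.round(boosted, 3))
```

In the toy example only the action token's row is scaled, because each singular direction lines up with exactly one token. With real text-encoder embeddings the directions mix tokens, so the boost would spread across related tokens as well.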
Why It Matters
StoryTailor makes high-quality visual storytelling more accessible to creators and businesses by removing the need for expensive fine-tuning or massive compute resources.