StoryTailor's zero-shot pipeline generates coherent multi-subject visual narratives

arXiv cs.CV February 26, 2026

⚡Researchers' new system creates action-rich image sequences from text prompts without fine-tuning, running on a single RTX 4090.

Deep Dive

A research team led by Jinghao Hu has introduced StoryTailor, a zero-shot pipeline that generates action-rich, multi-subject visual narratives without fine-tuning the underlying model. The paper has been accepted to CVPR 2026. The system targets a three-way tension that earlier approaches struggled to balance: action faithfulness, subject identity fidelity, and cross-frame background continuity. StoryTailor takes a long narrative prompt, per-subject reference images, and grounding boxes as input and produces a temporally coherent image sequence.

The architecture combines three modules. Gaussian-Centered Attention (GCA) dynamically focuses attention on each subject's core region to handle overlapping grounding boxes. Action-Boost Singular Value Reweighting (AB-SVR) amplifies action-related directions in the text embeddings. Selective Forgetting Cache (SFC) retains transferable background cues while discarding nonessential history. StoryTailor runs on a single RTX 4090 with 24GB of VRAM and is reported to improve CLIP-T scores by 10-15% over baselines. At matched resolution and step count it is also faster at inference than FluxKontext, while producing expressive character interactions and scenes that evolve yet remain visually coherent across the narrative.
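
To make two of these ideas more concrete, here is a minimal NumPy sketch of the general techniques the module names suggest. It is not the authors' implementation. The function names, the `sigma_scale` and `boost` parameters, and the choice to treat the top singular directions as the ones to amplify are all assumptions made for illustration; this article does not say how StoryTailor selects action-related directions or shapes its attention.

```python
import numpy as np


def gaussian_box_weights(height, width, box, sigma_scale=0.5):
    """Weight map that peaks at the center of a grounding box.

    box is (x0, y0, x1, y1) in pixel coordinates.
    sigma_scale is a hypothetical knob: the Gaussian's standard deviation
    expressed as a fraction of the box half-size.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max((x1 - x0) / 2.0 * sigma_scale, 1e-6)
    sy = max((y1 - y0) / 2.0 * sigma_scale, 1e-6)
    ys, xs = np.mgrid[0:height, 0:width]
    weights = np.exp(-0.5 * (((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2))
    # Zero out everything outside the box so only the subject region counts.
    inside = (xs >= x0) & (xs < x1) & (ys >= y0) & (ys < y1)
    return weights * inside


def reweight_singular_values(embeddings, top_k=1, boost=1.5):
    """Scale the top_k singular values of a (tokens x dim) matrix by boost.

    Assumption: the leading singular directions stand in for the
    "action-related" directions; the real selection rule may differ.
    """
    u, s, vt = np.linalg.svd(embeddings, full_matrices=False)
    s = s.copy()
    s[:top_k] *= boost
    # (u * s) scales each column of u by its singular value.
    return (u * s) @ vt


if __name__ == "__main__":
    # Two overlapping boxes: each map is strongest at its own center,
    # so the shared edge region contributes little to either subject.
    a = gaussian_box_weights(16, 16, (0, 0, 10, 10))
    b = gaussian_box_weights(16, 16, (6, 6, 16, 16))
    print(a[5, 5], b[5, 5])

    rng = np.random.default_rng(0)
    text_emb = rng.normal(size=(8, 32))
    boosted = reweight_singular_values(text_emb, top_k=2, boost=1.5)
    print(boosted.shape)
```

A center-peaked weight is one simple way to keep overlapping boxes from competing for the same pixels, and rescaling singular values changes the emphasis of an embedding matrix without changing its shape, so both operations can be dropped into a pipeline without retraining.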

Key Points

Zero-shot pipeline runs on single RTX 4090 (24GB) without fine-tuning
Improves CLIP-T scores by 10-15% over baseline methods
Three novel modules: GCA for subject focus, AB-SVR for action amplification, SFC for background continuity

Why It Matters

Enables accessible, high-quality visual storytelling for creators and businesses without requiring expensive fine-tuning or massive compute resources.
