Image & Video

Open-source pipeline turns one sentence into cinematic reel on single AMD GPU

One English sentence in, finished mp4 with characters, story, music, and voice-over out.

Deep Dive

BladeDevoff has released 'studiomi300', an open-source (Apache 2.0) pipeline that turns a single sentence into a full cinematic reel with characters, story, music, and multi-language narration. The 8-stage pipeline runs sequentially on a single AMD Instinct MI300X GPU with 192 GB of HBM3 and takes about 45 minutes per video. A Director Agent (Qwen3.5-35B-A3B MoE) first plans 6 shots from the input prompt and outputs them as structured JSON. Character keyframes are generated with FLUX.2 [klein] via reference editing (no LoRA). Each shot is animated with Wan2.2-I2V-A14B at 1280×720, 81 frames at 16 fps, with FLF2V (first-last-frame-to-video) used to continue shots marked cut:false. A vision critic (the same Qwen model) auto-detects 10 failure modes (e.g., character drift, object morphing) and triggers targeted re-renders. Music is composed by ACE-Step v1, and narration uses Kokoro-82M in 9 languages. The final mix is done with ffmpeg.
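The plan-then-critique flow described above can be illustrated with a minimal Python sketch. It is not the project's code: the `{"shots": [...]}` JSON layout, the `prompt` and `cut` field names, the `render` and `critique` callables, and the `max_retries` limit are all hypothetical stand-ins. Only the shot count, frame count, and frame rate come from the article.

```python
import json
from dataclasses import dataclass

FRAMES_PER_SHOT = 81  # from the article
FPS = 16              # from the article
NUM_SHOTS = 6         # from the article


@dataclass
class Shot:
    index: int
    prompt: str
    cut: bool  # hypothetical field; False = continue from the previous shot


def parse_plan(raw_json: str) -> list[Shot]:
    """Parse a director plan; the {"shots": [...]} layout is an assumption."""
    data = json.loads(raw_json)
    shots = [
        Shot(index=i, prompt=item["prompt"], cut=bool(item.get("cut", True)))
        for i, item in enumerate(data["shots"])
    ]
    if len(shots) != NUM_SHOTS:
        raise ValueError(f"expected {NUM_SHOTS} shots, got {len(shots)}")
    return shots


def shot_seconds() -> float:
    """Duration of one rendered shot: frames divided by frame rate."""
    return FRAMES_PER_SHOT / FPS


def render_with_critic(shot, render, critique, max_retries=2):
    """Render a shot, re-rendering while the critic reports failures.

    `render(shot, failures)` and `critique(clip)` are hypothetical callables;
    `critique` returns a list of failure labels (empty means the clip passed).
    Returns the last clip and the number of render attempts used.
    """
    failures = []
    for attempt in range(1, max_retries + 2):
        clip = render(shot, failures)  # previous failures guide the re-render
        failures = critique(clip)
        if not failures:
            break
    return clip, attempt
```

With these figures each shot runs 81 / 16 = 5.0625 seconds, so six shots give roughly 30 seconds of footage before any trimming or overlap between continued shots.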

Key optimizations include ParaAttention FBCache (2× speedup on Wan2.2), selective torch.compile on transformer_2, and AITER MoE acceleration for Qwen. The MI300X's large memory lets all models share one card; a 24 GB consumer GPU setup would need 4-5 separate boxes. The pipeline reaches 10.4 minutes per 720p clip after optimizations. Wan-specific prompting tips are documented: avoid the word 'cinematic', use lens/film tags, keep to a single camera verb, and use the exact Chinese negative prompts for best results. The project won the AMD x lablab hackathon and is live on Hugging Face Spaces.
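The Wan prompting tips lend themselves to a small pre-flight check. The sketch below is a hypothetical linter, not part of the project: the camera-verb table is an assumed, non-exhaustive list, and reading the camera-verb tip as "at most one camera verb per prompt" is an interpretation. Only the advice to avoid the word 'cinematic' is taken directly from the article.

```python
import re

BANNED_WORDS = {"cinematic"}  # from the article

# Assumed, non-exhaustive map from word forms to a base camera verb.
CAMERA_VERB_FORMS = {
    "pan": "pan", "pans": "pan", "panning": "pan",
    "tilt": "tilt", "tilts": "tilt", "tilting": "tilt",
    "zoom": "zoom", "zooms": "zoom", "zooming": "zoom",
    "dolly": "dolly", "dollies": "dolly", "dollying": "dolly",
    "orbit": "orbit", "orbits": "orbit", "orbiting": "orbit",
}


def lint_prompt(prompt: str) -> list[str]:
    """Return a list of human-readable problems found in a shot prompt."""
    words = re.findall(r"[a-z]+", prompt.lower())
    problems = [
        f"avoid the word '{w}'" for w in sorted(BANNED_WORDS.intersection(words))
    ]
    # Collect the distinct base camera verbs mentioned in the prompt.
    verbs = sorted({CAMERA_VERB_FORMS[w] for w in words if w in CAMERA_VERB_FORMS})
    if len(verbs) > 1:
        problems.append("multiple camera verbs: " + ", ".join(verbs))
    return problems
```

For example, `lint_prompt("35mm lens, film grain, slow dolly in on the pilot")` returns an empty list, while `lint_prompt("Cinematic shot, camera pans left then zooms in")` reports both the banned word and the two camera verbs.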

Key Points
  • End-to-end pipeline runs on one AMD MI300X (192 GB) in ~45 minutes per reel, using FLUX.2, Wan2.2-I2V, Qwen3.5-35B, ACE-Step, and Kokoro.
  • Outputs 1280×720 video at 16 fps with auto-retry on 10 failure modes like character drift and object morphing, plus 9-language narration.
  • All models are Apache 2.0 or MIT; code is public on GitHub, with a live demo on Hugging Face Spaces.

Why It Matters

Lowers the barrier to cinematic video creation: a fully open-source pipeline takes one sentence to a finished reel on a single GPU.