Image & Video

I built a free local video captioner specifically tuned for LTX-2.3 training —

A free, local tool that uses Gliese-Qwen3.5-9B to generate captions detailed enough for LTX-2.3 to reconstruct the original video.

Deep Dive

A developer known as WildSpeaker7315 has released a free, open-source tool that generates high-quality video captions specifically for training the LTX-2.3 AI model. The core claim is that the captions are descriptive enough that feeding them back into LTX-2.3 should let the model reconstruct the original video, which serves as a practical quality check on the training data. The tool accepts videos, images, or mixed folders for batch processing and outputs captions in a single-paragraph, cinematic prose style formatted for Musubi LoRA training.

Powered by the fine-tuned Gliese-Qwen3.5-9B vision-language model, the application runs 100% locally with no API costs, requiring only an initial model download. It features a sophisticated 'focus injection' system that lets users steer the AI's attention to specific aspects like fabric texture, body motion, or facial details. Notably, the system prompt includes specialized vocabulary for adult content, with rules for describing sheer fabric, garment removal, and explicit motion, while also adapting seamlessly to fully clothed, SFW material. It's designed to run on consumer hardware from the RTX 3000 series and up, with automatic CPU offloading for cards with lower VRAM, and offers a simple Gradio UI launched via a START.bat file.
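The 'focus injection' idea can be sketched as simple prompt assembly: user-selected focus keys append steering lines to the system prompt before the VLM is called. Everything here, including the prompt text and focus keys, is a hypothetical illustration, not the tool's actual implementation:

```python
# Paraphrased placeholder, not the tool's actual system prompt.
BASE_SYSTEM_PROMPT = (
    "Describe the clip as one paragraph of cinematic prose covering "
    "camera, subjects, lighting, motion, and setting."
)

# Hypothetical focus keys mirroring the aspects mentioned in the article.
FOCUS_SNIPPETS = {
    "fabric": "Give extra attention to fabric texture, drape, and sheerness.",
    "motion": "Give extra attention to body motion and its pacing.",
    "face": "Give extra attention to facial details and expression.",
}

def build_system_prompt(focuses=()):
    """Append one steering line per requested focus, so the VLM's
    attention is directed at those aspects of the clip."""
    lines = [BASE_SYSTEM_PROMPT]
    for key in focuses:
        if key not in FOCUS_SNIPPETS:
            raise ValueError(f"unknown focus: {key}")
        lines.append(FOCUS_SNIPPETS[key])
    return "\n".join(lines)
```

Keeping the injected lines separate from the base prompt is what lets the same captioner adapt between SFW and NSFW material: the base description style stays constant while only the steering lines change.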

Key Points
  • Generates captions in Musubi LoRA format accurate enough for LTX-2.3 to potentially reconstruct source videos.
  • Powered by the fine-tuned Gliese-Qwen3.5-9B VLM and runs 100% locally on RTX 3000+ GPUs with no API fees.
  • Includes a focus injection system and specialized vocabulary for precise description of both SFW and NSFW content.

Why It Matters

This democratizes high-quality training data creation for video AI models, removing cost barriers and privacy concerns for researchers and creators.