Video TokenCom: Textual Intent-Guided Multi-Rate Video Token Communications with UEP-Based Adaptive Source-Channel Coding
New AI framework prioritizes bandwidth for user-described video elements, achieving superior quality with 50% less data.
A research team led by Jingxuan Men has introduced Video TokenCom, a groundbreaking framework that fundamentally rethinks video compression by integrating user intent directly into the encoding process. Motivated by the rise of Large AI Models (LAMs), the system treats video as discrete tokens—unified units for communication and computation. The core innovation is its use of a text prompt from the user (like 'the soccer player kicking the ball') to guide the AI in identifying which parts of the video are semantically critical. This intent is then fused with optical-flow analysis to track these important elements across space and time, creating a semantic importance map for the entire video stream.
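The fusion step described above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: it assumes per-token text-relevance scores (e.g., from a CLIP-style image-text encoder) and per-token optical-flow magnitudes are already available, and the weight `alpha` is a hypothetical hyperparameter.

```python
import numpy as np

def semantic_importance_map(text_sim, flow_mag, alpha=0.7):
    """Fuse textual-intent relevance with motion into per-token importance.

    text_sim: per-token similarity to the user's prompt in [0, 1]
              (assumed to come from an image-text encoder).
    flow_mag: per-token mean optical-flow magnitude, used to track the
              prompted elements across frames.
    alpha:    illustrative weight on textual intent vs. motion.
    """
    # Normalize flow magnitudes to [0, 1] so the two cues are comparable.
    flow_norm = (flow_mag - flow_mag.min()) / (np.ptp(flow_mag) + 1e-8)
    return alpha * text_sim + (1 - alpha) * flow_norm

# Toy example: 6 tokens; the first two match the prompt and move fast.
text_sim = np.array([0.9, 0.8, 0.1, 0.1, 0.2, 0.1])
flow_mag = np.array([5.0, 4.0, 0.5, 0.2, 0.3, 0.1])
imp = semantic_importance_map(text_sim, flow_mag)
```

Tokens that both match the prompt and carry motion score highest, which is the behavior the importance map needs for downstream bit allocation.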
Technically, Video TokenCom implements a semantic-aware, multi-rate bit allocation strategy combined with Unequal Error Protection (UEP)-based source-channel coding. Tokens deemed highly relevant to the user's textual intent are encoded at full precision, while less important background tokens are compressed with a reduced-precision, differential encoding scheme. The adaptive approach dynamically allocates bits and applies stronger error correction to vital semantic data based on the available bandwidth and channel conditions (SNR). Experiments show it consistently outperforms both conventional codecs and other semantic communication baselines in perceptual and semantic quality across a wide range of signal-to-noise ratios, demonstrating a more efficient path for goal-oriented video transmission in bandwidth-constrained wireless networks.
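The multi-rate allocation with UEP can be illustrated with a greedy sketch. All numbers here (8-bit vs. 4-bit quantization, FEC code rates 1/2 and 3/4, the bit budget) are assumed for illustration and are not taken from the paper; the real system adapts these to the measured SNR.

```python
def allocate(importances, bit_budget, hi_bits=8, lo_bits=4,
             hi_fec_rate=0.5, lo_fec_rate=0.75):
    """Greedy multi-rate allocation sketch with UEP.

    Tokens are ranked by semantic importance. The most important ones
    receive full-precision coding plus a stronger (lower-rate) FEC code
    until the bit budget for the high tier is exhausted; every remaining
    token is still sent, but with cheap reduced-precision differential
    coding and lighter protection. Parameter values are illustrative.
    """
    order = sorted(range(len(importances)), key=lambda i: -importances[i])
    hi_cost = hi_bits / hi_fec_rate  # channel bits incl. FEC redundancy
    lo_cost = lo_bits / lo_fec_rate
    plan, spent = {}, 0.0
    for i in order:
        if spent + hi_cost <= bit_budget:
            plan[i] = ("full", hi_bits, hi_fec_rate)
            spent += hi_cost
        else:
            plan[i] = ("diff", lo_bits, lo_fec_rate)
            spent += lo_cost
    return plan, spent

# Usage: importance scores for 6 tokens, budget limiting the high tier.
imp = [0.93, 0.80, 0.09, 0.06, 0.15, 0.03]
plan, spent = allocate(imp, bit_budget=40)
```

With these toy numbers, only the two prompt-relevant tokens fit in the high-protection tier; the background tokens fall back to the low-rate differential scheme, which is the essence of semantic UEP.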
- Uses textual user prompts to identify and prioritize semantically critical video elements for encoding.
- Implements Unequal Error Protection (UEP) to dynamically allocate bandwidth and error correction based on semantic importance.
- Outperforms conventional and semantic baselines, achieving higher quality with the same or less data across varying network conditions.
Why It Matters
Enables high-quality video calls and streaming in low-bandwidth environments by intelligently preserving only what the user cares about.