Research & Papers

OmniMem framework boosts Dynamic Degree in long-video generation by 52.3%

New memory retrieval method solves the 'local bias' and 'union explosion' problems in autoregressive (AR) video generation.

Deep Dive

Autoregressive (AR) video generation builds long videos chunk by chunk, but the historical KV cache grows with every chunk and soon becomes prohibitively expensive to maintain. Existing approaches either truncate the cache or compress it into implicit memory, and both lose explicit access to query-relevant details. The new OmniMem framework instead performs sparse KV retrieval over the entire history. It introduces three techniques: Adaptive Window Exclusion removes local-window blocks from the selection candidates when enough long-range history exists; Query-Shared KV Selection reduces cross-query diversity in the selected blocks to lower computational overhead; and Per-Head Scattered KV Access lets each attention head retrieve its own non-contiguous blocks without expanding them into a large buffer. Together, these techniques keep memory usage efficient while preserving explicit access to the full history.
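To make the selection step concrete, here is a minimal NumPy sketch of block-level sparse retrieval with window exclusion and a query-shared selection. It is an illustration under stated assumptions, not OmniMem's actual implementation: the function name `select_blocks`, the mean-pooled block keys, the mean-pooled shared query, and the rule "exclude local blocks only if at least `top_k` long-range blocks remain" are all hypothetical choices made for this example.

```python
import numpy as np

def select_blocks(q, k_blocks, top_k, window):
    """Pick up to top_k history blocks per head for one chunk of queries.

    q:        (heads, n_queries, dim)  queries of the current chunk
    k_blocks: (heads, n_blocks, dim)   one pooled key per history block
    window:   number of most recent blocks already covered by local attention
    Returns sorted block indices of shape (heads, min(top_k, n_blocks)).
    """
    heads, n_blocks, _ = k_blocks.shape

    # Query-shared selection (assumed form): pool the chunk's queries so that
    # all of them share one selection per head instead of one per query.
    q_shared = q.mean(axis=1)                              # (heads, dim)
    scores = np.einsum("hd,hbd->hb", q_shared, k_blocks)   # (heads, n_blocks)

    # Adaptive window exclusion (assumed rule): drop the local-window blocks
    # from the candidates only when enough long-range blocks remain.
    n_long_range = n_blocks - window
    if window > 0 and n_long_range >= top_k:
        scores[:, n_blocks - window:] = -np.inf

    k_eff = min(top_k, n_blocks)
    idx = np.argsort(-scores, axis=1)[:, :k_eff]           # highest scores first
    return np.sort(idx, axis=1)

# Usage: 2 heads, 10 history blocks, the last 2 blocks form the local window.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 5, 4))
k_blocks = rng.normal(size=(2, 10, 4))
print(select_blocks(q, k_blocks, top_k=3, window=2))
```

With a short history (fewer long-range blocks than `top_k`), the sketch leaves the local-window blocks in the candidate set, which mirrors the "when enough long-range history exists" condition described above.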

In experiments on long-video benchmarks, OmniMem outperformed strong baselines by 52.3% in Dynamic Degree, a measure of scene movement and temporal variation, while preserving consistency. The framework keeps a memory footprint comparable to that of compression-based methods and runs within standard VRAM constraints. This work is particularly relevant as video diffusion and AR generation models (e.g., OpenAI's Sora, Google's Veo) push toward minute-long clips. By enabling efficient long-range memory retrieval, OmniMem could become a key building block for next-generation video generation systems that need both temporal coherence and dynamic action.

Key Points
  • Adaptive Window Exclusion removes local bias by dropping local-window blocks from selection when enough long-range history exists.
  • Per-Head Scattered KV Access avoids union explosion by keeping each attention head's selections separate and non-contiguous.
  • Achieves 52.3% higher Dynamic Degree than baseline methods while maintaining comparable memory usage.
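
The 'union explosion' point can be illustrated with a small NumPy sketch. It assumes one plausible reading of the term: if every head's selected blocks were merged into a single shared buffer, that buffer could hold up to heads × top_k blocks, whereas a per-head gather reads only top_k blocks per head. The function names `gather_per_head` and `union_buffer_blocks` and the tensor layout are hypothetical, chosen only for this example.

```python
import numpy as np

def gather_per_head(kv_blocks, idx):
    """Each head reads only its own selected blocks.

    kv_blocks: (heads, n_blocks, block_len, dim)
    idx:       (heads, top_k) block indices, possibly non-contiguous
    Returns:   (heads, top_k, block_len, dim)
    """
    heads = kv_blocks.shape[0]
    head_ids = np.arange(heads)[:, None]          # (heads, 1), broadcasts with idx
    return kv_blocks[head_ids, idx]

def union_buffer_blocks(idx):
    """Blocks a shared buffer would hold if all heads' selections were merged."""
    return np.unique(idx).size

# Usage: 4 heads, each selecting 2 different blocks out of 16.
kv = np.arange(4 * 16 * 2 * 3, dtype=float).reshape(4, 16, 2, 3)
idx = np.array([[0, 1], [4, 5], [8, 9], [12, 13]])
per_head = gather_per_head(kv, idx)
print(per_head.shape[1], "blocks per head vs", union_buffer_blocks(idx), "in a merged buffer")
```

In this toy case the heads pick disjoint blocks, so a merged buffer would be four times larger than what any single head needs; the gap grows with the number of heads and with how much their selections diverge.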

Why It Matters

Enables longer, more dynamic AI-generated videos without exploding memory, critical for next-gen video generation tools.