Image & Video

HFViT: Hybrid CNN-Transformer cuts HEVC encoding time while boosting quality

New hybrid model cuts the VMAF BD-rate penalty of fast HEVC partition prediction by up to 7.9 percentage points versus ETH-CNN, with CPU latency within 8% of the CNN baseline.

Deep Dive

The recursive quad-tree partitioning in High Efficiency Video Coding (HEVC) is computationally expensive: the rate-distortion optimization (RDO) search over coding tree unit (CTU) partitions consumes most of the encoding time. Existing deep learning accelerators that predict the partition directly face a trade-off. CNNs are fast but miss long-range dependencies, while transformers capture global context but incur prohibitive CPU latency, a critical issue for deployment on CPU-bound systems.
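To make the cost of the exhaustive search concrete, here is a minimal sketch of a recursive quad-tree RDO decision. It assumes a 64x64 CTU, an 8x8 minimum block size, and a caller-supplied `cost_fn` standing in for the encoder's rate-distortion cost; the function names are illustrative and are not taken from the HFViT paper or from any encoder's API.

```python
def rdo_partition(x, y, size, cost_fn, min_size=8):
    """Exhaustive quad-tree search for the block at (x, y).

    cost_fn(x, y, size) is a hypothetical stand-in for the encoder's
    rate-distortion cost of coding the block without splitting it.
    Returns (best_cost, tree, evaluations), where tree is either the
    block size (no split) or a list of four sub-trees (split).
    """
    evaluations = 1
    no_split_cost = cost_fn(x, y, size)
    if size <= min_size:
        return no_split_cost, size, evaluations

    half = size // 2
    split_cost = 0.0
    children = []
    # Recurse into the four quadrants in raster order.
    for dy in (0, half):
        for dx in (0, half):
            cost, tree, count = rdo_partition(x + dx, y + dy, half, cost_fn, min_size)
            split_cost += cost
            children.append(tree)
            evaluations += count

    # Keep the split only if it is strictly cheaper than coding the block whole.
    if split_cost < no_split_cost:
        return split_cost, children, evaluations
    return no_split_cost, size, evaluations


# Toy cost that never rewards splitting; a real encoder would measure
# distortion and bit rate for each candidate block.
best_cost, tree, evaluations = rdo_partition(0, 0, 64, lambda x, y, s: float(s * s))
```

With these sizes the search visits 1 + 4 + 16 + 64 = 85 blocks per CTU regardless of the final decision. That is the work a learned partition predictor tries to avoid by estimating the tree directly.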

To solve this, Krishna Kumar Sharma and Somdyuti Paul present HFViT (Hybrid Fast Vision Transformer), which fuses a reparameterized depthwise-separable convolutional backbone with a Hierarchical Attention Transformer (HAT). The HAT uses a carrier token scheme to propagate global information efficiently at sub-quadratic complexity. Post-training structural fusion collapses batch normalization into the adjacent layers to further reduce latency. On standard JCT-VC test sequences, HFViT reduces the average VMAF BD-rate penalty by 2.4, 2.6, and 7.9 percentage points on Classes A, B, and E respectively versus the ETH-CNN baseline. It keeps CPU inference latency within 8% of the CNN baseline and runs 40% faster than that baseline on GPU, making real-time encoder integration practical.
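The summary does not spell out the fusion procedure, so the following is only a generic sketch of the standard way to fold inference-time batch normalization into a preceding convolution, shown for a 1x1 (pointwise) convolution in plain Python. The function names and the toy numbers are assumptions for illustration, not the authors' implementation.

```python
import math


def pointwise_conv(x, weight, bias):
    """1x1 convolution at a single pixel: y[o] = sum_i weight[o][i] * x[i] + bias[o]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch norm using fixed running statistics per channel."""
    return [g * (yi - mu) / math.sqrt(v + eps) + bt
            for yi, g, bt, mu, v in zip(y, gamma, beta, mean, var)]


def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Return (weight, bias) of a single conv equivalent to conv followed by batch norm."""
    folded_w, folded_b = [], []
    for row, b, g, bt, mu, v in zip(weight, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)        # per-output-channel scale
        folded_w.append([w * scale for w in row])
        folded_b.append((b - mu) * scale + bt)
    return folded_w, folded_b


# Toy two-channel example (all values are made up).
weight = [[0.5, -1.0], [2.0, 0.25]]
bias = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.3, -0.1]
mean, var = [0.2, 1.0], [0.9, 0.4]
x = [1.0, -2.0]

reference = batchnorm(pointwise_conv(x, weight, bias), gamma, beta, mean, var)
fused_w, fused_b = fold_batchnorm(weight, bias, gamma, beta, mean, var)
fused = pointwise_conv(x, fused_w, fused_b)  # matches reference up to rounding
```

Because the running statistics are constants at inference time, the normalization is an affine map per channel and can be absorbed into the convolution's weights and bias. The fused layer then does one pass over the data instead of two.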

Key Points
  • HFViT lowers the average VMAF BD-rate penalty by 2.4, 2.6, and 7.9 percentage points on Classes A, B, and E compared to the ETH-CNN baseline.
  • CPU inference latency is within 8% of the pure CNN baseline, addressing deployment concerns for CPU-bound systems.
  • GPU inference is 40% faster than the CNN baseline, enabled by the carrier token scheme for sub-quadratic global context propagation.

Why It Matters

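By predicting CTU partitions more accurately at close to CNN-level CPU cost, this hybrid model narrows the compression-efficiency gap that fast partition prediction introduces, reducing bandwidth use without sacrificing real-time performance.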