NeuroFlow accelerates ViTs 55.8× by cutting redundant background tokens
Vision Transformers can waste up to 90% of their compute on stationary background such as asphalt; NeuroFlow gates those tokens out.
Vision Transformers (ViTs) have become the backbone of modern video analysis, but they suffer from a fundamental inefficiency: they repeatedly process redundant background regions across frames, wasting up to 90% of compute. NeuroFlow solves this by introducing a dynamic routing framework that exploits temporal redundancy. It uses an Exponential Moving Average (EMA) of patch-level embeddings to measure 'semantic surprise' — tokens with low surprise are considered redundant and are gated out before entering the expensive self-attention layers. The framework is architecture-agnostic and requires no fine-tuning or weight modifications.
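The gating idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the article does not say how surprise is computed, so cosine distance to the EMA is assumed here, and the names `EmaSurpriseGate`, `decay`, and `threshold` are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (num_patches, dim) arrays."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den


class EmaSurpriseGate:
    """Hypothetical gate: tracks a per-patch EMA and keeps only surprising patches."""

    def __init__(self, decay=0.9, threshold=0.1):
        self.decay = decay          # assumed EMA decay, not from the article
        self.threshold = threshold  # assumed surprise cutoff, not from the article
        self.ema = None             # filled in on the first frame

    def step(self, patches):
        """Return a boolean mask; True means the patch should reach self-attention."""
        if self.ema is None:
            # Nothing to compare against yet, so every patch is kept.
            self.ema = patches.copy()
            return np.ones(len(patches), dtype=bool)
        # Assumption: "semantic surprise" = cosine distance to the running average.
        surprise = 1.0 - cosine_similarity(patches, self.ema)
        keep = surprise > self.threshold
        # Blend the new frame into the running average for every patch.
        self.ema = self.decay * self.ema + (1.0 - self.decay) * patches
        return keep


# Toy frames: 4 patches with 8-dimensional embeddings; only patch 2 changes.
rng = np.random.default_rng(0)
frame0 = rng.normal(size=(4, 8))
frame1 = frame0.copy()
frame1[2] = -frame1[2]

gate = EmaSurpriseGate()
keep0 = gate.step(frame0)  # first frame: all patches kept
keep1 = gate.step(frame1)  # second frame: only the changed patch is kept
```

In this toy run the static patches have near-zero surprise and are gated out, while the flipped patch passes. A real deployment would need to tune the decay and threshold per model and per scene.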
NeuroFlow offers two main architectures. Architecture C (Dual-Memory Reconstruction) combines a Layer 0 Retinal Gate with a Layer 12 Cortical Cache and reaches 71.55% zero-shot top-1 accuracy at 84.0% token sparsity on SigLIP, retaining 92.4% of the dense model's accuracy. Architecture B (Extreme Wall-Clock Speedup) physically removes stationary tokens before the encoder, cutting inference time for a 1792p SigLIP 2 model from 678 ms to 11.9 ms, a reported 55.80× speedup at 97.37% embedding fidelity. The team also tested the approach on an LLM (Phi-3-mini) and reports that similarity-gated bypass causes 0% token drift in syntactically constrained generation. Code and paper are available on GitHub.
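One plausible reading of how token removal and a feature cache fit together is sketched below. It is an assumption-laden illustration: the article gives no implementation details, so `encode_with_cache` and `toy_encoder` are hypothetical names, and the stand-in encoder works on each token independently, which real self-attention does not.

```python
import numpy as np


def toy_encoder(x):
    """Stand-in for a ViT encoder; a real one mixes information across tokens."""
    return x * 2.0


def encode_with_cache(tokens, keep, encoder, cache):
    """Encode only the kept tokens and reuse cached features for the rest.

    tokens: (num_tokens, dim) input embeddings for the current frame
    keep:   (num_tokens,) boolean mask from a gate
    cache:  (num_tokens, out_dim) encoder outputs from earlier frames
    """
    out = cache.copy()
    if keep.any():
        # The encoder sees a shorter sequence: gated tokens are physically removed.
        out[keep] = encoder(tokens[keep])
    return out


# Frame 0: dense pass fills the cache.
tokens_t0 = np.arange(12, dtype=float).reshape(6, 2)
cache = toy_encoder(tokens_t0)

# Frame 1: only token 4 changes, so only token 4 is re-encoded.
tokens_t1 = tokens_t0.copy()
tokens_t1[4] += 10.0
keep = np.array([False, False, False, False, True, False])

features = encode_with_cache(tokens_t1, keep, toy_encoder, cache)
sparsity = 1.0 - keep.mean()  # fraction of tokens skipped in this frame
```

Because self-attention cost grows with sequence length, shrinking the sequence before the encoder is where a wall-clock saving would come from; the cache is what lets downstream consumers still receive a full-length feature map.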
- NeuroFlow achieves 55.8× wall-clock speedup on 1792p SigLIP 2 video inference with 97.37% embedding fidelity.
- Architecture C retains 92.4% of dense accuracy at 84.0% token sparsity, with zero-shot 71.55% top-1 accuracy.
- The method is training-free and also works on LLMs (Phi-3-mini) with 0% token drift in constrained generation.
Why It Matters
NeuroFlow could enable real-time, high-resolution video inference on edge devices and cloud servers without costly retraining.