Research & Papers

LE-NeuS: Latency-Efficient Neuro-Symbolic Video Understanding via Adaptive Temporal Verification

New framework cuts the latency gap by 9x while preserving >10% accuracy gains on complex temporal queries.

Deep Dive

A research team led by Gourav Datta has introduced LE-NeuS, a novel neuro-symbolic framework designed to make complex video understanding practical for real-world applications. Existing methods that combine neural networks with symbolic logic for long-form video question answering (LVQA) have shown impressive accuracy gains by using formal verification for temporal reasoning. However, these approaches come with a crippling drawback: they can be up to 90 times slower than simply prompting a base Vision-Language Model (VLM), rendering them unusable for latency-sensitive scenarios such as autonomous systems or real-time monitoring. LE-NeuS directly tackles this bottleneck, aiming to preserve the reasoning benefits while drastically cutting computational cost.

The key innovation lies in optimizing the two most expensive steps of neuro-symbolic video analysis. First, LE-NeuS employs a CLIP-guided, two-stage adaptive sampling technique that skips semantically redundant video frames while preserving the critical temporal boundaries where actions change. Second, it implements batched proposition detection, which parallelizes the VLM's work across temporal windows instead of processing frames sequentially. Tested on benchmarks such as LongVideoBench and Video-MME using NVIDIA H100 GPUs, the framework reduced the latency gap from 90x to roughly 10x. Crucially, it maintained accuracy improvements of over 10% on temporally complex queries, such as those requiring an understanding of event sequences or causality. The result bridges high-accuracy symbolic reasoning and the speed demands of deployable AI.
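To make the two optimizations concrete, here is a minimal sketch of the frame-sampling idea. It is a hypothetical reading of the technique, not the paper's code: `embed` stands in for any CLIP image encoder returning unit-norm vectors, and the thresholds and densification radius are illustrative placeholders rather than the paper's values.

```python
# Hypothetical sketch of CLIP-guided, two-stage adaptive sampling.
# Assumes `embed(frame)` returns an L2-normalized CLIP embedding;
# all thresholds below are illustrative, not from the paper.
import numpy as np

def adaptive_sample(frames, embed, keep_thresh=0.92,
                    boundary_thresh=0.75, dense_radius=2):
    """Return indices of frames to keep for VLM processing.

    Stage 1: greedily drop frames that are semantically redundant with
             the last kept frame (cosine similarity above keep_thresh).
    Stage 2: re-add dense neighborhoods around likely action boundaries,
             i.e. places where consecutive-frame similarity dips.
    """
    embs = np.stack([embed(f) for f in frames])  # (N, D), unit-norm
    keep = [0]
    for i in range(1, len(frames)):
        # Stage 1: skip frames too similar to the last kept frame.
        if float(embs[i] @ embs[keep[-1]]) < keep_thresh:
            keep.append(i)
    kept = set(keep)
    # Stage 2: a similarity dip between neighbors marks a temporal boundary.
    sims = np.einsum("nd,nd->n", embs[:-1], embs[1:])
    for i in np.where(sims < boundary_thresh)[0]:
        for j in range(max(0, i - dense_radius),
                       min(len(frames), i + dense_radius + 1)):
            kept.add(int(j))
    return sorted(kept)
```

The second optimization can be sketched in the same hedged spirit. Here `vlm_batch` is a hypothetical stand-in for a VLM interface that accepts a batch of (clip, prompt) requests and answers them in one parallelized inference pass; a real system would collate the inputs into a single GPU forward pass.

```python
# Hypothetical sketch of batched proposition detection: evaluate every
# proposition on every temporal window in one batched VLM call instead
# of querying frame by frame. `vlm_batch` is an assumed interface.
def detect_propositions(windows, propositions, vlm_batch):
    requests = [
        (window, f"Does this clip show: {prop}? Answer yes or no.")
        for window in windows
        for prop in propositions
    ]
    answers = vlm_batch(requests)  # one parallelized inference pass
    # Reshape the flat answer list into a per-window truth table.
    n = len(propositions)
    return [
        {prop: ans.strip().lower().startswith("yes")
         for prop, ans in zip(propositions, answers[i * n:(i + 1) * n])}
        for i in range(len(windows))
    ]
```

In a pipeline like the one described, the per-window truth values would then feed the symbolic temporal-logic verifier, which is where the accuracy gains on sequence and causality queries come from.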

Key Points
  • Cuts neuro-symbolic video AI latency overhead from 90x to ~10x relative to base VLM prompting
  • Maintains >10% accuracy gains on temporally complex queries in benchmarks like LongVideoBench
  • Uses CLIP-guided adaptive sampling and batched VLM inference to skip redundant frames and parallelize work

Why It Matters

Enables complex, reasoning-based video analysis for real-time applications like autonomous vehicles and security, moving it from research to deployment.