LiteVLA-H: Dual-Rate Vision-Language-Action Inference for Onboard Aerial Guidance and Semantic Perception
Drones get real-time guidance and scene understanding from a single 256M-parameter model on edge hardware.
LiteVLA-H targets a critical bottleneck for aerial VLA deployment: achieving low-latency, closed-loop guidance alongside rich semantic understanding on power-constrained edge devices. The model is designed for dual-rate operation on an NVIDIA Jetson AGX Orin, a common embedded platform for drones. Its fast outer-loop guidance mode produces short action-token outputs at 50.65 ms (19.74 Hz), enabling reactive flight control. A slower semantic mode handles scene understanding, hazard description, and operator-facing narration at 149.90–164.57 ms (6.08–6.67 Hz). The key insight is that end-to-end latency in this edge regime is dominated by multimodal pre-fill rather than marginal decoding cost, so the scheduler can interleave both types of tasks without redesigning the architecture.
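A minimal sketch of how this dual-rate interleaving could look in practice, assuming a shared pre-fill step that feeds both branches; the scheduler class and the `prefill` / `decode_action` / `describe_scene` methods are hypothetical stand-ins, not APIs from the paper.

```python
class DualRateScheduler:
    """Interleave the fast action branch with the slower semantic branch.

    The `model` object and its methods are hypothetical placeholders for the
    shared 256M-parameter backbone described in the paper.
    """

    def __init__(self, model, semantic_every_n: int = 3):
        self.model = model
        self.semantic_every_n = semantic_every_n  # run the slow branch every N control ticks
        self._tick = 0

    def step(self, image, instruction):
        # Pre-fill dominates end-to-end latency in this edge regime, so it is
        # computed once per frame and reused by whichever branch runs this tick.
        kv_cache = self.model.prefill(image, instruction)

        action = self.model.decode_action(kv_cache)  # short action-token output (fast branch)
        caption = None
        self._tick += 1
        if self._tick % self.semantic_every_n == 0:
            # Slower branch: scene description / hazard narration for the operator.
            caption = self.model.describe_scene(kv_cache)
        return action, caption
```

In a real stack the slow branch would more likely run asynchronously so it never blocks the control tick; the synchronous version above just makes the interleaving explicit.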
To maintain both reactive flight skills and descriptive language competence, the authors use a knowledge-preserving fine-tuning recipe that mixes three data types: reactive flight trajectories, aerial semantic annotations, and generic caption/VQA supervision. Compared to recent state-of-the-art architectures such as AnywhereVLA, FutureVLA, and ReMem-VLA, LiteVLA-H achieves a higher edge inference rate on the action branch while retaining periodic semantic awareness. The paper (arXiv:2605.00884) demonstrates a practical path for deploying VLA models on drones for real-world guidance and perception tasks.
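As a rough illustration of that three-way mixture, a batch sampler along the following lines could draw fine-tuning examples; the dataset keys and mixture weights are assumptions made for the sketch, not ratios reported in the paper.

```python
import random

# Hypothetical mixture over the three supervision sources named above;
# the weights are illustrative placeholders, not values from the paper.
MIXTURE = {
    "reactive_flight_trajectories": 0.5,  # closed-loop guidance / action supervision
    "aerial_semantic_annotations": 0.3,   # hazard description, aerial scene labels
    "generic_caption_vqa": 0.2,           # keeps general language competence intact
}

def sample_batch(datasets: dict, batch_size: int) -> list:
    """Draw a mixed batch so no single skill dominates the gradient signal."""
    names, weights = zip(*MIXTURE.items())
    sources = random.choices(names, weights=weights, k=batch_size)
    return [random.choice(datasets[name]) for name in sources]
```

Keeping generic caption/VQA examples in every batch is the recipe's guard against forgetting general language competence while the model learns reactive flight.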
- LiteVLA-H is a 256M-parameter VLA model optimized for onboard inference on an NVIDIA Jetson AGX Orin.
- Action branch runs at 19.74 Hz (50.65 ms latency); semantic branch runs at 6.08–6.67 Hz (149–165 ms).
- Knowledge-preserving fine-tuning combines reactive flight data, aerial semantic data, and generic caption/VQA to avoid catastrophic forgetting.
Why It Matters
Enables drones to navigate reactively and understand their surroundings using a single compact model on edge hardware.