Research & Papers

Leveraging Large Vision Model for Multi-UAV Co-perception in Low-Altitude Wireless Networks

A new AI framework slashes drone communication overhead by 85% while improving collaborative object detection.

Deep Dive

A research team led by Yunting Xu has developed a framework called Base-Station-Helped UAV (BHU) to address a critical bottleneck in drone swarm operations: transmitting massive amounts of visual data over wireless links. Instead of sending full frames, each drone applies a 'Top-K' selection mechanism that identifies and transmits only the most informative pixels from its RGB camera feed. The sparsified data is then sent to a ground server over a multi-user MIMO (MU-MIMO) uplink, drastically cutting the required bandwidth.
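The Top-K idea can be made concrete with a minimal sketch. Here raw intensity stands in for the informativeness score, and only (index, value) pairs for the kept pixels form the transmitted payload; the scoring and encoding actually used in the BHU framework are not specified in this summary.

```python
import heapq

def topk_sparsify(pixels, keep_ratio=0.15):
    """Keep only the top-K pixels of a flattened grayscale frame.

    Informativeness is approximated here by raw intensity; the real
    framework's scoring function may differ.
    """
    k = max(1, int(keep_ratio * len(pixels)))
    # (index, value) pairs of the k highest-scoring pixels --
    # this is all that needs to go over the wireless link.
    kept = heapq.nlargest(k, enumerate(pixels), key=lambda iv: iv[1])
    # Sparse reconstruction on the receiver side: zero everywhere
    # except the kept pixels.
    sparse = [0] * len(pixels)
    for i, v in kept:
        sparse[i] = v
    return sparse, kept

# Toy 64x64 frame, flattened; keep 15% of pixels.
frame = [(i * 37) % 256 for i in range(4096)]
sparse, payload = topk_sparsify(frame, keep_ratio=0.15)
print(len(payload), len(frame))  # 614 4096
```

Keeping 15% of pixels is exactly where the headline figure comes from: the payload carries roughly 15% of the original values, an 85% reduction before any further compression.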

On the ground server, a large vision model with a Swin-Large backbone and a MaskDINO encoder takes over. It processes the sparse data from multiple drones, fuses their viewpoints into a unified bird's-eye-view (BEV) representation, and performs cooperative perception tasks such as detecting and tracking ground vehicles. To optimize the pipeline end to end, the team developed a diffusion model-based deep reinforcement learning (DRL) algorithm that dynamically makes three key decisions: which drones should collaborate, how aggressively to sparsify each image, and how to configure the wireless precoding matrices, balancing communication efficiency against perception accuracy.
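The multi-view fusion step can be illustrated with a deliberately simplified reduction. This sketch assumes each drone's features have already been projected onto the same bird's-eye-view grid and fuses them by element-wise max; the real pipeline uses a learned transformer-based fusion, not this hand-written rule.

```python
def fuse_bev(grids):
    """Element-wise max fusion of per-UAV BEV confidence grids.

    Assumes every drone's sparse features were already warped onto a
    common HxW bird's-eye-view grid. Illustrative only: the BHU system
    fuses views with a Swin-Large/MaskDINO model, not a fixed max rule.
    """
    h, w = len(grids[0]), len(grids[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for g in grids:
        for r in range(h):
            for c in range(w):
                if g[r][c] > fused[r][c]:
                    fused[r][c] = g[r][c]
    return fused

# Two drones observe the same 3x3 area from different angles; each one
# sees some targets more confidently than the other.
uav_a = [[0.9, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.0]]
uav_b = [[0.1, 0.0, 0.0], [0.0, 0.8, 0.0], [0.0, 0.0, 0.4]]
print(fuse_bev([uav_a, uav_b]))
# [[0.9, 0.0, 0.0], [0.0, 0.8, 0.0], [0.0, 0.0, 0.4]]
```

The point of the toy: fusion lets each grid cell take the strongest evidence from any viewpoint, which is why collaboration improves detection even though each drone transmits only sparse data.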

The results, validated on the Air-Co-Pred dataset, are significant: the BHU framework improves overall perception performance by more than 5% while cutting communication overhead by a remarkable 85% compared with conventional convolutional neural network (CNN)-based fusion methods. This provides a practical, communication-efficient path to deploying intelligent drone swarms in real-world, resource-constrained low-altitude wireless environments, paving the way for scalable applications in surveillance, logistics, and emergency response.

Key Points
  • Uses a Top-K pixel selector to sparsify drone images, reducing data volume before transmission.
  • Employs a Swin-Large-based MaskDINO encoder for bird's-eye-view feature fusion and object perception.
  • A diffusion-based DRL algorithm jointly optimizes UAV selection, sparsification, and wireless precoding, cutting comms overhead by 85%.
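As a back-of-envelope check on what an 85% cut means per frame, the arithmetic below assumes an uncompressed 8-bit RGB 1080p image purely for illustration; the actual payloads in the paper (sparse pixel values plus their indices, before channel coding) will differ.

```python
# Hypothetical frame size: 1080p, 8-bit RGB, uncompressed.
width, height, channels = 1920, 1080, 3
full_bytes = width * height * channels          # ~6.2 MB per raw frame
reduced_bytes = int(full_bytes * (1 - 0.85))    # what 15% of that is
print(full_bytes, reduced_bytes)  # 6220800 933120
```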

Why It Matters

Enables real-time, large-scale drone swarms for surveillance and delivery by solving the critical bandwidth bottleneck.