MAEPose: Self-supervised mmWave radar pose estimation beats baselines by up to 22%
Uses unlabelled radar video and masked autoencoding to estimate human poses while preserving privacy.
MAEPose, developed by Xijia Wei, Yuan Fang, Kevin Chetty, Youngjun Cho, and Nadia Bianchi-Berthouze, tackles a key challenge in human pose estimation: preserving privacy while maintaining accuracy. The method operates directly on millimetre-wave (mmWave) radar spectrogram videos, avoiding the information loss and added complexity of pre-processing into sparse point clouds or spectrograms. Using a masked autoencoder architecture, MAEPose learns generalized spatiotemporal features from unlabelled radar data, then employs a heatmap decoder for multi-frame pose prediction.
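To make the masked-autoencoding idea concrete, here is a minimal NumPy sketch of the masking step only: a radar video is cut into non-overlapping spatiotemporal patches and a random subset is hidden, so that an encoder would see only the visible patches and a decoder would be trained to reconstruct the rest. The function name `random_mask_patches`, the patch size, and the 75% mask ratio are illustrative assumptions, not values reported for MAEPose.

```python
import numpy as np


def random_mask_patches(video, patch=(2, 8, 8), mask_ratio=0.75, seed=0):
    """Split a (T, H, W) video into spatiotemporal patches and hide a random subset.

    The patch size and mask ratio are assumed values for illustration only.
    Returns (visible_patches, mask, all_patches); mask[i] is True if patch i is hidden.
    """
    t, h, w = video.shape
    pt, ph, pw = patch
    if t % pt or h % ph or w % pw:
        raise ValueError("video dimensions must be divisible by the patch size")

    # Rearrange into (num_patches, values_per_patch).
    patches = (
        video.reshape(t // pt, pt, h // ph, ph, w // pw, pw)
        .transpose(0, 2, 4, 1, 3, 5)
        .reshape(-1, pt * ph * pw)
    )

    # Choose which patches to hide, uniformly at random.
    num_patches = patches.shape[0]
    num_masked = int(round(num_patches * mask_ratio))
    rng = np.random.default_rng(seed)
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True

    return patches[~mask], mask, patches


# Toy input standing in for a single-channel radar video clip.
video = np.random.default_rng(1).random((8, 32, 32)).astype(np.float32)
visible, mask, patches = random_mask_patches(video)
print(visible.shape, int(mask.sum()), patches.shape)
```

In a full pipeline, the reconstruction loss would typically be computed only on the hidden patches, which is what lets the model learn from radar data that has no pose labels.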
Evaluated on three datasets using leave-one-person-out cross-validation, MAEPose consistently outperformed existing baselines, reducing Mean Per Joint Position Error (MPJPE, where lower is better) by up to 22.1%, a statistically significant difference (p<0.05). It also proved resilient to unseen bystanders: error rose by only 6.5% under zero-shot interference. Ablation studies confirmed that both the self-supervised pre-training and the heatmap decoder are critical to performance. The team also found that Range-Doppler video alone yields better results than Range-Azimuth video or a fusion of the two, at lower computational cost. This positions MAEPose as a strong, privacy-friendly alternative for real-world applications such as elderly care, fitness tracking, and human-computer interaction.
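For readers unfamiliar with the evaluation terms, the sketch below shows how MPJPE, a relative error reduction, and leave-one-person-out folds are commonly computed. It assumes poses are stored as arrays of shape (frames, joints, dims); the helper names and toy inputs are hypothetical and do not come from the MAEPose code or data.

```python
import numpy as np


def mpjpe(pred, target):
    """Mean Per Joint Position Error: average Euclidean distance per joint.

    pred and target have shape (frames, joints, dims); the result is in the
    same units as the inputs.
    """
    return float(np.linalg.norm(pred - target, axis=-1).mean())


def relative_reduction(baseline_err, model_err):
    """Percentage by which model_err is lower than baseline_err."""
    return 100.0 * (baseline_err - model_err) / baseline_err


def leave_one_person_out(subject_ids):
    """Yield (held_out, train_idx, test_idx), holding out one person per fold."""
    subject_ids = np.asarray(subject_ids)
    for held_out in np.unique(subject_ids):
        test_idx = np.flatnonzero(subject_ids == held_out)
        train_idx = np.flatnonzero(subject_ids != held_out)
        yield held_out, train_idx, test_idx


# Toy example: every predicted joint is offset by (3, 4, 0) from the target.
target = np.zeros((2, 17, 3))
pred = target + np.array([3.0, 4.0, 0.0])
print(mpjpe(pred, target))

# Toy subject labels, one per sample.
for person, train_idx, test_idx in leave_one_person_out(["a", "a", "b", "c"]):
    print(person, train_idx.tolist(), test_idx.tolist())
```

Holding out all samples from one person per fold means each test score reflects performance on someone the model never saw during training.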
- First direct application of masked autoencoding to raw mmWave radar video for pose estimation, skipping intermediate representations.
- Achieves up to 22.1% lower MPJPE than state-of-the-art methods (p<0.05) across three datasets.
- Only 6.5% error increase under zero-shot bystander interference, demonstrating strong real-world robustness.
Why It Matters
By improving both the accuracy and the bystander robustness of radar-based pose estimation, MAEPose makes privacy-preserving deployment more practical in settings where cameras would capture sensitive visual data.