MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video
Uses masked autoencoding on unlabelled radar video to estimate human poses while preserving privacy.
MAEPose, developed by Xijia Wei, Yuan Fang, Kevin Chetty, Youngjun Cho, and Nadia Bianchi-Berthouze, tackles a key challenge in human pose estimation: preserving privacy while maintaining accuracy. The method operates directly on millimetre-wave (mmWave) radar spectrogram videos, avoiding the information loss and added complexity of pre-processing into sparse point clouds or spectrograms. Using a masked autoencoder architecture, MAEPose learns generalized spatiotemporal features from unlabelled radar data, then employs a heatmap decoder for multi-frame pose prediction.
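The masking step at the heart of a masked autoencoder can be sketched as follows. This is a minimal illustration, not the authors' implementation: the video shape, the patch size, the 75% mask ratio, and the helper names `patchify` and `random_mask` are all assumptions chosen for the example. In a typical masked autoencoder, only the visible patches are passed to the encoder, and the model is trained to reconstruct the hidden ones.

```python
import numpy as np

def patchify(video, pt, ph, pw):
    """Split a (T, H, W) video into non-overlapping spatiotemporal patches.

    Returns an array of shape (num_patches, pt * ph * pw).
    """
    T, H, W = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw)
    # Bring the three patch-grid axes to the front, patch contents to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5)
    return x.reshape(-1, pt * ph * pw)

def random_mask(num_patches, mask_ratio, rng):
    """Pick a random subset of patches to keep visible.

    Returns (keep_idx, mask), where mask[i] is True if patch i is hidden.
    """
    num_keep = int(round(num_patches * (1 - mask_ratio)))
    keep_idx = np.sort(rng.permutation(num_patches)[:num_keep])
    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False
    return keep_idx, mask

# Hypothetical input: 8 frames of a 32x32 radar spectrogram (random stand-in data).
rng = np.random.default_rng(0)
video = rng.standard_normal((8, 32, 32)).astype(np.float32)

patches = patchify(video, pt=2, ph=8, pw=8)       # 4 * 4 * 4 = 64 patches
keep_idx, mask = random_mask(len(patches), mask_ratio=0.75, rng=rng)
visible = patches[keep_idx]                        # what the encoder would see
```

Because the reconstruction target is the input itself, no pose labels are needed at this stage, which is what allows pre-training on unlabelled radar video.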
Evaluated on three datasets using leave-one-person-out cross-validation, MAEPose consistently outperformed existing baselines, reducing Mean Per Joint Position Error (MPJPE) by up to 22.1%, a statistically significant improvement (p<0.05). It also proved resilient to unseen bystanders: error rose by only 6.5% under zero-shot interference. Ablation studies confirmed that both the self-supervised pre-training and the heatmap decoder are critical to performance. The team also found that using Range-Doppler video alone gives better results than Range-Azimuth video or a fusion of both, at lower computational cost. This positions MAEPose as a strong, privacy-friendly alternative for real-world applications such as elderly care, fitness tracking, and human-computer interaction.
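For readers unfamiliar with the evaluation terms, the sketch below shows how MPJPE, a leave-one-person-out split, and a relative error reduction are commonly computed. It is an illustrative sketch rather than the paper's evaluation code: the function names, array shapes, and all numbers are made up for the example.

```python
import numpy as np

def mpjpe(pred, target):
    """Mean Per Joint Position Error.

    pred, target: arrays of shape (num_frames, num_joints, dims).
    Returns the mean Euclidean distance over all joints and frames.
    """
    return float(np.linalg.norm(pred - target, axis=-1).mean())

def leave_one_person_out(subject_ids):
    """Yield (held_out_person, train_indices, test_indices) for each person."""
    subject_ids = np.asarray(subject_ids)
    for person in np.unique(subject_ids):
        test_idx = np.flatnonzero(subject_ids == person)
        train_idx = np.flatnonzero(subject_ids != person)
        yield person, train_idx, test_idx

def relative_improvement(baseline_err, model_err):
    """Percentage reduction in error relative to a baseline."""
    return 100.0 * (baseline_err - model_err) / baseline_err

# Toy data: every predicted joint is offset by (3, 4, 0), so each error is 5.
target = np.zeros((2, 3, 3))
pred = target + np.array([3.0, 4.0, 0.0])
toy_error = mpjpe(pred, target)

# Toy subject labels for six samples recorded from three people.
folds = list(leave_one_person_out(["a", "a", "b", "b", "c", "c"]))

# Made-up errors: a baseline at 50.0 and a model at 40.0 is a 20% reduction.
toy_gain = relative_improvement(50.0, 40.0)
```

Holding out all samples from one person per fold means each test score reflects performance on someone the model has never seen during training.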
- First direct application of masked autoencoding to raw mmWave radar video for pose estimation, skipping intermediate representations.
- Achieves up to 22.1% lower MPJPE than state-of-the-art methods (p<0.05) across three datasets.
- Only 6.5% error increase under zero-shot bystander interference, demonstrating strong real-world robustness.
Why It Matters
MAEPose improves both the accuracy and the bystander robustness of radar-based human pose estimation, making privacy-preserving deployment more practical in settings where cameras would capture sensitive visual data.