Audio & Speech

Photogrammetry-Reconstructed 3D Head Meshes for Accessible Individual Head-Related Transfer Functions

Consumer photogrammetry using Apple's API produced HRTFs that performed worse than random assignments in localization tests.

Deep Dive

A new study led by researchers from Imperial College London and the University of Surrey has delivered a sobering verdict on the current feasibility of using consumer-grade photogrammetry for personalized spatial audio. The team investigated whether 3D head and ear meshes, reconstructed from 72 smartphone photos per subject using Apple's Object Capture API, could serve as a practical baseline for synthesizing individual Head-Related Transfer Functions (HRTFs). HRTFs are the acoustic filters that describe how your unique head and ear shape modifies sound, enabling accurate 3D audio rendering in headphones. The study processed data from 150 subjects in the SONICOM dataset, using the Mesh2HRTF tool to compute synthetic audio profiles from the photogrammetry-reconstructed (PR) meshes.
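In practice, an HRTF is applied by convolving a mono source with the left- and right-ear head-related impulse responses (HRIRs, the time-domain form of the HRTF) for the desired direction. Here is a minimal sketch of that rendering step; the function name and the toy impulse responses are illustrative, not taken from the study or from Mesh2HRTF.

```python
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Convolve a mono signal with a pair of HRIRs to produce a
    two-channel binaural signal (columns: left ear, right ear)."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)

# Toy example: an impulse source and made-up HRIRs for a source on
# the listener's left, so the right ear hears it later and quieter.
mono = np.zeros(8)
mono[0] = 1.0
hrir_l = np.array([1.0, 0.0, 0.0])   # direct path, full level
hrir_r = np.array([0.0, 0.0, 0.5])   # 2-sample delay, half amplitude
out = render_binaural(mono, hrir_l, hrir_r)
```

The delay and level asymmetry baked into the two HRIRs are exactly the interaural cues (ITD and ILD) discussed below.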

Despite the method's accessibility, the results were clear: the PR synthetic HRTFs fell short. While they preserved interaural time differences (ITDs), the timing cue for left/right localization, they exhibited significant errors in interaural level differences (ILDs) and, critically, in high-frequency spectral detail. In a behavioral sound localization experiment with 27 participants, the PR HRTFs led to substantially higher quadrant error rates, reduced elevation accuracy, and a 27% greater rate of front-back confusions compared with gold-standard acoustically measured HRTFs. Strikingly, on key perceptual metrics they performed worse than even randomly assigned HRTFs. The conclusion is that current smartphone-based photogrammetry pipelines, while promising for accessibility, fail to capture the intricate morphology of the pinna (outer ear) in sufficient detail, and that detail is essential for the monaural spectral cues that tell your brain whether a sound is above, below, in front, or behind you.
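The two interaural cues can be estimated from a pair of ear signals with standard textbook methods: ITD as the cross-correlation peak lag, and ILD as an RMS level ratio in decibels. This sketch uses those generic estimators (not the study's evaluation pipeline); the signals are synthetic impulses chosen so the expected values are obvious.

```python
import numpy as np

def itd_samples(left, right):
    """ITD as the lag (in samples) maximizing the cross-correlation
    of the two ear signals; negative means the left ear leads."""
    xc = np.correlate(left, right, mode="full")
    return int(np.argmax(xc)) - (len(right) - 1)

def ild_db(left, right):
    """ILD as the left/right RMS level ratio in decibels."""
    rms = lambda x: np.sqrt(np.mean(np.square(x)))
    return 20.0 * np.log10(rms(left) / rms(right))

# Synthetic ear signals: left-ear impulse 3 samples early and twice
# as large, mimicking a source on the listener's left.
left = np.zeros(16)
right = np.zeros(16)
left[5] = 1.0
right[8] = 0.5
```

Because photogrammetry reproduced the gross head geometry, ITDs like this survived; the fine pinna geometry that shapes high-frequency spectra did not.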

Key Points
  • Study used Apple's Object Capture API on 72 phone photos per subject to create 3D head meshes for 150 people.
  • Synthetic HRTFs from these meshes caused 27% more front-back confusions and worse elevation accuracy than measured HRTFs.
  • The method failed because consumer photogrammetry lacks the detail to capture complex ear shape needed for high-frequency audio cues.

Why It Matters

This tempers expectations for quick, phone-based personalized spatial audio, showing high-fidelity ear scanning remains essential for professional VR/AR applications.