Audio & Speech

AI can now detect biological relatives from voice alone

ReDimNet achieves 20.8% error rate in zero-shot kinship verification...

Deep Dive

A new arXiv paper by Jagabandhu Mishra and Tomi Kinnunen tackles the under-explored task of kinship verification (KV) from voice: deciding whether two speakers are biologically related using only audio. Working with the large-scale KAN-AV dataset, they propose a revised evaluation protocol that controls for confounders and uses family-disjoint train-test splits, so that no family seen in training appears at test time and the task stays open-set. They test three state-of-the-art neural speaker embedding extractors (ECAPA-TDNN, WavLM-ECAPA, and ReDimNet), each combined with various back-ends.
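To make the family-disjoint idea concrete, here is a minimal Python sketch. It is not the authors' protocol: the function name, the speaker and family labels, and the 30% test fraction are all hypothetical and chosen only for illustration. The point it shows is that the split is made at the family level, so every speaker from a given family lands on the same side.

```python
import random


def family_disjoint_split(speaker_to_family, test_fraction=0.3, seed=0):
    """Split speakers so that no family appears in both train and test.

    speaker_to_family: dict mapping a speaker ID to a family ID
    (hypothetical metadata layout, not the KAN-AV format).
    """
    families = sorted(set(speaker_to_family.values()))
    rng = random.Random(seed)
    rng.shuffle(families)

    # Hold out whole families, never individual speakers.
    n_test = max(1, round(len(families) * test_fraction))
    test_families = set(families[:n_test])

    train = [s for s, f in speaker_to_family.items() if f not in test_families]
    test = [s for s, f in speaker_to_family.items() if f in test_families]
    return train, test


# Toy metadata, invented for this example.
speaker_to_family = {
    "spk1": "famA", "spk2": "famA",
    "spk3": "famB", "spk4": "famB",
    "spk5": "famC", "spk6": "famC",
}
train_speakers, test_speakers = family_disjoint_split(speaker_to_family)
print("train:", train_speakers)
print("test:", test_speakers)
```

Splitting by speaker instead of by family could leave a parent in training and a child in testing, letting a trained back-end learn family-specific traits rather than general kinship cues.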

Results highlight the difficulty of the task. In zero-shot KV that includes same-speaker target trials, ReDimNet achieves the best equal error rate (EER) of 20.8%. Under strict kin trials, which exclude same-speaker pairs, the EER rises to 39.7%; for reference, 50% EER corresponds to chance. Their best trainable back-end, which uses asymmetric processing to mitigate age-difference effects, achieves 32.0% EER (18.6% when same-speaker trials are included). The findings show that current speaker embeddings encode some familial cues but are far from reliable for real-world kinship detection, so the results are best read as a baseline for future work on voice-based kinship analysis.
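Since every figure above is an EER, a small sketch of how that metric is computed may help. This is a generic threshold sweep in plain Python, not the paper's evaluation code. The scores are made up, and treating a higher score as "more likely kin" is an assumption of the example.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: scores for trials where the pair is truly related.
    nontarget_scores: scores for trials where the pair is unrelated.
    Higher score is assumed to mean "more likely related".
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # Miss rate: related pairs rejected because they score below t.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # False-alarm rate: unrelated pairs accepted at or above t.
        false_alarm = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            # The EER is the operating point where the two error rates meet.
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


# Invented similarity scores, for illustration only.
kin_scores = [0.62, 0.55, 0.48, 0.71, 0.40]
non_kin_scores = [0.35, 0.52, 0.30, 0.45, 0.28]
print(f"EER: {equal_error_rate(kin_scores, non_kin_scores):.1%}")
```

Read this way, a 39.7% EER means there is an operating point at which roughly four in ten related pairs are rejected and roughly four in ten unrelated pairs are accepted.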

Key Points
  • ReDimNet achieved the best zero-shot result, 20.8% EER, when same-speaker target trials were included
  • EER rose to 39.7% under strict kin-only trials, much closer to the 50% chance level
  • A trainable back-end with asymmetric embedding processing, designed to mitigate age-difference effects, reached 32.0% EER on strict kin trials (18.6% with same-speaker trials included)

Why It Matters

Voice-based kinship verification could eventually enable new biometric applications in forensics, genealogy, and smart assistants, but the error rates reported here are still far too high for real-world use.