SpeakerRPL v2: Robust Open-set Speaker Identification through Enhanced Few-shot Foundation Tuning and Model Fusion
The new model cuts the equal error rate from 1.28% to 0.09% on a key benchmark, achieving near-perfect verification accuracy.
A research team led by Zhiyong Chen has unveiled SpeakerRPL v2, a significant upgrade to their open-set speaker identification framework. The core innovation lies in an enhanced learning objective that combines reciprocal points learning—a technique for better separating known from unknown speakers—with logit normalization (LogitNorm) and adaptive anchor learning. This trio of techniques works to more tightly constrain the model's internal representations of target speakers, dramatically improving its robustness and accuracy when faced with voices it wasn't explicitly trained on.
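The article does not give the exact formulation, but both core ingredients are well-known techniques that can be sketched roughly. In the hypothetical illustration below, each known speaker gets a "reciprocal point" representing its non-class region, a known-class logit is the squared distance from an embedding to that point (unknown voices tend to sit near all reciprocal points and score low everywhere), and LogitNorm then divides the logit vector by its L2 norm and a temperature so the model cannot inflate confidence by growing logit magnitude. All names, dimensions, and values here are illustrative, not from the paper.

```python
import numpy as np

def rpl_scores(embedding, reciprocal_points):
    """Score each known class as the squared distance from the embedding
    to that class's reciprocal point: far from the 'otherness' point
    means more likely the known class."""
    return np.linalg.norm(reciprocal_points - embedding, axis=-1) ** 2

def logitnorm(logits, tau=0.04):
    """LogitNorm: divide logits by their L2 norm and a temperature tau,
    bounding the logit magnitude and hence the softmax confidence."""
    norms = np.linalg.norm(logits, axis=-1, keepdims=True) + 1e-7
    return logits / (norms * tau)

# Toy setup: 3 known speakers, 4-dim embeddings (hypothetical values).
rng = np.random.default_rng(0)
points = rng.normal(size=(3, 4))      # one reciprocal point per speaker
emb = rng.normal(size=(4,))           # a query embedding
normed = logitnorm(rpl_scores(emb, points)[None, :])
print(normed.shape)  # (1, 3): one normalized logit per known speaker
```

After normalization every logit vector has L2 norm 1/tau, so a score threshold for rejecting unknown speakers behaves consistently across inputs.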
To address the instability often seen in few-shot learning scenarios, the team introduced a novel model fusion strategy. This approach aggregates multiple tuned models to reduce run-to-run randomness and improve generalization across diverse conditions. A complementary model selection method ensures that only the best-performing variants are combined. Validated on major datasets including VoxCeleb, ESD, and 3D-Speaker, the system achieved a striking result: on a challenging Vox1-O-like test set, it cut the Equal Error Rate (EER) from 1.28% to just 0.09%, a 93% relative reduction in error.
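The article does not specify the aggregation rule, but one common way to fuse several tuned checkpoints is to average their parameters ("model soup" style), optionally weighting the better-performing variants more heavily. The sketch below is a minimal, hypothetical version of that idea, using plain dictionaries of NumPy arrays to stand in for model state dicts; the paper's actual fusion and selection procedure may differ.

```python
import numpy as np

def fuse_models(state_dicts, weights=None):
    """Fuse tuned models by (weighted) parameter averaging.
    Equal weights by default; a selection step could instead assign
    higher weight to checkpoints that validate better."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    return {
        key: sum(w * sd[key] for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

# Toy example: two "models", each with a single 2x2 projection matrix.
m1 = {"proj": np.array([[1.0, 2.0], [3.0, 4.0]])}
m2 = {"proj": np.array([[3.0, 2.0], [1.0, 0.0]])}
fused = fuse_models([m1, m2])
print(fused["proj"])  # [[2. 2.] [2. 2.]]
```

Averaging dampens the checkpoint-to-checkpoint noise that few-shot tuning introduces, which is the stabilizing effect the article attributes to fusion.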
The paper, accepted for ICASSP 2026, positions SpeakerRPL v2 as a state-of-the-art solution for practical speaker verification. Its ability to learn accurately from just a few examples (few-shot learning) and reliably reject imposters (open-set capability) makes it directly applicable to security-sensitive domains. The accompanying release of code allows developers and researchers to immediately test and integrate these advancements into voice biometrics and authentication pipelines.
- Achieves a 93% relative reduction in error, lowering EER from 1.28% to 0.09% on a VoxCeleb-based benchmark.
- Introduces a model fusion strategy to stabilize few-shot tuning, reducing randomness and improving generalization.
- Designed for open-set identification, meaning it can accurately reject unknown speakers not seen in training.
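The EER quoted above is the operating point where the false-accept rate (impostors admitted) equals the false-reject rate (targets turned away). A simple threshold sweep illustrates the computation; the scores below are toy values, not from the paper.

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Equal Error Rate: sweep thresholds over all observed scores and
    return the rate at the point where false-accept and false-reject
    rates are closest to equal."""
    best_gap, best_rate = 1.0, 1.0
    for thr in np.sort(np.concatenate([target_scores, impostor_scores])):
        far = np.mean(impostor_scores >= thr)  # impostors accepted
        frr = np.mean(target_scores < thr)     # targets rejected
        if abs(far - frr) < best_gap:
            best_gap, best_rate = abs(far - frr), (far + frr) / 2
    return best_rate

tgt = np.array([0.9, 0.8, 0.7, 0.6])    # genuine-trial scores (toy)
imp = np.array([0.5, 0.4, 0.3, 0.65])   # impostor-trial scores (toy)
print(eer(tgt, imp))  # 0.25
```

On this toy data the rates cross at 25%; SpeakerRPL v2's reported 0.09% means fewer than one error per thousand trials at that balanced operating point.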
Why It Matters
Enables highly secure, low-error voice authentication systems that can be trained quickly with minimal user data.