ReDimNet2: Scaling Speaker Verification via Time-Pooled Dimension Reshaping
New architecture achieves 0.287% error rate on VoxCeleb1 benchmark with just 12.3M parameters.
Researchers Ivan Yakovlev and Anton Okhotnikov have introduced ReDimNet2, a significant upgrade to their ReDimNet framework for extracting speaker representations from audio. The core architectural innovation is the strategic introduction of pooling operations over the time dimension within the network's 1D processing pathway. This modification preserves the defining invariant of ReDimNet, namely that the 1D features remain a reshaped view of the 2D features, regardless of temporal resolution. Crucially, it allows the model to scale its channel dimension, where much of the representational power lies, far more aggressively without incurring a proportional increase in computational cost (measured in GMACs).
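The reshape-plus-pooling idea above can be illustrated with a small numpy sketch. This is not the authors' code; the tensor sizes and the stride-2 average pooling are illustrative assumptions, chosen only to show why the 1D-to-2D correspondence survives time pooling.

```python
import numpy as np

# Toy sizes (assumptions, not ReDimNet2's actual dimensions):
# batch, channels, frequency bins, time frames.
B, C, F, T = 1, 4, 8, 100

rng = np.random.default_rng(0)
x2d = rng.standard_normal((B, C, F, T))

# 2D -> 1D: merge the channel and frequency axes. The time axis is
# untouched, so this is an exact, lossless reshape in either direction.
x1d = x2d.reshape(B, C * F, T)

# Time pooling in the 1D pathway: stride-2 average pooling over T.
x1d_pooled = 0.5 * (x1d[..., 0::2] + x1d[..., 1::2])

# The pooled 1D tensor still reshapes cleanly into a 2D tensor, now
# with T/2 frames -- the 1D<->2D relationship holds at the new
# temporal resolution.
x2d_pooled = x1d_pooled.reshape(B, C, F, T // 2)
```

Because the MAC count of a 1D convolution scales roughly with the number of time frames, halving T roughly halves that layer's compute, which is the headroom the authors spend on wider channel dimensions.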
They demonstrated this efficiency by creating a family of seven model configurations, named B0 through B6, ranging from a lightweight 1.1 million parameters and 0.33 GMACs to a larger 12.3 million parameters and 13 GMACs. Benchmarked on the challenging VoxCeleb1 dataset, ReDimNet2 models consistently improve the Pareto frontier, meaning they offer better accuracy for a given computational budget than the previous generation. The top-performing ReDimNet2-B6 model achieved an exceptionally low Equal Error Rate (EER) of just 0.287% on the Vox1-O clean test set. This performance pushes the state-of-the-art for speaker verification, a critical technology for biometric security and personalized voice interfaces.
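For readers unfamiliar with the metric, the Equal Error Rate reported above is the operating point where the false-accept and false-reject rates of a verification system coincide. A minimal sketch of the standard threshold-sweep computation (the function name `eer` and the toy scores are our own, not from the paper):

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Approximate EER: the threshold where FAR and FRR cross."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones_like(target_scores),
                             np.zeros_like(nontarget_scores)])
    order = np.argsort(scores)          # sweep thresholds low -> high
    labels = labels[order]
    # FRR: fraction of genuine trials rejected below each threshold.
    frr = np.cumsum(labels) / labels.sum()
    # FAR: fraction of impostor trials accepted at or above it.
    far = 1 - np.cumsum(1 - labels) / (1 - labels).sum()
    idx = np.argmin(np.abs(frr - far))  # closest crossing point
    return (frr[idx] + far[idx]) / 2

# Toy example: perfectly separated scores give an EER of zero.
tgt = np.array([0.9, 0.85, 0.8, 0.7])
non = np.array([0.1, 0.15, 0.2, 0.3])
print(eer(tgt, non))  # prints 0.0
```

An EER of 0.287% thus means that at the crossing threshold, fewer than 3 in 1,000 trials are misclassified in either direction.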
The work, submitted to Interspeech 2026, provides a clear engineering blueprint for building more efficient audio neural networks. By decoupling model capacity from compute through the time-pooling mechanism, the researchers have opened a path for deploying high-fidelity speaker verification in resource-constrained environments like mobile devices and edge computing scenarios, where both accuracy and efficiency are paramount.
- Introduces time-pooling in 1D pathway, enabling scaling to 12.3M parameters without proportional compute increase.
- Achieves state-of-the-art 0.287% Equal Error Rate on VoxCeleb1-O benchmark with 13 GMACs computational cost.
- Defines a model family (B0-B6) from 1.1M to 12.3M params that improves the accuracy-efficiency Pareto front.
Why It Matters
Enables more accurate and efficient speaker verification for security systems and voice assistants, especially on devices with limited compute.