noisekit CLI generates realistic noisy speech datasets for STT benchmarking
Benchmark STT vendors with degraded phone calls, not clean studio recordings.
If you've ever tried to pick a speech-to-text (STT) vendor for a phone-based voice agent or call center product, you've likely hit this wall: real production audio is unlabeled, so you can't compute Word Error Rate (WER) on it. Public datasets like FLEURS, CommonVoice, and LibriSpeech are clean studio recordings that bear little resemblance to the noisy, G.711-encoded phone calls STT models have to handle in production. Annotating production audio is slow, expensive, and a privacy headache. Most teams end up benchmarking on clean data and only discover in production which vendor actually survives noise.
noisekit fills that gap. You take a clean annotated dataset, apply degradations that approximate your production conditions, and end up with a noisy, annotated corpus on which you can compute WER for every STT candidate. Presets cover telecom (G.711 narrowband bandpass + 8-bit BitCrush + 16–32 kbps MP3), real ambient noise (auto-downloads the MUSAN noise subset, or bring your own), far-field reverb (pyroomacoustics at 1–3 m mic distance), low-bitrate MP3, and clipping. Compound chains stack degradations in a realistic order, e.g., a noisy room followed by a phone codec.
Output is HuggingFace AudioFolder-compatible, and each sample's PESQ, SNR, and NISQA scores are recorded in metadata.jsonl. The repo is github.com/karamouche/noisekit; it is MIT-licensed and runnable via uvx with no install step.
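To make two of these degradations concrete, here is a minimal, standard-library-only sketch of mixing noise into speech at a target SNR and of an 8-bit mu-law round trip. It is not noisekit's implementation: the function names are hypothetical, and the mu-law step uses the continuous companding formula as a rough stand-in for G.711 rather than the bit-exact codec.

```python
import math


def mean_power(samples):
    """Average squared amplitude of a signal."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so speech power / noise power matches `snr_db`, then add it."""
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


def mu_law_roundtrip(samples, mu=255, levels=256):
    """Compress with mu-law, quantize to `levels` steps, then expand again.

    Assumption: this approximates 8-bit telephone companding; real G.711
    uses a segmented approximation of this curve.
    """
    out = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clamp to the valid range
        compressed = math.copysign(math.log1p(mu * abs(s)) / math.log1p(mu), s)
        code = round((compressed + 1) / 2 * (levels - 1))  # integer code 0..255
        decoded = code / (levels - 1) * 2 - 1
        out.append(math.copysign(((1 + mu) ** abs(decoded) - 1) / mu, decoded))
    return out


def measured_snr_db(clean, degraded):
    """SNR in dB, treating (degraded - clean) as the noise."""
    residual = [d - c for c, d in zip(clean, degraded)]
    return 10 * math.log10(mean_power(clean) / mean_power(residual))
```

Chaining `mu_law_roundtrip(mix_at_snr(speech, noise, 10))` mirrors the "noisy room, then phone codec" ordering described above: the noise is added first, so it passes through the codec along with the speech.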
- Supports degradation presets: telecom (G.711 8-bit + 16–32 kbps MP3), noise (MUSAN ambient at 5–15 dB SNR), reverb (pyroomacoustics far-field at 1–3 m).
- Compound chains simulate real-world conditions like noisy room then phone codec.
- Output is HuggingFace AudioFolder-compatible, with per-sample PESQ, SNR, and NISQA scores in metadata.jsonl for correlating WER with signal quality.
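As an illustration of correlating WER with signal quality, the sketch below computes word-level WER and averages it per SNR bucket. The `file_name` key follows the HuggingFace AudioFolder convention; the `text` and `snr` keys are assumptions about the metadata.jsonl layout, so check the actual field names in your output before reusing this. The `hypotheses` mapping stands in for whatever transcripts a given STT vendor returns.

```python
import json
from collections import defaultdict


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def wer_by_snr_bucket(metadata_lines, hypotheses, bucket_db=5):
    """Average per-utterance WER grouped by the lower edge of each SNR bucket.

    `metadata_lines`: iterable of JSON strings (one per sample).
    `hypotheses`: dict mapping file_name -> vendor transcript.
    The "text" and "snr" keys are hypothetical field names.
    """
    buckets = defaultdict(list)
    for line in metadata_lines:
        row = json.loads(line)
        hyp = hypotheses.get(row["file_name"])
        if hyp is None:
            continue  # vendor returned nothing for this file
        lower_edge = int(row["snr"] // bucket_db) * bucket_db
        buckets[lower_edge].append(word_error_rate(row["text"], hyp))
    return {edge: sum(v) / len(v) for edge, v in sorted(buckets.items())}
```

Running this once per vendor on the same corpus gives a WER-versus-SNR curve for each candidate. Note that it averages per-utterance WER; pooling edits and reference words across a bucket is the other common convention and weights long utterances more heavily.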
Why It Matters
It lets teams compare STT vendors for voice agents on labeled audio that approximates production noise conditions, rather than on clean data that hides how each vendor degrades.