You can do a lot with an old mobile GPU these days
A complete voice AI system with Qwen3.5-9B, Whisper, and Orpheus TTS runs entirely on a 2021 laptop GPU.
A developer has created a fully local, voice-based conversational AI that runs on a single RTX 3080 Mobile GPU, laptop hardware released in 2021. The project, built from scratch in C++ for speed and minimal dependencies, integrates three specialized models: the Qwen3.5-9B LLM for conversation, Whisper-small for accurate speech recognition, and Orpheus-3B for emotive text-to-speech with the popular 'Tara' voice. All components are optimized through custom GGUF quantization (formats like Q6_K_XL) and run within a 16GB VRAM budget, an unusually tight fit for a three-model pipeline.
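The write-up doesn't name the inference runtime, but GGUF is the native format of llama.cpp, so a loader along these lines is plausible. The sketch below is a minimal, hypothetical version: the model filename and layer-offload count are placeholders, and the API names track recent llama.cpp releases rather than the project's actual code.

```cpp
// Minimal sketch: loading a GGUF-quantized LLM fully onto the GPU via
// llama.cpp's C API. The filename and layer count are placeholders, and
// the API names follow recent llama.cpp releases, not the project's code.
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 999;  // offload every layer into the 3080's VRAM

    llama_model * model = llama_model_load_from_file(
        "qwen3.5-9b-q6_k_xl.gguf", mparams);  // hypothetical filename
    if (!model) { fprintf(stderr, "failed to load model\n"); return 1; }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 49152;  // the 49,152-token window described below

    llama_context * ctx = llama_init_from_model(model, cparams);
    if (!ctx) { fprintf(stderr, "failed to create context\n"); return 1; }

    // ... tokenize input, llama_decode() in a loop, sample replies ...

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

Offloading every layer keeps generation entirely on the GPU, which is exactly why the 16GB budget becomes the binding constraint.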
The system architecture is tightly optimized: a custom 'orpheus-speak' C++ app drives a community-sourced ONNX decoder for rapid audio generation and keeps that decoder warm between utterances. The LLM runs with a 49,152-token context window, enough for hours of conversation, and an A/B-tested system prompt tuned for natural engagement. Latency does grow with longer responses, but the project shows that responsive, real-time voice AI is achievable on older consumer hardware, challenging the notion that such capabilities require cloud infrastructure or the latest GPUs.
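The "warm decoder" trick is worth making concrete. In ONNX Runtime's C++ API, the expensive step is building the Ort::Session (loading and optimizing the graph); holding one session for the app's lifetime means each utterance only pays for Run(). The class below is a sketch under that assumption; the model path and the tensor names "codes" and "audio" are hypothetical, since the actual orpheus-speak internals aren't published here.

```cpp
// Sketch of keeping an ONNX decoder "warm": the Ort::Session is built once
// and reused for every utterance, so per-call cost is just Run(). Model
// path and tensor names ("codes", "audio") are hypothetical placeholders.
#include <onnxruntime_cxx_api.h>
#include <array>
#include <cstdint>
#include <vector>

class WarmDecoder {
public:
    explicit WarmDecoder(const char * model_path)
        : env_(ORT_LOGGING_LEVEL_WARNING, "orpheus-speak"),
          session_(nullptr) {
        Ort::SessionOptions opts;
        OrtCUDAProviderOptions cuda{};            // run the decoder on the GPU
        opts.AppendExecutionProvider_CUDA(cuda);
        session_ = Ort::Session(env_, model_path, opts);  // load once, stay warm
    }

    // Decode one utterance's acoustic codes into PCM samples.
    std::vector<float> decode(std::vector<int64_t> & codes) {
        Ort::MemoryInfo mem =
            Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
        std::array<int64_t, 2> shape{1, static_cast<int64_t>(codes.size())};
        Ort::Value input = Ort::Value::CreateTensor<int64_t>(
            mem, codes.data(), codes.size(), shape.data(), shape.size());

        const char * in_names[]  = {"codes"};     // hypothetical tensor names
        const char * out_names[] = {"audio"};
        auto outputs = session_.Run(Ort::RunOptions{nullptr},
                                    in_names, &input, 1, out_names, 1);

        float * pcm = outputs[0].GetTensorMutableData<float>();
        size_t  n   = outputs[0].GetTensorTypeAndShapeInfo().GetElementCount();
        return std::vector<float>(pcm, pcm + n);
    }

private:
    Ort::Env     env_;      // must outlive the session
    Ort::Session session_;
};
```

A single WarmDecoder constructed at startup can then serve every reply without reloading or re-optimizing the graph.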
- Runs three AI models (Qwen3.5-9B, Whisper-small, Orpheus-3B) concurrently on a single RTX 3080 Mobile GPU within its 16GB of VRAM (a rough budget sketch follows this list)
- Built entirely in C++ around GGUF-quantized models and a community-sourced ONNX decoder, for minimal latency and zero Python dependencies
- Features a 49K-token context window and emotive TTS, proving sophisticated voice AI is viable on 2021-era consumer hardware
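A back-of-the-envelope check suggests the 16GB figure is plausible. Everything below except the context size, the quant name, and Whisper-small's ~244M parameters is an assumption made for illustration: the average bits per weight for Q6_K_XL, a ~4.5-bit quant for Orpheus, and a guessed GQA layout for the KV cache.

```cpp
// Back-of-the-envelope VRAM check against the 16 GB budget. Every number
// marked "assumed" is a guess for illustration; the article gives only the
// quant name (Q6_K_XL), the context size (49,152), and the model sizes.
#include <cstdio>

int main() {
    const double GiB = 1024.0 * 1024.0 * 1024.0;

    // Weights: Q6_K_XL averages roughly 6.5-6.8 bits/weight (assumed 6.6).
    double llm_weights = 9e9    * 6.6  / 8.0 / GiB;  // Qwen3.5-9B
    double tts_weights = 3e9    * 4.5  / 8.0 / GiB;  // Orpheus-3B, assumed ~Q4
    double asr_weights = 244e6  * 16.0 / 8.0 / GiB;  // Whisper-small, fp16

    // KV cache at 49,152 tokens, assuming 36 layers, 8 GQA KV heads,
    // head_dim 128, and an 8-bit cache -- all hypothetical.
    double kv = 2.0 * 36 * 49152 * 8 * 128 * 1.0 / GiB;

    printf("LLM weights  ~%.1f GiB\n", llm_weights);
    printf("TTS weights  ~%.1f GiB\n", tts_weights);
    printf("ASR weights  ~%.1f GiB\n", asr_weights);
    printf("KV cache     ~%.1f GiB\n", kv);
    printf("Total        ~%.1f GiB of 16 GiB\n",
           llm_weights + tts_weights + asr_weights + kv);
    return 0;
}
```

Under these guesses the pipeline lands around 12GiB, leaving headroom for activations, compute buffers, and the CUDA context, consistent with the article's claim.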
Why It Matters
Demonstrates that powerful, local voice AI assistants are accessible without expensive cloud subscriptions or the latest hardware.