Consolidated my homelab from 3 models down to one 122B MoE — benchmarked everything, here's what I found
A 122B MoE model running on consumer hardware handles email, coding, and vision tasks with 39GB of headroom.
A tech enthusiast has consolidated a multi-model local AI setup into a single, more powerful Mixture-of-Experts (MoE) model, demonstrating a viable path for efficient homelab deployments. After extensive benchmarking (45 tests judged by Claude Opus), the author replaced three specialized text models (totaling ~44GB) with the Qwen3.5-122B-A10B UD-IQ3_S. This 122B-parameter model, with only 10B parameters active per token, runs on a Strix Halo system (Ryzen AI, 128GB RAM) using Vulkan/RADV with 96 GiB of shared GPU memory. It consumes 44GB of VRAM, delivers 27.4 tokens/second, and scored 440 out of 500 in the performance evaluations, all while leaving 39GB of headroom.
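The reported speed fits simple napkin math: during decode, only the ~10B active parameters are streamed from memory per token, so the active set (not the full 122B) bounds throughput. A rough sketch of that arithmetic — the bits-per-weight figure is an assumption for an IQ3-class quant, not a number from the post:

```python
# Napkin math for MoE decode speed: per-token memory traffic is dominated
# by the ACTIVE parameters, not the full model size. The bits-per-weight
# constant is an illustrative assumption, not a measurement from the post.

ACTIVE_PARAMS = 10e9     # ~10B parameters active per token (the "A10B")
BITS_PER_WEIGHT = 3.2    # rough effective size of an IQ3-class quant
TOKENS_PER_SEC = 27.4    # decode speed reported in the post

def bytes_read_per_token(params: float, bpw: float) -> float:
    """Bytes of weight data streamed from memory for one decoded token."""
    return params * bpw / 8

def required_bandwidth_gbs(params: float, bpw: float, tok_s: float) -> float:
    """Effective memory bandwidth (GB/s) implied by a given decode speed."""
    return bytes_read_per_token(params, bpw) * tok_s / 1e9

if __name__ == "__main__":
    bw = required_bandwidth_gbs(ACTIVE_PARAMS, BITS_PER_WEIGHT, TOKENS_PER_SEC)
    print(f"~{bw:.0f} GB/s effective weight bandwidth")
```

Under these assumptions the decode speed implies roughly 110 GB/s of effective weight bandwidth, comfortably within the ~256 GB/s LPDDR5X bandwidth usually quoted for Strix Halo — which is why a 122B MoE can feel like running a ~10B dense model.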
The new single-model setup powers a diverse workload through containerization on Proxmox. It handles email classification in under 2 seconds via a cron job, manages a personal food app for recipes and meal planning, runs a finance dashboard for taxes and portfolio tracking, and operates alongside a separate 8B vision model for camera-based person detection. A key finding: the 3-bit IQ3 quantization performed nearly identically to the 4-bit Q4_K_M in the benchmarks (440 vs 438) while using half the VRAM and running faster, validating it for long-term deployment. The consolidation shows that a large MoE model can handle concurrent tasks efficiently, simplifying the architecture from a complex multi-model router to a unified system.
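The email-classification piece of a setup like this can be a short cron-driven script hitting the model's OpenAI-compatible endpoint (llama.cpp's server exposes one at `/v1/chat/completions`). A minimal sketch — the URL, labels, and prompt are illustrative assumptions, not details from the post:

```python
import json
import urllib.request

# Hypothetical local endpoint; llama.cpp's llama-server exposes an
# OpenAI-compatible /v1/chat/completions route. Adjust host/port to taste.
API_URL = "http://localhost:8080/v1/chat/completions"
LABELS = ["urgent", "personal", "newsletter", "spam"]  # illustrative labels

def build_payload(subject: str, body: str) -> dict:
    """Construct a single-label classification request for the local model."""
    prompt = (
        f"Classify this email into exactly one of {LABELS}.\n"
        f"Subject: {subject}\nBody: {body}\n"
        "Answer with the label only."
    )
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,   # deterministic labels
        "max_tokens": 8,    # a label is short; keeps latency low
    }

def classify(subject: str, body: str) -> str:
    """Send one email to the local model and return the predicted label."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(subject, body)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        out = json.load(resp)
    return out["choices"][0]["message"]["content"].strip().lower()

if __name__ == "__main__":
    print(classify("50% off today only", "Limited-time offer, act now."))
```

With `temperature` pinned to 0 and only a handful of output tokens requested, a sub-2-second turnaround per email on a 27 tok/s setup is plausible.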
- Consolidated three models (totaling ~44GB) into one Qwen3.5-122B-A10B MoE model, using 44GB VRAM and achieving 27.4 tok/s.
- The 3-bit IQ3 quantization matched 4-bit Q4_K_M performance (440 vs 438 score) while using half the VRAM at faster speeds.
- The single 122B MoE model now handles concurrent tasks including email processing (<2s), coding, finance, and meal planning, simplifying the homelab stack.
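The 440/500 figure implies a judge model assigning per-test scores that are then summed. A minimal sketch of that aggregation step — the "Score: N" reply format and per-test scale are assumptions; the post only states that 45 tests were judged by Claude Opus:

```python
import re

# Minimal LLM-as-judge aggregation sketch. The "Score: N" reply format
# is an assumption; the post only says benchmark answers were scored by
# a judge model and totaled.

def parse_score(judge_reply: str) -> int:
    """Pull the first integer after 'Score:' out of a judge's free-text reply."""
    match = re.search(r"Score:\s*(\d+)", judge_reply)
    if match is None:
        raise ValueError(f"no score found in: {judge_reply!r}")
    return int(match.group(1))

def total_score(judge_replies: list[str]) -> int:
    """Sum per-test scores into a single benchmark total."""
    return sum(parse_score(r) for r in judge_replies)

if __name__ == "__main__":
    replies = ["Good answer. Score: 9", "Partially wrong. Score: 6"]
    print(total_score(replies))  # 15
```

Keeping the judge's reply format machine-parseable like this is what makes a 45-test sweep cheap to rerun when comparing quantizations.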
Why It Matters
Shows professionals how to run powerful, multi-purpose AI agents locally on consumer hardware, simplifying deployment and reducing operational overhead.