Dual DGX Sparks vs Mac Studio M3 Ultra 512GB: Running Qwen3.5 397B locally on both. Here's what I found.
A developer spent $2K/month on the Claude API, then tested two ~$10K local setups running Qwen3.5 397B.
Faced with a $2,000 monthly bill for Claude API tokens, a developer invested in two roughly $10,000 local hardware setups to run the 397-billion-parameter Qwen3.5 model. The first was an Apple Mac Studio M3 Ultra with 512GB of unified memory, running a 6-bit MLX quantization that loads as a 323GB model. It generated 30-40 tokens per second, with the chip's ~800 GB/s memory bandwidth keeping token generation smooth, but prefill was slow and the GPU lacked the compute for parallel workloads like embeddings.
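For scale, loading a pre-quantized model under MLX takes only a few lines. Below is a minimal sketch using the mlx-lm package; the model repo name is a placeholder, since the article doesn't say which 6-bit conversion was used.

```python
# Minimal sketch: running a 6-bit MLX quantization on Apple silicon.
# The repo name below is hypothetical -- substitute the actual 6-bit
# Qwen3.5 397B conversion. mlx-lm maps the weights into unified memory.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.5-397B-6bit")  # placeholder repo

prompt = "Explain the trade-off between memory bandwidth and prefill compute."
response = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```

With verbose=True, mlx-lm prints prompt and generation tokens-per-second directly, which is presumably where figures like the 30-40 tok/s above come from.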
The second setup was a pair of NVIDIA DGX Sparks running an INT4 AutoRound quantization under vLLM, with roughly 98GB of weights sharded onto each of the two 128GB nodes. It generated 27-28 tokens per second, and its CUDA tensor cores and vLLM kernels delivered significantly faster prefill and batch processing. Getting there, however, was notoriously difficult: unstable inter-node networking, memory-ceiling tuning, and thermal throttling took days to stabilize.
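For illustration, here is what that sharding looks like through vLLM's Python API. This is a sketch under assumptions: a Ray cluster already spans both Sparks (vLLM's multi-node backend), and the checkpoint path is a stand-in for whatever AutoRound export the developer actually used.

```python
# Sketch: serving an INT4 AutoRound checkpoint with vLLM across two nodes.
# Assumes a Ray cluster connects both DGX Sparks; the model path and the
# memory-utilization value are placeholders, not the article's exact config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/qwen3.5-397b-int4-autoround",  # hypothetical local path
    tensor_parallel_size=2,        # one shard per Spark, ~98GB of weights each
    gpu_memory_utilization=0.90,   # leave headroom under the 128GB ceiling
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the hybrid RAG design."], params)
print(outputs[0].outputs[0].text)
```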
Ultimately, the developer kept both systems and built a hybrid architecture. The Mac Studio handles primary Qwen3.5 397B inference, giving the model the full 512GB memory pool. The dual DGX Sparks run the separate Qwen3 Embedding 8B and Reranker 8B models for a Retrieval-Augmented Generation (RAG) pipeline, which avoids memory contention on the Mac and plays to CUDA's strength on batch workloads. The machines communicate over Tailscale, forming a cost-effective personal AI assistant that replaces the expensive cloud API.
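Stitching the two boxes together is straightforward if each exposes an OpenAI-compatible endpoint, which both vLLM and MLX-based servers can do. The sketch below assumes hypothetical Tailscale hostnames, ports, and model names; the article doesn't spell out the actual wiring.

```python
# Sketch of the hybrid split: embeddings on the Sparks, generation on the
# Mac Studio. Hostnames are hypothetical Tailscale MagicDNS names; both
# servers are assumed to expose OpenAI-compatible /v1 routes.
from openai import OpenAI

sparks = OpenAI(base_url="http://dgx-sparks:8000/v1", api_key="local")  # vLLM
mac = OpenAI(base_url="http://mac-studio:8080/v1", api_key="local")     # MLX server

def answer(question: str, retrieved_chunks: list[str]) -> str:
    # Embed the query on the Sparks (Qwen3 Embedding 8B assumed served there);
    # in a real pipeline this vector would drive search over a local index.
    query_vec = sparks.embeddings.create(
        model="qwen3-embedding-8b", input=[question]
    ).data[0].embedding
    context = "\n".join(retrieved_chunks[:3])  # stand-in for ranked results
    # Generate the final answer on the Mac Studio's 397B model.
    resp = mac.chat.completions.create(
        model="qwen3.5-397b",
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content
```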
- Mac Studio M3 Ultra (512GB) ran Qwen3.5 397B at 30-40 tok/s with ~800 GB/s bandwidth but slow compute for embeddings.
- Dual DGX Sparks ran the same model at 27-28 tok/s with faster CUDA compute but a brutally complex setup process.
- Final hybrid architecture uses Mac for main model inference and Sparks for dedicated RAG pipeline tasks (embedding, reranking).
Why It Matters
This real-world test reveals the practical trade-offs for professionals considering high-end local AI deployment to control costs and latency.