Open Source

Running Qwen3.6-35B-A3B Locally for a Coding Agent: My Setup & Working Config

Local AI coding agent hits 128K context on Apple Silicon

Deep Dive

A developer has configured the Qwen3.6-35B-A3B model to run locally on a MacBook Pro M2 Max (64GB unified memory) using llama.cpp and the pi coding agent. The setup uses unsloth's custom UD-Q5_K_XL quantization, which compresses the 35B-parameter model to approximately 19GB, small enough to fit alongside its KV cache in 64GB of unified memory. The server is configured with a 131,072-token context window and support for up to 32,768 output tokens per generation, enabling the agent to handle long documents and extended coding sessions without resorting to context shifting.
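A launch command along the following lines would reproduce the server side of this setup. The Hugging Face repo and GGUF filenames are illustrative (the article doesn't give exact paths), and the batch and sampling flags anticipate the settings described in the next paragraph; all flags are standard llama.cpp options.

```bash
# Fetch unsloth's UD-Q5_K_XL quant and serve it with llama.cpp.
# Repo and file names are illustrative, not confirmed by the article.
huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
  Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf --local-dir ~/models

# Serve with the full 131,072-token context and a 32,768-token output cap.
# -ngl 99 offloads all layers to Metal; -b/-ub set the logical and physical
# batch sizes (4096, per the config described below); sampling defaults
# follow unsloth's recommendations; --jinja uses the model's chat template.
llama-server \
  -m ~/models/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf \
  -c 131072 -n 32768 -ngl 99 \
  -b 4096 -ub 4096 \
  --temp 0.6 --top-p 0.95 --top-k 20 \
  --jinja --port 8080
```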

The pi agent connects to llama-server through an OpenAI-compatible API, configured in a single JSON file at ~/.pi/agent/models.json. The developer applies unsloth's official sampling parameters (temperature 0.6, top-p 0.95, top-k 20) and enables the preserve_thinking flag so the model's reasoning blocks are retained rather than stripped. Both the logical batch size and the physical micro-batch size (llama.cpp's -b and -ub flags) are set to 4096 to speed up prompt processing. This configuration demonstrates that high-quality local AI coding assistants are now practical on consumer-grade hardware, offering a privacy-preserving alternative to cloud-based solutions.
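The article doesn't reproduce the models.json file itself, so the following is only a sketch: the field names are assumptions about pi's schema (its actual keys may differ), with the values taken from the summary above.

```json
{
  "models": [
    {
      "id": "qwen3.6-35b-a3b",
      "base_url": "http://127.0.0.1:8080/v1",
      "api_key": "not-needed",
      "context_window": 131072,
      "max_tokens": 32768,
      "temperature": 0.6,
      "top_p": 0.95,
      "top_k": 20,
      "preserve_thinking": true
    }
  ]
}
```

Once llama-server is running, the OpenAI-compatible endpoint can be smoke-tested independently of pi:

```bash
# llama-server hosts a single model, so the "model" field is informational.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3.6-35b-a3b",
       "messages": [{"role": "user", "content": "Write a Python hello world."}],
       "temperature": 0.6, "top_p": 0.95, "max_tokens": 256}'
```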

Key Points
  • Qwen3.6-35B-A3B runs locally on a MacBook Pro M2 Max with 64GB unified memory using llama.cpp
  • Supports 131K context window and 32K output tokens for long coding sessions
  • Uses unsloth's UD-Q5_K_XL quantization (~19GB) with official sampling parameters

Why It Matters

Enables private, local AI coding assistance on consumer hardware, with no cloud dependency and no source code leaving the machine.