Qwen 3.5 35B on 8GB VRAM for local agentic workflow
A user achieves 42 tokens/sec generation with a quantized Qwen model on a consumer laptop GPU.
A developer has demonstrated that capable local AI coding assistants are now viable on consumer hardware. Using a heavily quantized build of Alibaba's Qwen 3.5 35B model (the Q4_K_M GGUF variant, run via llama.cpp), they achieved respectable performance on a laptop with a mobile RTX 4060 GPU limited to 8GB of VRAM. The configuration, which offloads up to 99 layers to the GPU and uses caching optimizations, yields 42 tokens per second of generation, fast enough for interactive coding. This setup replaces the cloud-based Google Antigravity (Gemini) service with a local, private alternative.
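The article does not give the exact command line, but a llama.cpp configuration matching the description might look like the following. The model filename, context size, cache types, and port are illustrative assumptions, not the user's actual settings:

```shell
# Hypothetical llama.cpp server launch approximating the described setup.
# Only the Q4_K_M quantization and -ngl 99 come from the article; the rest is assumed.
llama-server \
  -m ./Qwen3.5-35B-Q4_K_M.gguf \
  -ngl 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -c 8192 \
  --host 127.0.0.1 --port 8080
```

`-ngl 99` asks llama.cpp to offload up to 99 layers to the GPU; quantizing the KV cache (`--cache-type-k/v`) is one common way to stretch an 8GB VRAM budget. The server exposes an OpenAI-compatible endpoint that editor extensions can point at.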
The core of the workflow is the Cline extension in VS Code, configured to use different models for the planning and acting phases, a hallmark of agentic AI: 'kat-coder-pro' handles the 'Plan' step, and the local Qwen 3.5 handles the 'Act' step. This proof of concept shows that with careful model selection and quantization, plus the right tooling (llama.cpp), professionals can run sophisticated, multi-step AI agents entirely offline. It highlights a significant shift toward democratizing high-performance AI development tools and reducing reliance on paid, rate-limited cloud APIs for core programming tasks.
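Cline's Plan/Act split is configured in the extension's UI, but the routing idea can be sketched as a tiny dispatcher. The model identifiers follow the article; the `select_model` helper and the endpoint URLs are hypothetical illustrations:

```python
# Minimal sketch of per-phase model routing, mirroring Cline's Plan/Act modes.
# Endpoint URLs and the exact model identifier strings are assumptions.
PHASE_MODELS = {
    "plan": {"model": "kat-coder-pro",
             "endpoint": "https://api.example.com/v1"},   # cloud planner (endpoint assumed)
    "act":  {"model": "qwen3.5-35b-q4_k_m",
             "endpoint": "http://127.0.0.1:8080/v1"},     # local llama.cpp server
}

def select_model(phase: str) -> dict:
    """Return the model/endpoint pair for a given agent phase."""
    try:
        return PHASE_MODELS[phase]
    except KeyError:
        raise ValueError(f"unknown phase: {phase!r}")

# 'Act' traffic stays on the local, offline endpoint:
print(select_model("act")["endpoint"])
```

The point of the split is that the heavier reasoning ("Plan") can use a stronger remote model while the high-volume edit loop ("Act") stays local and private.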
- Alibaba's Qwen 3.5 35B model runs locally at 42 tokens/sec on an 8GB-VRAM laptop GPU as a quantized GGUF under llama.cpp.
- The workflow uses the Cline VS Code extension for an agentic setup, separating 'Plan' and 'Act' tasks between models.
- This provides a functional, offline alternative to cloud services like Google's Antigravity, bypassing API limits and costs.
Why It Matters
It proves that effective, agentic AI coding assistants can run offline on standard developer laptops, reducing dependency on cloud APIs.