Open Source

Pushing a 5-Year-Old 6GB VRAM Laptop to Its Limits: Qwen3.6-35B-A3B

A 6GB RTX 2060 Max-Q laptop runs a 35B MoE model at 23 tokens per second.

Deep Dive

In a viral Reddit post, user abhinand05 documents how they pushed their 5-year-old ASUS ROG Zephyrus G14 (Ryzen 7 8C/16T, 24GB DDR4, RTX 2060 Max-Q 6GB) to run Qwen3.6-35B-A3B, a 35B-parameter Mixture-of-Experts model in which only about 3B parameters are active per token (the "A3B" suffix), which is what makes mostly-CPU inference feasible. Using llama-server with carefully tuned flags (aggressive CPU offloading, MoE expert weights for 36 layers kept on the CPU, a Q8_0-quantized KV cache, and a 64K-128K context window), they achieve roughly 23 tokens per second plugged in and over 10 t/s on battery, making the model genuinely usable for conversational AI. The setup also leverages Tom's fork to reach the 128K context length, demonstrating that even budget hardware from 2020 can handle state-of-the-art open models.
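
For readers who want to attempt something similar, a minimal sketch of such a launch (wrapped in Python's subprocess so it is self-contained) might look like the following. The model file name, flag values, and port are illustrative assumptions based on the post's description, not the author's exact command:

    import subprocess

    # Hypothetical launch of llama.cpp's llama-server approximating the setup
    # described in the post; every value below is an illustrative assumption.
    cmd = [
        "llama-server",
        "-m", "Qwen3.6-35B-A3B-Q4_K_M.gguf",  # assumed GGUF quantization and file name
        "-c", "65536",                         # 64K context (the post cites 64K-128K)
        "-ngl", "99",                          # offload as many layers as fit in 6GB VRAM
        "--n-cpu-moe", "36",                   # keep MoE expert weights of 36 layers on the CPU
        "--cache-type-k", "q8_0",              # Q8_0-quantized KV cache to save memory
        "--cache-type-v", "q8_0",              # (quantized V cache may also require flash attention)
        "-t", "16",                            # use the Ryzen 7's 16 hardware threads
        "--host", "127.0.0.1",
        "--port", "8080",
    ]
    subprocess.run(cmd, check=True)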

The community response highlights a broader trend: open-weight models are becoming increasingly accessible. The user shared their full configuration and a blog post detailing the 'localmaxxing' journey, emphasizing how far open source has come. With proper quantization (GGUF) and aggressive CPU offloading, even a modest 6GB laptop can run a 35B model. This challenges the assumption that heavy AI inference requires expensive cloud GPUs, opening up local AI use for developers, students, and hobbyists with older equipment.
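
As a rough illustration of how such a local server is then used, the sketch below queries llama-server's OpenAI-compatible chat endpoint with Python's standard library and estimates end-to-end throughput. The port and model name match the assumed launch command above and are not taken from the original post:

    import json
    import time
    import urllib.request

    # Hypothetical client for the local llama-server assumed above.
    payload = {
        "model": "qwen3.6-35b-a3b",  # illustrative; the server answers with whatever model it loaded
        "messages": [{"role": "user", "content": "Explain mixture-of-experts in two sentences."}],
        "max_tokens": 256,
    }
    req = urllib.request.Request(
        "http://127.0.0.1:8080/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

    start = time.time()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.time() - start

    print(body["choices"][0]["message"]["content"])
    # End-to-end rate includes prompt processing, so it will read lower than
    # the ~23 t/s decode speed reported in the post.
    tokens = body["usage"]["completion_tokens"]
    print(f"~{tokens / elapsed:.1f} tokens/sec end-to-end")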

Key Points
  • Runs Qwen3.6-35B-A3B at 23 t/s on a 2020 laptop with 6GB VRAM and 24GB RAM
  • Uses llama-server with CPU offloading, a Q8_0 KV cache, and MoE expert weights for 36 layers kept on the CPU
  • Achieves 10+ t/s on battery and supports 128K context via Tom's fork

Why It Matters

Proves modern 35B open models can run locally on 5-year-old laptops, democratizing AI access.