Open Source

Local LLM dev's sub-agent fork runs Qwen with 10GB VRAM, 200k context

A VRAM-poor developer hacked together a sub-agent system that works with 10GB of VRAM and a single LLM slot on llama.cpp server.

Deep Dive

A developer known as sisyphus-cycle shared a custom fork of a sub-agent repository for the pi coding agent, designed for severely VRAM-limited local setups. Most existing sub-agent extensions fail on a machine with only 10GB of VRAM and a single LLM slot on llama.cpp server, because they assume multiple model instances or large context buffers. To get around this, the developer paired the partially vibe-coded fork with the Qwen3.6-35b-A3B model and used multi-token prediction (MTP) from the main llama.cpp branch. They report solid performance: 175–200k context at Q8 KV quantization, 200–300 tokens per second for prompt processing, and 25–40 tokens per second for generation, depending on draft hit rates.
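For a sense of scale, the reported prompt-processing rate implies that reading a full context from scratch takes many minutes. The Python sketch below works through that arithmetic. It assumes throughput stays constant across the whole context, which is a simplification (real rates usually fall as context grows), and the function name is illustrative rather than anything from the fork.

```python
def cold_prefill_seconds(context_tokens: int, pp_tokens_per_s: float) -> float:
    """Seconds needed to process `context_tokens` from scratch at a constant rate."""
    if pp_tokens_per_s <= 0:
        raise ValueError("throughput must be positive")
    return context_tokens / pp_tokens_per_s


# Context sizes and prompt-processing rates taken from the reported ranges.
for ctx in (175_000, 200_000):
    for rate in (200, 300):
        minutes = cold_prefill_seconds(ctx, rate) / 60
        print(f"{ctx} tokens at {rate} tok/s: about {minutes:.1f} min")
```

Under these assumptions a full pass over the context lands at roughly 10 to 17 minutes, which is the cost a sub-agent system pays each time the main context has to be rebuilt.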

The key innovation is that when a sub-agent finishes, the main agent's context does not have to be fully reprocessed. Currently, the fork saves and loads slots via `--slot-save-path`, though the resulting `.bin` files are large. The developer plans to add an option to spawn sub-agents with no prior context to save VRAM. This work is especially relevant for anyone who uses the pi coding agent as a harness for a single local LLM and wants sub-agent collaboration without expensive hardware.
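The fork's own code is not shown here, so the following is only a sketch of how slot save and restore can be driven through llama.cpp server's HTTP slot endpoints. It assumes the server was started with `--slot-save-path` pointing at a writable directory and is listening on `127.0.0.1:8080`. The helper names and the `main-agent.bin` filename are hypothetical.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://127.0.0.1:8080"  # assumption: local llama.cpp server address


def slot_action_request(action: str, slot_id: int = 0,
                        filename: Optional[str] = None,
                        base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for /slots/{id}?action=save|restore|erase."""
    if action not in ("save", "restore", "erase"):
        raise ValueError(f"unsupported slot action: {action}")
    if action in ("save", "restore") and not filename:
        raise ValueError(f"'{action}' needs a filename")
    # The filename is resolved relative to the --slot-save-path directory.
    body = {"filename": filename} if filename else {}
    return urllib.request.Request(
        f"{base_url}/slots/{slot_id}?action={action}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_slot_action(action: str, slot_id: int = 0,
                    filename: Optional[str] = None) -> dict:
    """Send the request and return the server's JSON reply."""
    request = slot_action_request(action, slot_id, filename)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def with_sub_agent(run_sub_agent, snapshot: str = "main-agent.bin"):
    """Snapshot the main agent's slot, run a sub-agent, then restore the slot."""
    run_slot_action("save", filename=snapshot)       # keep the main context's KV cache
    try:
        return run_sub_agent()                       # sub-agent reuses the only slot
    finally:
        run_slot_action("restore", filename=snapshot)  # skip a full reprocess
```

A caller would wrap each sub-agent invocation, for example `with_sub_agent(lambda: ask_sub_agent(task))`, where `ask_sub_agent` stands in for whatever function sends the sub-agent's prompt to the server. The trade-off is the one noted above: each snapshot is written to disk as a large `.bin` file.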

Key Points
  • Designed for 10GB VRAM and a single LLM slot via llama.cpp server
  • Uses Qwen3.6-35b-A3B with Q8 KV cache for 175–200k context
  • Achieves 200–300 tokens per second prompt processing and 25–40 tokens per second generation with multi-token prediction enabled

Why It Matters

Enables local multi-agent coding workflows on modest consumer GPUs, democratizing sub-agent use without cloud costs.