Follow-up: Qwen3.6-27B on 1× RTX 3090 — pushing to ~218K context + ~50–66 TPS, tool calls now stable (PN12 fix)
A single 3090 can now handle 218K context and complex tool agents without crashing
A developer has demonstrated that Qwen3.6-27B, a 27B-parameter open-source language model, can run on a single RTX 3090 with a context length of ~218K tokens and throughput of 50–66 tokens per second (TPS) for text, and ~198K with vision at similar speeds. More importantly, the setup now handles tool-agent workloads that produce 25K-token outputs without running out of memory, a critical milestone for local AI agents.
The key breakthrough was debugging the Genesis PN12 patch on vLLM dev205+. The patch was supposed to mitigate memory pressure but was silently failing to apply because of anchor drift in the code path it hooks. Once that was fixed, the tool-prefill OOM errors disappeared, enabling both high context and stable execution. Limitations remain: a second memory cliff appears around 50–60K tokens for single-prompt workloads on one GPU, though it does not occur with tensor parallelism across two 3090s. Performance depends heavily on quantization and configuration, but the results show that consumer-grade hardware can now handle advanced agentic tasks.
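For readers wanting to try a similar setup, a single-GPU long-context launch might look like the sketch below. The flags are standard vLLM options, but the model path, quantization scheme, and exact limits are assumptions for illustration, not the developer's actual command (which was not published):

```shell
# Hypothetical single-3090 launch; model name, quant, and limits are
# illustrative assumptions, not the poster's verified configuration.
vllm serve Qwen/Qwen3.6-27B \
  --max-model-len 218000 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --quantization awq
```

On a 24 GB card, aggressive weight quantization plus an FP8 KV cache is generally what makes context lengths of this order feasible at all; the exact ceiling depends on the model's architecture and the serving configuration.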
- Achieves ~218K context at 50–66 TPS on a single RTX 3090 for text; ~198K with vision at 51–68 TPS
- Tool calls with 25K-token outputs now stable after fixing anchor drift in the Genesis PN12 patch on vLLM dev205+
- A memory cliff remains around 50–60K tokens for single-prompt workloads on one GPU, but is avoided with tensor-parallel multi-GPU setups
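The figures above can be sanity-checked with back-of-envelope KV-cache arithmetic. The sketch below uses assumed architecture parameters (layer count, KV heads, head dimension are illustrative guesses, not Qwen3.6-27B's published config) to show why context length is so sensitive to KV-cache precision on a 24 GB card:

```python
# Back-of-envelope KV-cache sizing for long-context serving.
# The architecture numbers are ASSUMPTIONS for illustration only;
# the real Qwen3.6-27B config (layers, KV heads, head dim) may differ.

def kv_cache_bytes(ctx_tokens: int,
                   n_layers: int = 48,
                   n_kv_heads: int = 8,    # grouped-query attention
                   head_dim: int = 128,
                   bytes_per_elem: int = 1) -> int:  # 1 byte = FP8 cache
    """Total K+V cache bytes for one sequence across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem

GIB = 1024 ** 3
for ctx in (50_000, 218_000):
    print(f"{ctx:>7,} tokens -> ~{kv_cache_bytes(ctx) / GIB:.1f} GiB KV cache")
```

Under these assumed parameters, the KV cache alone grows by roughly 100 KB per token, which illustrates both why a cliff can appear in the 50–60K range once weights plus cache approach the 24 GB budget, and why splitting the cache across two GPUs pushes the cliff out of reach.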
Why It Matters
Brings high-context agentic AI workloads to consumer GPUs, enabling local tool-use agents with large working memory