Open Source

User achieves 2000 TPS with Qwen 3.5 27B on RTX 5090 for document classification

r/LocalLLaMA March 14, 2026

⚡A developer hit roughly 2000 tokens per second by tuning a Qwen 3.5 27B inference setup for a high-throughput classification task.

Deep Dive

A developer has reported inference throughput of approximately 2000 tokens per second (TPS) using Alibaba's Qwen 3.5 27B model. The test was a specialized document classification task: the server processed 1,214,072 input tokens and produced only 815 output tokens, classifying 320 unique markdown documents in a 10-minute window. Because nearly all of those tokens are prompt input, the figure mostly reflects prompt-processing speed rather than generation speed. The setup ran the unsloth/Qwen3.5-27B-UD-Q5_K_XL.gguf quantized model on the official llama.cpp server-cuda13 image with an RTX 5090 GPU. The configuration was tuned for a specific high-volume workflow in which every document is distinct, so caching offers no benefit.
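The headline figure can be checked from the reported totals. The sketch below only redoes that arithmetic; the function and variable names are illustrative, and it assumes the 10-minute window is exactly 600 seconds.

```python
# Sanity check of the reported throughput, using only the totals quoted above.
# Assumption: the "10-minute window" is treated as exactly 600 seconds.

def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average token throughput over a wall-clock interval."""
    return tokens / seconds

INPUT_TOKENS = 1_214_072   # reported prompt tokens
OUTPUT_TOKENS = 815        # reported generated tokens
DOCUMENTS = 320            # reported number of classified documents
WINDOW_SECONDS = 10 * 60

prompt_tps = tokens_per_second(INPUT_TOKENS, WINDOW_SECONDS)                  # about 2023.5
total_tps = tokens_per_second(INPUT_TOKENS + OUTPUT_TOKENS, WINDOW_SECONDS)   # about 2024.8
avg_input_per_doc = INPUT_TOKENS / DOCUMENTS                                  # about 3794 tokens
avg_output_per_doc = OUTPUT_TOKENS / DOCUMENTS                                # about 2.5 tokens

print(f"prompt throughput: {prompt_tps:.1f} tok/s")
print(f"total throughput:  {total_tps:.1f} tok/s")
print(f"average input per document:  {avg_input_per_doc:.0f} tokens")
print(f"average output per document: {avg_output_per_doc:.2f} tokens")
```

With only about 2.5 output tokens per document, the run is dominated by prompt processing, which is why the input and total throughput figures are nearly identical.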

Critical optimizations included disabling the unused vision modules, enforcing a "no thinking" mode so the model answers without first generating reasoning tokens, and carefully managing VRAM so that the entire context fits during inference. The developer reduced the context window to 128k and set parallelism to match a batch size of 8 requests, which allocates 16k of context per request. This configuration processed the vast majority of documents at high speed, setting aside only the largest 1% for separate handling. These numbers are situational and do not come from formal benchmarks, but they show how much throughput a carefully configured quantized model can deliver for targeted applications such as bulk document analysis and classification.
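The context budget and the handling of oversized documents can be sketched as follows. This is a minimal illustration, not the developer's actual code: it assumes "128k" means 128 × 1024 tokens, and `OUTPUT_RESERVE`, `split_by_fit`, `classify_all`, and the `classify` callable are hypothetical names introduced here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

TOTAL_CTX = 128 * 1024                 # assumption: "128k" = 131072 tokens
PARALLEL = 8                           # concurrent requests, per the article
PER_SLOT_CTX = TOTAL_CTX // PARALLEL   # 16384 tokens ("16k") per request
OUTPUT_RESERVE = 64                    # hypothetical headroom for the short label output

def split_by_fit(
    token_counts: Dict[str, int],
    limit: int = PER_SLOT_CTX - OUTPUT_RESERVE,
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Separate documents that fit in one request's context from oversized ones."""
    fits = {name: n for name, n in token_counts.items() if n <= limit}
    oversized = {name: n for name, n in token_counts.items() if n > limit}
    return fits, oversized

def classify_all(
    names: List[str],
    classify: Callable[[str], str],
    workers: int = PARALLEL,
) -> Dict[str, str]:
    """Run a caller-supplied classify function with at most `workers` requests in flight."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(names, pool.map(classify, names)))

if __name__ == "__main__":
    # Made-up token counts, purely for illustration.
    counts = {"a.md": 3_794, "b.md": 15_000, "huge.md": 40_000}
    fits, oversized = split_by_fit(counts)
    labels = classify_all(list(fits), lambda name: "placeholder-label")
    print(labels)      # labels for the documents that fit
    print(oversized)   # documents routed to separate handling
```

In a real run, `classify` would send the document to the inference server and parse the returned label; capping the worker count at 8 keeps the number of in-flight requests equal to the number of 16k context allocations.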

Key Points

Achieved ~2000 TPS using Qwen 3.5 27B quantized model (Q5_K_XL) on RTX 5090 hardware.
Processed 1.2M input tokens to generate 815 output tokens, classifying 320 docs in 10 minutes.
Optimizations included a 128k context window, batch size of 8, and disabling unused vision modules.

Why It Matters

Shows how workload-specific tuning can unlock large throughput gains for high-volume, real-world AI tasks like document processing.
