Open Source

2000 TPS with QWEN 3.5 27b on RTX-5090

A developer hit roughly 2000 tokens per second by tuning the inference configuration of a quantized Qwen 3.5 27B model for a high-throughput classification task.

Deep Dive

A developer reports a throughput of roughly 2000 tokens per second (TPS) with Alibaba's Qwen 3.5 27B model on an RTX 5090 GPU. The workload was a specialized document classification task: 320 unique markdown documents classified in a 10-minute window, consuming 1,214,072 input tokens and producing only 815 output tokens. Because almost all of those tokens are input, the figure mainly reflects prompt-processing speed rather than generation speed. The setup ran the quantized unsloth/Qwen3.5-27B-UD-Q5_K_XL.gguf model on the official llama.cpp server-cuda13 image. The performance came from a configuration tuned for one specific high-volume workflow in which every document is distinct, so prompt caching offers no benefit.
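The headline figure can be checked from the reported totals. The short Python sketch below does the arithmetic; it assumes the "10-minute window" is exactly 600 seconds, which the source does not state precisely, and the variable names are illustrative only.

```python
# Reported totals from the run (taken from the summary above).
input_tokens = 1_214_072
output_tokens = 815
documents = 320

# Assumption: the "10-minute window" is treated as exactly 600 seconds.
window_seconds = 10 * 60

total_tokens = input_tokens + output_tokens

# Combined throughput across input and output tokens.
total_tps = total_tokens / window_seconds

# Generation-only throughput, to show how little of the work is decoding.
output_tps = output_tokens / window_seconds

# Share of all tokens that are prompt (input) tokens.
input_share = input_tokens / total_tokens

# Average document size and answer length.
avg_input_per_doc = input_tokens / documents
avg_output_per_doc = output_tokens / documents

print(f"total throughput:   {total_tps:,.0f} tokens/s")
print(f"output throughput:  {output_tps:.2f} tokens/s")
print(f"input share:        {input_share:.4%}")
print(f"avg input per doc:  {avg_input_per_doc:,.0f} tokens")
print(f"avg output per doc: {avg_output_per_doc:.2f} tokens")
```

Under that assumption the combined rate lands just above 2000 tokens per second, with each document averaging a few thousand input tokens and only two or three output tokens, which is what a short classification label would look like.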

Critical optimizations included disabling the unused vision modules, enforcing a "no thinking" mode so the model does not spend tokens on a reasoning trace before answering, and managing VRAM carefully so the entire context fits on the GPU during inference. The developer reduced the total context window to 128k and ran 8 requests in parallel, which leaves 16k of context per request. This configuration let the system process the vast majority of documents at high speed, setting aside only the largest 1% for separate handling. These numbers are situational and do not come from formal benchmarks, but they show that a quantized model with a workload-specific configuration can reach very high throughput for targeted enterprise applications such as bulk document analysis and classification.
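The context split and the "set aside the largest documents" step can be sketched in a few lines of Python. This is an illustration rather than the developer's actual tooling: it assumes "128k" means 128 × 1024 tokens, that the total context is divided evenly across the 8 parallel requests as the summary describes, and that a small reserve is kept for the prompt template and the answer. The function names, the reserve value, and the sample token counts are all hypothetical.

```python
from typing import Dict, List, Tuple


def per_slot_context(total_ctx: int, parallel: int) -> int:
    """Context available to each request when the total is split evenly."""
    if parallel <= 0:
        raise ValueError("parallel must be a positive integer")
    return total_ctx // parallel


def split_by_budget(
    doc_tokens: Dict[str, int], slot_ctx: int, reserve: int
) -> Tuple[List[str], List[str]]:
    """Separate documents that fit in one slot from those that do not.

    `reserve` is a hypothetical allowance for the instruction prompt and
    the short classification answer.
    """
    budget = slot_ctx - reserve
    fits = [name for name, n in doc_tokens.items() if n <= budget]
    oversized = [name for name, n in doc_tokens.items() if n > budget]
    return fits, oversized


# Assumption: "128k" is 128 * 1024 tokens, shared by 8 parallel requests.
TOTAL_CTX = 128 * 1024
PARALLEL = 8
slot_ctx = per_slot_context(TOTAL_CTX, PARALLEL)  # 16384 tokens per request

# Made-up token counts, for illustration only.
sample_docs = {"a.md": 3_800, "b.md": 15_000, "c.md": 42_000}

fits, oversized = split_by_budget(sample_docs, slot_ctx, reserve=512)
print("per-request context:", slot_ctx)
print("fast path:", fits)
print("handled separately:", oversized)
```

Routing on a precomputed token count keeps the fast path free of requests that would overflow their slot, at the cost of a second pass for the few oversized documents.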

Key Points
  • Achieved ~2000 TPS using Qwen 3.5 27B quantized model (Q5_K_XL) on RTX 5090 hardware.
  • Processed 1.2M input tokens to generate 815 output tokens, classifying 320 docs in 10 minutes.
  • Optimizations included a 128k total context window, 8 parallel requests (16k of context each), "no thinking" mode, and disabling unused vision modules.

Why It Matters

Shows how workload-specific configuration of a quantized model can deliver very high throughput for high-volume, real-world AI tasks like document processing.