Serving 1B+ tokens/day locally in my research lab
A university hospital lab sustains over 1B tokens/day on local hardware with a custom vLLM stack.
A research lead at a university hospital has built an internal LLM server that processes over 1 billion tokens daily. The system runs on two NVIDIA H200 GPUs, serving the GPT-OSS-120B model via the vLLM inference engine in an MXFP4 quantized format, which proved crucial for performance. This configuration sustains a decode speed of approximately 220 tokens per second, significantly outpacing other models tested on the same hardware, such as Qwen and GLM-Air. The lab processes large volumes of clinical data, with about two-thirds of the token volume going to ingestion and one-third to generation, while maintaining the reliable JSON output and tool-calling behavior essential for structuring medical data.
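The JSON-mode behavior can be exercised through any OpenAI-compatible client. A minimal sketch, assuming a LiteLLM proxy listening at `localhost:4000` and a model alias of `gpt-oss-120b` (both illustrative, not the lab's actual values):

```python
# Minimal sketch: requesting structured JSON from an OpenAI-compatible
# endpoint. The base_url, API key, and model alias are illustrative
# assumptions, not the lab's actual configuration.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",  # hypothetical LiteLLM proxy address
    api_key="sk-local",                   # placeholder; LiteLLM manages real keys
)

response = client.chat.completions.create(
    model="gpt-oss-120b",                      # assumed model alias on the proxy
    response_format={"type": "json_object"},   # constrain output to valid JSON
    messages=[
        {"role": "system", "content": "Extract the requested fields as JSON."},
        {"role": "user", "content": "Patient note: ..."},
    ],
)
print(response.choices[0].message.content)  # JSON string ready for parsing
```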
The architecture is a Docker-based microservices stack: LiteLLM provides an OpenAI-compatible proxy API and handles routing, rate limiting, and a priority queue across the two vLLM instances (one per GPU). Prometheus scrapes metrics every 5 seconds for visualization in Grafana, and usage data is logged to PostgreSQL. The lead selected GPT-OSS-120B after extensive testing, favoring its verified performance on published benchmarks and its MXFP4 quantization, which is well suited to H200 GPUs and delivered nearly double the throughput of alternative formats like NVFP4 or GGUF. While the lead is eyeing future models like Mistral Small 4, current demand is too high to take the system offline, proving its value as a critical, scalable piece of research infrastructure.
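On the monitoring side, each vLLM instance exposes a plain-text Prometheus `/metrics` endpoint that the 5-second scrape job reads. A quick hand-check of that path might look like the following; the ports and `vllm:`-prefixed metric names reflect common vLLM defaults and are assumptions, not the lab's configuration:

```python
# Sketch of the monitoring path: reading each vLLM instance's
# Prometheus endpoint directly. Ports and metric names follow common
# vLLM defaults and are assumptions, not the lab's actual setup.
import requests

INSTANCES = ["http://localhost:8000", "http://localhost:8001"]  # one per GPU (assumed ports)

# Gauges vLLM exposes under the "vllm:" prefix in recent releases.
WATCH = ("vllm:num_requests_running", "vllm:num_requests_waiting")

for base in INSTANCES:
    text = requests.get(f"{base}/metrics", timeout=5).text
    for line in text.splitlines():
        if line.startswith(WATCH):
            print(base, line)  # e.g. vllm:num_requests_running{...} 12.0
```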
- Processes over 1 billion tokens per day (2/3 ingestion, 1/3 generation) on two H200 GPUs (see the back-of-envelope check below)
- Achieves ~220 tokens/second decode speed using GPT-OSS-120B with MXFP4 quantization on vLLM
- Production stack includes LiteLLM proxy, PostgreSQL, Prometheus/Grafana monitoring, all containerized with Docker
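For a sense of scale, the headline figures alone imply a sustained aggregate rate well above any single 220 tok/s stream, which is where vLLM's continuous batching earns its keep. A back-of-envelope check using only numbers from the post:

```python
# Back-of-envelope arithmetic on the published figures (no new data).
tokens_per_day = 1_000_000_000
seconds_per_day = 86_400

aggregate = tokens_per_day / seconds_per_day   # average rate, both GPUs combined
generation = aggregate / 3                     # post attributes ~1/3 of volume to generation
print(f"aggregate:  {aggregate:,.0f} tok/s")   # ~11,574 tok/s
print(f"generation: {generation:,.0f} tok/s")  # ~3,858 tok/s vs ~220 tok/s per stream
```

If the ~220 tok/s figure is per request, sustaining the generation share alone implies batching on the order of 17-18 concurrent streams across the two instances.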
Why It Matters
It demonstrates that high-throughput local LLM deployment is viable for sensitive domains like healthcare, reducing both cloud costs and data-privacy risk.