Open Source

Friendly reminder: inference is WAY faster on Linux vs. Windows

A developer's home lab PC generated tokens 72-118% faster on Ubuntu versus Windows 10.

Deep Dive

A developer's weekend experiment with the popular Ollama platform has revealed a stark performance divide between operating systems for local AI inference. Using a home lab PC equipped with an NVIDIA RTX 8000 GPU (48GB VRAM), 64GB of DDR4 RAM, and a Core i9-9900K CPU, they benchmarked identical models on a fresh Windows 10 install versus their usual Ubuntu 22.04 LTS setup. The results were unambiguous: inference speed, measured in tokens per second (t/s), was consistently and significantly faster on Linux.

Testing the QWEN Code Next model at 4-bit quantization (q4) with a 6k context length, Linux produced 31 t/s versus 18 t/s on Windows—a 72% increase. The gap widened with the larger QWEN 3 30B model, where Linux achieved 105 t/s compared to Windows' 48 t/s, representing a massive 118% performance uplift. This suggests the performance delta is not merely marginal but potentially transformative for developers iterating on models or researchers running local experiments, where faster iteration directly translates to productivity gains.
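The quoted percentages follow directly from the measured throughputs; as a minimal arithmetic sketch (using the t/s figures reported in the test, with results truncated to whole percentages as the article quotes them):

```python
def pct_uplift(linux_tps: float, windows_tps: float) -> int:
    """Speedup of Linux over Windows throughput, as a whole-number
    percentage (truncated, matching the figures quoted above)."""
    return int((linux_tps / windows_tps - 1) * 100)

# QWEN Code Next, q4, 6k context: 31 t/s on Linux vs 18 t/s on Windows
print(pct_uplift(31, 18))   # 72
# QWEN 3 30B: 105 t/s on Linux vs 48 t/s on Windows
print(pct_uplift(105, 48))  # 118
```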

The finding underscores a critical, often overlooked factor in the AI development stack: the foundational software layer. While users often focus on GPU specs or model architecture, this test indicates that the choice of operating system and its underlying drivers and scheduler can be a major bottleneck or a key accelerator. For professionals building or testing AI applications locally, switching to a Linux environment could effectively double their hardware's capabilities without any additional financial investment, making it a high-impact, low-cost optimization.

Key Points
  • The QWEN 3 30B model ran at 105 tokens/second on Linux vs. 48 t/s on Windows, a 118% performance increase.
  • Testing was done on identical hardware (RTX 8000 GPU, 64GB RAM, i9-9900K CPU) using the Ollama platform.
  • The performance gap highlights a major software-layer optimization opportunity for AI developers and researchers.

Why It Matters

For developers running local LLMs, choosing Linux over Windows can effectively double inference speed, drastically cutting iteration time at zero hardware cost.