Open Source

Club-5060ti: Tested RTX 5060 Ti configs for Qwen 27B local LLMs

Two RTX 5060 Ti 16GB GPUs serving Qwen3.6 27B at NVFP4 with long-context up to 204800

Deep Dive

The club-5060ti repo, inspired by club-3090, focuses on reproducing and sharing exact local LLM setups for the RTX 5060 Ti 16GB. The current seed configuration uses two RTX 5060 Ti cards on Linux, running Qwen3.6 27B in two serving stacks: vLLM with NVFP4 mixed-precision and multi-token prediction (MTP), and llama.cpp with MTP GGUF at Q4 and Q6 quantization levels. For users needing extreme context windows, a Q6 long-context preset supports up to 204800 tokens, while a safer 65536 token preset is provided for llama.cpp to leave headroom. Initial tests also cover the larger Qwen3.6 35B A3B model on both vLLM and llama.cpp.

The repo goes beyond vague tokens-per-second claims by including sanitized launch examples, helper scripts for model downloads and llama.cpp updates, and simple OpenAI-compatible smoke tests and benchmarking scripts. All results are documented in CSV seed result templates with exact software versions, KV cache settings, and caveats. The aim is to let anyone with similar hardware precisely reproduce and validate these configurations. Contributions via issues or PRs are welcome, provided they include enough detail to replicate the result. This makes the repo a valuable resource for professionals building affordable, high-performance local inference rigs.

Key Points
  • Dual RTX 5060 Ti 16GB on Linux running Qwen3.6 27B via vLLM NVFP4/MTP and llama.cpp Q4/Q6 GGUF.
  • Long-context presets up to 204800 tokens (Q6) and a safer 65536 token llama.cpp router preset.
  • Includes sanitized launch configs, model download scripts, and CSV seed results for exact reproducibility.

Why It Matters

Enables professionals to run advanced open-source LLMs locally on affordable dual-RTX 5060 Ti hardware with verified, reproducible configurations.