Open Source

Ornith-1.0-35B quantized to Q3_K_M fits 17GB VRAM, passes behavior suite

A 3.87 bits-per-weight quant runs 240 tok/s on a single GPU—half the VRAM of Q8.

Deep Dive

A community quantizer has released a highly efficient 3.87 bits-per-weight (BPW) version of the 35B-parameter Ornith-1.0 model, called Q3_K_M, that fits comfortably on a single consumer GPU with ~17 GB VRAM. The quantizer used llama-quantize from the upstream BF16 GGUF, reducing the model from 16.01 BPW to 3.87 BPW. The resulting file is 16.8 GB on disk and loads about 17 GiB in VRAM—roughly 21% smaller than the Q4_K_M variant. To validate quality, the author built a corrected top-64 next-token KL divergence probe (against BF16) over 32 coding prompts. The Q3_K_M achieves a mean KL divergence of 0.366, with 84.4% top-1 token agreement, compared to 0.086 and 90.6% for Q4_K_M. Higher quants (Q5, Q6, Q8) show better fidelity but require significantly more VRAM (up to 36.9 GB for Q8_0). Performance benchmarks on a single GPU using llama.cpp CUDA server show ~240 tok/s single-stream, scaling to ~493 tok/s with 16 concurrent slots, and p95 time-to-first-token (TTFT) of ~78 ms at concurrency 1.

The release includes a full suite of quantizations (Q3 through Q8) mirrored from upstream plus the new Q3, all validated against the same 14/14 behavior suite. Notably, the author found and fixed a reasoning-mode bug in llama.cpp where short coding prompts could exhaust the response budget parsing reasoning content, leaving empty final content; serving scripts now default to REASONING=off to pass the full suite. Additional contributions include a single-step LoRA SFT smoke test (no fine-tuned adapter yet) and OpenAI-compatible correctness gates for serving endpoints. The quantizer warns that vLLM had a broken GGUF path with corrupted Q4_K_M output, recommending llama.cpp for these files. The repo is hosted on Hugging Face at LordNeel/Ornith-1.0-35B-GGUF-llamacpp-tp1, and the author is working on quants for a 397B model and performance improvements for existing ones.

Key Points
  • Ornith-1.0-35B Q3_K_M: 3.87 BPW, 16.8 GB disk, ~17 GB VRAM—21% smaller than Q4_K_M and fits single GPU.
  • Mean KL divergence vs BF16: 0.366; top-1 token match: 84.4% (vs 90.6% for Q4_K_M, 100% for Q6_K).
  • Performance: 240 tok/s single-stream, up to 493 tok/s at 16 concurrent slots, p95 TTFT 78 ms on llama.cpp CUDA.

Why It Matters

Enables running a capable 35B model on affordable single-GPU setups with good speed and acceptable quality trade-offs.

📬 Get the top 10 AI stories daily