Open Source

Are you quanting your memory?

Reddit engineers debate KV cache formats: how BF16, Q8, Q4, and TurboQuant trade off speed, memory, and model quality.

Deep Dive

A Reddit user asks how people handle their KV cache: BF16, Q8, Q4, TurboQuant, or some other format. They run BF16 themselves, hoping for fewer hallucinations and because that is the precision their g4 and q3.6 models were trained in, but want to hear whether others get good results with Q8, Q4, or turbo3/4.

Key Points
  • BF16 offers the highest quality and lowest hallucination risk because models are natively trained in it.
  • Q8 cuts KV cache memory roughly in half versus BF16 with minimal quality loss, making it the most popular trade-off (see the sketch after this list).
  • Q4 and turbo3/4 can halve memory again, but risk more artifacts in long-context and coding tasks.
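
To make those percentages concrete, here is a minimal sketch of the arithmetic, assuming an illustrative Llama-3-8B-style configuration (32 layers, 8 KV heads, head dimension 128); the model shape is our assumption, not from the thread, and the clean 2x ratios ignore the small per-block scale overhead that real Q8/Q4 cache formats add:

```python
# KV cache memory at different precisions for an illustrative
# Llama-3-8B-style model (assumed config, not from the thread).

def kv_cache_bytes(seq_len, bytes_per_elem,
                   n_layers=32, n_kv_heads=8, head_dim=128):
    """Bytes to cache keys AND values for one sequence of seq_len tokens."""
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = K and V
    return elems_per_token * seq_len * bytes_per_elem

SEQ_LEN = 8192
for name, nbytes in [("BF16", 2.0), ("Q8", 1.0), ("Q4", 0.5)]:
    gib = kv_cache_bytes(SEQ_LEN, nbytes) / 2**30
    print(f"{name}: {gib:.2f} GiB for a {SEQ_LEN}-token context")
# BF16: 1.00 GiB, Q8: 0.50 GiB, Q4: 0.25 GiB
```

The same formula scales linearly with context length, so the absolute savings grow quickly for the long-context workloads where Q4 artifacts are most likely to show up.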

Why It Matters

Choosing the right KV cache quantization can cut inference memory and costs by 2x or more without meaningfully degrading model quality.
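
The cost claim follows from batch size: with a fixed memory budget left after model weights, halving per-sequence cache memory roughly doubles how many requests can run concurrently. A hedged sketch, reusing the per-sequence figures from the calculator above and assuming a hypothetical 80 GiB accelerator holding roughly 16 GiB of BF16 weights:

```python
# How KV cache precision changes max concurrent sequences under a
# fixed memory budget (all numbers are illustrative assumptions).

GPU_MEM_GIB = 80.0   # assumed accelerator memory
WEIGHTS_GIB = 16.0   # assumed BF16 weights for an 8B-class model
CACHE_BUDGET = GPU_MEM_GIB - WEIGHTS_GIB

# Per-sequence KV cache for an 8192-token context, from the earlier
# sketch: 1.00 GiB at BF16, halving with each step down in precision.
per_seq_gib = {"BF16": 1.00, "Q8": 0.50, "Q4": 0.25}

for name, gib in per_seq_gib.items():
    batch = int(CACHE_BUDGET // gib)
    print(f"{name}: up to {batch} concurrent 8192-token sequences")
# BF16: 64, Q8: 128, Q4: 256 -- each halving of the cache roughly
# doubles batch size, i.e. halves the per-request serving cost.
```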