Research & Papers

We open-sourced Chaperone-Thinking-LQ-1.0, a 4-bit GPTQ + QLoRA fine-tuned DeepSeek-R1-32B that hits 84% on MedQA in ~20GB

A 4-bit quantized DeepSeek-R1-32B approaches GPT-4o accuracy on healthcare reasoning.

Deep Dive

EmpirischTech has released Chaperone-Thinking-LQ-1.0 on Hugging Face, a reasoning model built on DeepSeek-R1-Distill-Qwen-32B and aimed at efficient deployment. The pipeline combines 4-bit GPTQ quantization, a calibration-based post-training method that shrinks the model from ~60GB to ~20GB while minimizing accuracy loss, with QLoRA fine-tuning on medical and scientific corpora. Notably, the team removed the adaptive identity layer for transparency, so the model correctly attributes its architecture to DeepSeek's original work. The result is a compact model that scores 84% on MedQA, within 4 points of GPT-4o's ~88%, while fitting on a single L40/L40S GPU.
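The size figures are consistent with simple bit-width arithmetic. A back-of-envelope sketch in Python, where the parameter count (~32.8B) and the overhead allowance for quantization scales and fp16 embeddings are our own illustrative assumptions, not figures from the release:

```python
# Back-of-envelope check of the ~60GB -> ~20GB compression claim.
# PARAMS and the overhead allowance are illustrative assumptions,
# not numbers published with the model.
PARAMS = 32.8e9  # approximate parameter count of a "32B" model

def checkpoint_size_gb(params: float, bits_per_param: float,
                       overhead_gb: float = 0.0) -> float:
    """Approximate on-disk size: params * bits / 8 bytes, plus overhead
    for quantization scales, embeddings, and other higher-precision tensors."""
    return params * bits_per_param / 8 / 1e9 + overhead_gb

fp16_gb = checkpoint_size_gb(PARAMS, 16)                  # roughly in line with "~60GB"
int4_gb = checkpoint_size_gb(PARAMS, 4, overhead_gb=3.0)  # roughly in line with "~20GB"
print(f"fp16: {fp16_gb:.1f} GB, 4-bit GPTQ: {int4_gb:.1f} GB")
```

The remaining gap to ~20GB in practice comes from exactly this kind of overhead: GPTQ stores per-group scales and zero-points, and embedding/head layers are typically left unquantized.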

Benchmark results show competitive performance across the board: 91.9% on MATH-500, 85.9% on MMLU, 66.7% on AIME 2024, and 56.7% on GPQA Diamond. The model achieves 36.86 tok/s throughput versus 22.84 tok/s for the base DeepSeek-R1-32B, a 1.6x speedup with ~43% lower median latency. EmpirischTech developed this for enterprise healthcare clients with strict data sovereignty requirements, enabling on-prem reasoning without API calls to OpenAI. The CC-BY-4.0 license allows broad use, making it a practical option for organizations needing high-performance AI under tight data control.
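The reported throughput figures support the claimed speedup; a quick check using the numbers above (note that throughput alone implies ~38% less time per token, so the ~43% median latency reduction is a separate end-to-end measurement, not derivable from these two figures):

```python
# Sanity-check the claimed 1.6x speedup from the reported throughput figures.
base_tps = 22.84    # base DeepSeek-R1-32B, tokens/second (reported)
quant_tps = 36.86   # Chaperone-Thinking-LQ-1.0, tokens/second (reported)

speedup = quant_tps / base_tps               # ~1.61x, matching the "1.6x" claim
per_token_saving = 1 - base_tps / quant_tps  # ~38% less time per generated token
print(f"speedup: {speedup:.2f}x, per-token time saved: {per_token_saving:.0%}")
```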

Key Points
  • Compresses DeepSeek-R1-32B from ~60GB to ~20GB via 4-bit GPTQ quantization
  • Achieves 84% on MedQA, within 4 points of GPT-4o (~88%)
  • Runs 1.6x faster at 36.86 tok/s with ~43% lower latency than the base model

Why It Matters

Enables enterprise healthcare to run near-frontier reasoning on-prem, preserving data sovereignty without sacrificing performance.