llama.cpp adds llama-eval for local model benchmarking with AIME, GSM8K
Evaluate and compare LLM quantizations and finetunes at home using standard datasets.
Deep Dive
llama-eval lets you evaluate models on your own machine and compare quantizations and finetunes against each other. Supported datasets include AIME, AIME2025, GSM8K, and GPQA.
Key Points
- Pull Request #21152 by ggerganov adds llama-eval to the llama.cpp project.
- Supports evaluation datasets: AIME, AIME2025, GSM8K, and GPQA.
- Enables local comparison of model quantizations and finetunes without cloud dependency.
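To make "comparing quants" concrete, here is a minimal Python sketch of the kind of exact-match scoring typically used for GSM8K-style math problems. It is not llama-eval's implementation, and the source does not describe the tool's command-line interface or scoring rules. The helper names (`final_number`, `accuracy`), the run labels (`quant_a`, `quant_b`), and all sample strings are hypothetical placeholders, not real model output. The one dataset detail it relies on is that GSM8K reference answers end with `#### <number>`.

```python
import re

# Matches integers and decimals, optionally negative, with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def final_number(text):
    """Return the last number found in text as a float, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def accuracy(outputs, golds):
    """Fraction of outputs whose final number equals the gold final number."""
    if len(outputs) != len(golds):
        raise ValueError("outputs and golds must have the same length")
    correct = 0
    for out, gold in zip(outputs, golds):
        predicted = final_number(out)
        if predicted is not None and predicted == final_number(gold):
            correct += 1
    return correct / len(golds)


# Placeholder reference answers in the GSM8K "#### <number>" style.
golds = ["Work shown here. #### 18", "Work shown here. #### 1,200"]

# Placeholder generations from two hypothetical quantizations of one model.
runs = {
    "quant_a": ["So the total is 18.", "That makes 1200 dollars."],
    "quant_b": ["So the total is 18.", "I think it is 1100."],
}

scores = {name: accuracy(outputs, golds) for name, outputs in runs.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Taking the last number in the output is a common but crude heuristic. A real harness usually constrains the answer format through the prompt, and multiple-choice sets such as GPQA need letter matching instead.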
Why It Matters
Developers and researchers can benchmark models on their own hardware, which makes quantization and finetune comparisons reproducible and possible offline, with no cloud service involved.