Open Source

llama.cpp adds llama-eval for local model benchmarking with AIME, GSM8K

Evaluate and compare LLM quantizations and finetunes at home using standard datasets.

Deep Dive

You can now evaluate your models at home: llama-eval is a tool for comparing quants and finetunes. Datasets include AIME, AIME2025, GSM8K, and GPQA.

Key Points
  • Pull Request #21152 by ggerganov adds llama-eval to the llama.cpp project.
  • Supports evaluation datasets: AIME, AIME2025, GSM8K, and GPQA.
  • Enables local comparison of model quantizations and finetunes without cloud dependency.
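To make the idea of comparing quantizations concrete, here is a minimal sketch of exact-match scoring on GSM8K-style numeric answers. It is not llama-eval's implementation, whose interface the source does not describe. The function names, the reference answers, and the sample model outputs are all hypothetical and chosen only for illustration.

```python
import re


def extract_final_number(text):
    """Return the last number found in a model response, or None if there is none."""
    # Drop thousands separators so "70,000" is read as a single number.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def accuracy(responses, references):
    """Fraction of responses whose final number equals the reference answer."""
    correct = 0
    for response, reference in zip(responses, references):
        predicted = extract_final_number(response)
        if predicted is not None and predicted == float(reference):
            correct += 1
    return correct / len(references)


# Hypothetical reference answers and model outputs (not real benchmark data).
references = ["18", "3", "70000"]
outputs_q8 = ["... so the answer is 18.", "She needs 3 bolts.", "The profit is $70,000"]
outputs_q4 = ["... so the answer is 18.", "She needs 4 bolts.", "The profit is $70,000"]

# One score per (hypothetical) quantization, computed on the same questions.
scores = {
    "q8": accuracy(outputs_q8, references),
    "q4": accuracy(outputs_q4, references),
}
print(scores)
```

Scoring every variant against the same questions and the same answer-extraction rule is what makes the numbers comparable. A real evaluation would use the full dataset and would likely need a more careful answer parser than "last number in the response".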

Why It Matters

llama-eval democratizes model benchmarking: developers and researchers can run reproducible, offline evaluations on their own machines.