b8906
Claude Code now works properly with llama.cpp server thanks to a clever checksum workaround.
llama.cpp version b8906 is now available, addressing a critical prefix-caching bug in the Anthropic API integration. The bug caused the server to reuse only 18,577 cached tokens (n_past) even when the actual context exceeded 60,000 tokens, significantly degrading performance for users running Claude Code against llama.cpp. The root cause was a checksum in the x-anthropic-billing-header system message that changed on every request, breaking the byte-identical prefix match that caching depends on.
The fix replaces the variable 5-character hexadecimal checksum with the constant 'fffff', so the prompt prefix is identical across requests and caching works consistently. The workaround treats the checksum as an internal detail of the Anthropic message body API and is coded defensively in case the protocol changes. The release includes builds for macOS (Apple Silicon, Intel, iOS), Linux (multiple architectures), Windows (CPU, CUDA, Vulkan, HIP), and Android.
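The normalization described above can be sketched as a simple substitution. This is an illustrative reconstruction, not llama.cpp's actual code: the header name comes from the release notes, but the exact message layout and the function name are assumptions.

```python
import re

# Match the x-anthropic-billing-header field followed by a 5-character
# hex checksum. The surrounding message format is a guess for illustration.
CHECKSUM_RE = re.compile(r"(x-anthropic-billing-header:\s*)[0-9a-f]{5}")

def normalize_billing_checksum(system_msg: str) -> str:
    """Replace the per-request hex checksum with the constant 'fffff' so
    identical prompts stay byte-identical and the prefix cache can hit."""
    return CHECKSUM_RE.sub(r"\1fffff", system_msg)
```

With this normalization, two requests that differ only in the checksum produce the same token prefix, so the server can reuse the cached context instead of reprocessing it.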
- Fixed prefix caching for the Anthropic API, resolving an issue where only 18,577 tokens (n_past) were reused despite 60k+ context
- Replaced changing x-anthropic-billing-header checksum with 'fffff' to enable caching
- Available for macOS, Linux, Windows, and Android with multiple backend options
Why It Matters
Enables efficient Claude Code usage with llama.cpp, reducing latency for large context AI applications.