b8906
Claude Code now works properly with llama.cpp server thanks to a clever checksum workaround.
llama.cpp version b8906 is now available, addressing a critical prefix-caching bug in the Anthropic API integration. The bug caused the server to reuse only 18,577 cached tokens (n_past) even when the actual context exceeded 60,000 tokens, significantly degrading performance for users running Claude Code against llama.cpp. The root cause was a checksum in the x-anthropic-billing-header system message that changed on every request, breaking the byte-identical prefix match that caching depends on.
The fix replaces the variable 5-character hexadecimal checksum with the constant 'fffff', so the prompt prefix is identical across requests and caching works consistently. The workaround treats the checksum as an internal detail of the Anthropic message body API and is coded defensively in case the protocol changes. The release includes builds for macOS (Apple Silicon, Intel, iOS), Linux (multiple architectures), Windows (CPU, CUDA, Vulkan, HIP), and Android.
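The normalization described above can be sketched as a simple substitution. This is an illustrative reconstruction, not llama.cpp's actual code: the header name comes from the release notes, but the exact message layout and the function name are assumptions.

```python
import re

# Match the x-anthropic-billing-header field followed by a 5-character
# hex checksum. The surrounding message format is a guess for illustration.
CHECKSUM_RE = re.compile(r"(x-anthropic-billing-header:\s*)[0-9a-f]{5}")

def normalize_billing_checksum(system_msg: str) -> str:
    """Replace the per-request hex checksum with the constant 'fffff' so
    identical prompts stay byte-identical and the prefix cache can hit."""
    return CHECKSUM_RE.sub(r"\1fffff", system_msg)
```

With this normalization, two requests that differ only in the checksum produce the same token prefix, so the server can reuse the cached context instead of reprocessing it.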
- Fixed prefix caching for the Anthropic API, resolving an issue where only 18,577 tokens (n_past) were reused despite 60k+ context
- Replaced changing x-anthropic-billing-header checksum with 'fffff' to enable caching
- Available for macOS, Linux, Windows, and Android with multiple backend options
Why It Matters
Enables efficient Claude Code usage with llama.cpp, reducing latency for large context AI applications.