Research & Papers

Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads

An open-source shim pairs a small local model with a cloud LLM, cutting cloud token usage by 45-79% on some coding-agent workloads.

Deep Dive

A research team from Kwame Nkrumah University of Science and Technology has published Local-Splitter, a measurement study that systematically quantifies seven tactics for reducing cloud LLM token consumption on coding-agent workloads. The approach places a small local model as a triage layer in front of a more powerful (and expensive) frontier cloud model. The researchers implemented all seven tactics in an open-source shim that exposes both MCP (Model Context Protocol) and OpenAI-compatible HTTP surfaces, supporting any local model via Ollama and any cloud model through a standard OpenAI-compatible endpoint.
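
To make the triage pattern concrete, here is a minimal sketch (not the authors' shim) of a local-first router. It assumes Ollama's OpenAI-compatible endpoint on localhost:11434; the model names and the is_simple() heuristic are placeholders for illustration:

    # Minimal triage sketch, assuming Ollama's OpenAI-compatible endpoint.
    # Not the paper's implementation; model names are placeholders.
    from openai import OpenAI

    local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

    def is_simple(prompt: str) -> bool:
        """Hypothetical heuristic: short prompts without code fences stay
        local; everything else escalates to the frontier model."""
        return len(prompt) < 400 and "```" not in prompt

    def complete(prompt: str) -> str:
        client, model = (
            (local, "qwen2.5-coder:7b") if is_simple(prompt)
            else (cloud, "gpt-4o")
        )
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

Because both sides speak the same OpenAI-compatible surface, the routing decision reduces to choosing a client and a model name, which is what lets a shim like this sit transparently in front of existing agent tooling.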

The seven tactics are: local routing (deciding which queries stay local), prompt compression, semantic caching, local drafting with cloud review, minimal-diff edits, structured intent extraction, and batching combined with vendor prompt caching. The team evaluated each tactic individually, in pairs, and in greedy-additive combinations across four coding workload classes: edit-heavy, explanation-heavy, general chat, and RAG-heavy (retrieval-augmented generation). They measured tokens saved, dollar cost, latency, and routing accuracy to provide practical guidance.
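
As an example of one tactic, a semantic cache returns a previously paid-for cloud answer when a new query embeds close enough to a cached one. The sketch below is illustrative, not the paper's code; the embedding model and the 0.92 similarity threshold are assumptions:

    # Illustrative semantic-cache sketch (assumptions: embedding model,
    # 0.92 cosine-similarity threshold). Not the paper's implementation.
    import numpy as np
    from openai import OpenAI

    client = OpenAI()
    _cache: list[tuple[np.ndarray, str]] = []  # (query embedding, answer)

    def _embed(text: str) -> np.ndarray:
        v = client.embeddings.create(model="text-embedding-3-small", input=text)
        return np.array(v.data[0].embedding)

    def cached_answer(query: str, threshold: float = 0.92) -> str | None:
        q = _embed(query)
        for emb, answer in _cache:
            sim = float(q @ emb / (np.linalg.norm(q) * np.linalg.norm(emb)))
            if sim >= threshold:
                return answer  # cache hit: no completion tokens spent
        return None

    def remember(query: str, answer: str) -> None:
        _cache.append((_embed(query), answer))

Embedding calls still consume tokens, but far fewer than a full completion, which is what makes the trade worthwhile on repetitive workloads.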

The headline finding: combining T1 (local routing) with T2 (prompt compression) cuts cloud token usage by 45-79% on edit-heavy and explanation-heavy workloads. On RAG-heavy workloads, the full tactic set, including T4 (local drafting with cloud review), achieves 51% savings. Crucially, the study shows that the optimal tactic subset is workload-dependent, giving practitioners concrete guidance for deploying cost-effective coding agents. Because the implementation is open source, teams can apply these findings directly to their own AI-assisted development workflows.
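
In configuration terms, that workload dependence might look like the mapping below. This is a hedged reading of the reported results, not an artifact from the paper; only T1, T2, and T4 are named in this summary, so the remaining IDs are placeholders, and general chat is omitted because no optimal subset is reported for it here:

    # Tactic subsets by workload class, following the reported findings.
    # Only T1 (local routing), T2 (prompt compression), and T4 (draft-review)
    # are named above; T3/T5/T6/T7 stand in for the remaining tactics.
    TACTICS_BY_WORKLOAD = {
        "edit_heavy":        {"T1", "T2"},  # 45-79% cloud-token savings
        "explanation_heavy": {"T1", "T2"},  # 45-79% cloud-token savings
        "rag_heavy":         {"T1", "T2", "T3", "T4", "T5", "T6", "T7"},  # 51%
    }

    def tactics_for(workload: str) -> set[str]:
        # Falling back to routing alone is an assumption, not a finding.
        return TACTICS_BY_WORKLOAD.get(workload, {"T1"})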

Key Points
  • Local routing plus prompt compression saves 45-79% of cloud tokens on edit/explanation coding tasks
  • Open-source shim supports Ollama for local models and any OpenAI-compatible endpoint for cloud LLMs
  • Optimal cost-saving tactic combination varies by workload type (edit-heavy, RAG-heavy, etc.)

Why It Matters

Enables development teams to run AI coding assistants at significantly lower cost while maintaining performance.