TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition
New tool shows CPU-side overhead can dominate LLM inference latency, with MoE models dispatching 8-11x more kernels per token than dense models.
A team of researchers from Carnegie Mellon University and NVIDIA has published a paper titled 'TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition.' The work introduces a new diagnostic methodology that breaks down the often-hidden 'host-side' orchestration overhead during Large Language Model inference into three measurable components: framework translation time, CUDA library translation time, and kernel launch-path time. This granular breakdown is crucial because in latency-sensitive deployments—common for interactive AI assistants and agentic systems—this overhead can dominate total response time, but existing metrics often lump it together as an unactionable aggregate.
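TaxBreak's implementation is not reproduced in this summary, but the idea behind the decomposition can be sketched: on each decode step, wall-clock host time is attributed to the slice spent in the serving framework, the slice spent inside CUDA libraries translating framework ops into kernel calls, and the slice spent on the kernel launch path. The minimal Python sketch below illustrates that bookkeeping; the hooks `framework_step`, `cuda_translate`, and `launch_kernels` are hypothetical stand-ins, not TaxBreak's API.

```python
import time
from dataclasses import dataclass


@dataclass
class HostOverhead:
    """Per-token host-side orchestration time, split into three buckets."""
    framework_s: float = 0.0      # serving-framework time (scheduling, Python/C++ glue)
    cuda_library_s: float = 0.0   # CUDA-library time translating ops into kernel calls
    launch_path_s: float = 0.0    # kernel launch-path time (runtime/driver submission)

    @property
    def total_s(self) -> float:
        return self.framework_s + self.cuda_library_s + self.launch_path_s


def timed(fn, *args):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0


def decode_one_token(framework_step, cuda_translate, launch_kernels) -> HostOverhead:
    """Attribute one decode step's host time to the three overhead components.

    The three callables are illustrative instrumentation hooks, not TaxBreak's API.
    """
    ov = HostOverhead()
    ops, ov.framework_s = timed(framework_step)               # build the step's op list
    kernels, ov.cuda_library_s = timed(cuda_translate, ops)   # ops -> kernel descriptors
    _, ov.launch_path_s = timed(launch_kernels, kernels)      # submit kernels to the GPU
    return ov


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without a GPU.
    ov = decode_one_token(
        framework_step=lambda: ["matmul", "softmax"],
        cuda_translate=lambda ops: [f"{op}_kernel" for op in ops],
        launch_kernels=lambda kernels: len(kernels),
    )
    print(f"host overhead per token: {ov.total_s * 1e6:.1f} us "
          f"(framework {ov.framework_s:.6f}s, cuda-lib {ov.cuda_library_s:.6f}s, "
          f"launch {ov.launch_path_s:.6f}s)")
```

In a real measurement these slices would come from profiler hooks around the framework, CUDA runtime, and driver rather than wall-clock wrappers like these.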
TaxBreak was validated on NVIDIA H100 and H200 GPU systems. A key finding is that the problem is especially severe for Mixture-of-Experts (MoE) models: they dispatch 8-11x more kernels per output token than dense models, making them exceptionally host-bound. The research introduces a Host-Device Balance Index (HDBI) to summarize this relationship. Crucially, it demonstrates that simply looking at GPU utilization or total latency can misdirect optimization efforts. The study shows that for these workloads, CPU single-thread performance is a first-order parameter; upgrading to a faster host CPU reduced orchestration overhead by 10-29% and improved end-to-end latency by up to 14%, even when paired with a slower-clocked GPU.
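The paper's exact HDBI formula is not given here. One plausible reading, assumed purely for illustration, is a per-token ratio of host orchestration time to device compute time, where values above 1 mark a host-bound step; the numbers below are made up to show the shape of the comparison, not measurements from the paper.

```python
def host_device_balance(host_overhead_s: float, device_compute_s: float) -> float:
    """Hypothetical host-device balance ratio (an assumed reading of HDBI, not
    the paper's formula): values above 1 mean the token interval is set by
    host-side orchestration rather than GPU compute."""
    return host_overhead_s / max(device_compute_s, 1e-12)


# Illustrative values only: an MoE decode step that dispatches ~10x more kernels
# per token pays host overhead on every launch, so its ratio climbs even though
# each individual GPU kernel finishes quickly.
dense_ratio = host_device_balance(host_overhead_s=0.4e-3, device_compute_s=2.0e-3)
moe_ratio = host_device_balance(host_overhead_s=4.0e-3, device_compute_s=2.5e-3)
print(f"dense: {dense_ratio:.2f}  (device-bound)")
print(f"MoE:   {moe_ratio:.2f}  (host-bound)")
```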
The tool's primary value is as a diagnostic. It clearly distinguishes cases where engineers should focus on optimizing the software stack (e.g., framework efficiency) from cases where the real gains lie in reducing the device-side workload itself. This prevents wasted engineering effort. The paper, accepted at IEEE ISPASS 2026, provides a concrete framework for AI infrastructure teams to systematically identify and eliminate the hidden 'taxes' slowing down their LLM deployments, moving beyond guesswork to data-driven optimization.
- TaxBreak decomposes LLM inference overhead into 3 measurable parts: framework, CUDA library, and kernel launch-path time.
- Mixture-of-Experts (MoE) models dispatch 8-11x more kernels per token than dense models, making them highly susceptible to CPU bottlenecks.
- A faster host CPU reduced orchestration overhead by 10-29% and improved end-to-end latency by up to 14% in host-bound scenarios.
Why It Matters
Helps AI infrastructure teams pinpoint whether to optimize software or hardware, preventing wasted effort and speeding up real-world LLM applications.