Privacy-Aware Split Inference with Speculative Decoding for Large Language Models over Wide-Area Networks
A new method splits Mistral 7B across local and cloud GPUs, achieving up to 9.3 tokens/sec over 80 ms WAN links.
Deep Dive
Michael Cunningham presents a privacy-aware split inference system for LLMs. It divides transformer layers between a trusted local device and an untrusted cloud, using speculative decoding to overcome WAN latency. The method achieves 8.7-9.3 tok/s on Mistral 7B over 80 ms links while keeping raw tokens on the local device. With an 8-layer local split, token-recovery attack success drops to roughly 35%. It requires only 4.9 GB of local VRAM for 12B models while matching 7B-class throughput.
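The throughput claim hinges on speculative decoding amortizing the WAN round trip: the local side drafts several tokens per round, and the cloud verifies the whole batch in one exchange, so fewer round trips are paid per emitted token. A minimal back-of-the-envelope model of that trade-off is sketched below; all parameters (RTT, draft and verify latencies, acceptance rate) are illustrative assumptions, not figures from the paper:

```python
def speculative_throughput(rtt_s, draft_s, verify_s, k, accept_p):
    """Rough tokens/sec estimate for speculative decoding over a WAN link.

    Assumes i.i.d. per-token acceptance with probability accept_p, giving
    expected tokens per round = (1 - accept_p**(k+1)) / (1 - accept_p):
    the accepted draft tokens plus one token from the verifier itself.
    Illustrative model only -- not the paper's implementation.
    """
    tokens_per_round = (1 - accept_p ** (k + 1)) / (1 - accept_p)
    # Each round: k local draft steps, one cloud verify, one WAN round trip.
    round_time = k * draft_s + verify_s + rtt_s
    return tokens_per_round / round_time

# Hypothetical numbers: 80 ms RTT, 5 ms/token local draft, 30 ms cloud verify.
baseline = speculative_throughput(0.080, 0.005, 0.030, k=0, accept_p=0.8)
spec = speculative_throughput(0.080, 0.005, 0.030, k=8, accept_p=0.8)
print(f"baseline: {baseline:.1f} tok/s, speculative: {spec:.1f} tok/s")
```

With k=0 (no drafting) every token pays a full round trip, capping throughput near 1/RTT; drafting 8 tokens per round raises the estimate severalfold even under these toy numbers, which is the effect the system exploits. Real deployments would also pay draft-model overhead and activation-transfer costs this sketch ignores.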
Why It Matters
Enables enterprises to run powerful cloud LLMs on sensitive data without sacrificing performance or privacy.