Research & Papers

CoLLM: A Unified Framework for Co-execution of LLMs Federated Fine-tuning and Inference

New system runs fine-tuning and inference simultaneously on edge devices, achieving up to 3x higher goodput.

Deep Dive

A research team led by Shaoyuan Huang has introduced CoLLM, a novel framework that fundamentally changes how Large Language Models (LLMs) are deployed and improved at the edge. Traditional approaches treat fine-tuning and inference as separate processes, which requires duplicate resources and creates a delay between model improvements and their deployment. CoLLM removes this split by running federated parameter-efficient fine-tuning (FL PEFT) and inference simultaneously on shared edge replicas, so models continuously learn from user interactions while serving requests.
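To make the sharing concrete: with parameter-efficient methods such as LoRA, a replica can keep its adapter unmerged from the frozen base weights, so the forward pass always reads the adapter's latest parameters and fine-tuning updates become visible to inference on the very next request. The PyTorch sketch below illustrates that idea only; the layer, rank, and scaling choices are our assumptions, not CoLLM's actual implementation.

    import torch
    import torch.nn as nn

    class UnmergedLoRALinear(nn.Module):
        """Linear layer with an unmerged LoRA adapter.

        The frozen base weight serves inference while the small A/B matrices
        are updated by fine-tuning; because B*A is never merged into W, the
        newest adapter parameters are used by the next inference request.
        """
        def __init__(self, in_features, out_features, rank=8, alpha=16.0):
            super().__init__()
            self.base = nn.Linear(in_features, out_features, bias=False)
            self.base.weight.requires_grad_(False)  # base model stays frozen
            self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
            self.lora_B = nn.Parameter(torch.zeros(out_features, rank))  # zero init: no-op at start
            self.scaling = alpha / rank

        def forward(self, x):
            # y = W x + (alpha/r) * B A x, computed without merging B A into W
            return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

Since lora_B starts at zero, the layer initially behaves exactly like the base model, which is the standard LoRA convention.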

The system employs two key innovations: an intra-replica model sharing mechanism that enables real-time parameter reuse through unmerged inference and shadow adapter strategies, and a two-timescale inter-replica coordination algorithm that dynamically balances fine-tuning and inference workloads. This dual approach optimizes both long-term model quality gains and short-term inference efficiency. Extensive evaluations across diverse LLMs and real-world traces show that CoLLM consistently outperforms state-of-the-art systems, achieving up to 3x higher goodput, a metric that combines throughput with quality of service.
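The two timescales can be pictured as a slow loop that reassigns replica roles from predicted demand and a fast loop that routes each request. The sketch below is a simplified illustration under our own assumptions (the role labels, the proportional split, and the queue-length heuristic are hypothetical), not the paper's algorithm.

    def long_timescale_allocate(replicas, predicted_load):
        """Slow loop (e.g., every few minutes): decide how many replicas
        prioritize inference vs. fine-tuning based on predicted request load.
        The proportional split is an illustrative heuristic."""
        n_infer = max(1, round(len(replicas) * min(1.0, predicted_load)))
        for i, replica in enumerate(replicas):
            replica["role"] = "inference" if i < n_infer else "fine-tune"

    def short_timescale_dispatch(replicas):
        """Fast loop (per request): route to the least-loaded inference replica.
        Fine-tuning replicas remain usable as a fallback, since the shared
        unmerged adapter lets them serve requests too."""
        candidates = [r for r in replicas if r["role"] == "inference"] or replicas
        target = min(candidates, key=lambda r: r["queue"])
        target["queue"] += 1  # stand-in for real load accounting
        return target["id"]

    # Four replicas: with 60% predicted load, two serve and two fine-tune.
    replicas = [{"id": i, "role": "inference", "queue": 0} for i in range(4)]
    long_timescale_allocate(replicas, predicted_load=0.6)
    print([r["role"] for r in replicas])
    print([short_timescale_dispatch(replicas) for _ in range(4)])

Separating the loops keeps the per-request path cheap while the more expensive reallocation decision runs only occasionally.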

For edge intelligence applications, this means domain-specific models can now improve continuously through federated learning while maintaining low-latency inference for end users. The framework addresses the growing need for efficient LLM deployment in resource-constrained environments, from personalized AI assistants on mobile devices to specialized industrial applications. By eliminating the traditional separation between training and serving phases, CoLLM represents a significant step toward more adaptive and efficient edge AI systems.

Key Points
  • Unifies federated fine-tuning and inference on shared edge replicas, eliminating redundant deployments
  • Achieves up to 3x higher goodput compared to state-of-the-art LLM systems in evaluations
  • Uses two-timescale coordination to balance long-term model quality with short-term inference efficiency

Why It Matters

Enables continuous model improvement while serving users on edge devices, crucial for personalized and domain-specific AI applications.