Research & Papers

MARLaaS cuts RL fine-tuning time 85% with multi-tenant design

New system lets multiple teams fine-tune LLMs concurrently with near-zero idle time.

Deep Dive

Fine-tuning large language models with reinforcement learning from verifiable rewards (RLVR) is computationally expensive, which largely restricts it to well-resourced teams. To address this, researchers propose MARLaaS (Multi-tenant Asynchronous RL as a Service), a system designed for concurrent RL fine-tuning across multiple users and tasks. MARLaaS is built on two core ideas. First, tenants share a single base model and each tenant trains only its own lightweight LoRA adapter. Second, a disaggregated asynchronous architecture separates rollout generation, environment interaction, and policy training into independently scheduled stages. This event-driven design lets each task progress at its own pace, reducing cross-task interference and idle time.
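
To make the disaggregated design concrete, here is a minimal sketch of three independently scheduled stages connected by queues. It is an illustration only, not the MARLaaS implementation: the stage functions, the queue layout, the sleep-based latency stand-in, and the placeholder reward are all assumptions made for this example.

```python
import asyncio


async def rollout_stage(tasks, env_queue):
    """Generate rollouts for every tenant task, each at its own pace."""

    async def run(task_id, n_rollouts, delay):
        for step in range(n_rollouts):
            await asyncio.sleep(delay)  # stand-in for generation latency (assumption)
            await env_queue.put((task_id, step))

    await asyncio.gather(*(run(t, n, d) for t, n, d in tasks))
    await env_queue.put(None)  # sentinel: no more rollouts from any task


async def environment_stage(env_queue, train_queue):
    """Score rollouts as they arrive, regardless of which task produced them."""
    while True:
        item = await env_queue.get()
        if item is None:
            await train_queue.put(None)  # forward the sentinel downstream
            break
        task_id, step = item
        reward = 1.0 if step % 2 == 0 else 0.0  # placeholder verifiable reward
        await train_queue.put((task_id, step, reward))


async def training_stage(train_queue, updates):
    """Record one (hypothetical) adapter update per scored rollout, keyed by task."""
    while True:
        item = await train_queue.get()
        if item is None:
            break
        task_id, step, reward = item
        updates.setdefault(task_id, []).append((step, reward))


async def run_pipeline(tasks):
    """Run all three stages concurrently; tasks are (task_id, n_rollouts, delay)."""
    env_queue, train_queue = asyncio.Queue(), asyncio.Queue()
    updates = {}
    await asyncio.gather(
        rollout_stage(tasks, env_queue),
        environment_stage(env_queue, train_queue),
        training_stage(train_queue, updates),
    )
    return updates


if __name__ == "__main__":
    # A slow task and a fast task share the pipeline without blocking each other.
    print(asyncio.run(run_pipeline([("slow-task", 3, 0.02), ("fast-task", 3, 0.0)])))
```

Because every stage only waits on its own queue, the fast task's rollouts flow through scoring and training while the slow task is still generating, which is the behavior the event-driven description above is aiming for.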

In experiments with up to 32 concurrent tasks, MARLaaS reached performance on par with the single-task state of the art while improving accelerator utilization by 4.3x and slashing end-to-end training time by 85%. The system’s architecture enables efficient resource sharing without sacrificing individual task quality, making RL-based fine-tuning far more accessible for multi-agent, tool-use, and complex reasoning scenarios. MARLaaS represents a practical step toward democratizing RL for LLMs, particularly in resource-constrained environments.
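
For readers who think in speedup factors rather than percentages, the snippet below converts the reported 85% time reduction into an equivalent factor of about 6.7x. The factor is derived arithmetic, not a number quoted in the source, and the 100-hour baseline is purely hypothetical.

```python
def speedup_from_reduction(reduction):
    """Convert a fractional time reduction (e.g. 0.85) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def remaining_time(baseline_hours, reduction):
    """Wall-clock time left after applying a fractional reduction."""
    return baseline_hours * (1.0 - reduction)


if __name__ == "__main__":
    # Hypothetical baseline of 100 hours, chosen only for illustration.
    print(round(speedup_from_reduction(0.85), 2))
    print(round(remaining_time(100.0, 0.85), 2))
```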

Key Points
  • Shares a base model across tenants via lightweight LoRA adapters, minimizing memory overhead.
  • Disaggregated asynchronous architecture decouples rollout, environment interaction, and policy training for independent scheduling.
  • Improves accelerator utilization by 4.3x and cuts end-to-end training time by 85% in experiments with up to 32 concurrent tasks.
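
The first key point, one base model shared across tenants via LoRA adapters, can be sketched in a few lines of plain Python. This is a toy illustration of the general LoRA technique for a single weight matrix, not MARLaaS code: the class and function names, the 4096-wide layer, and rank 8 are assumptions picked for the example.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRAAdapter:
    """Per-tenant low-rank update: delta_W = (alpha / rank) * B @ A."""

    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Standard LoRA init: A is small random, B is zero, so delta_W starts at 0.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def num_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


def forward(base_weight, adapter, x):
    """Shared frozen base output plus this tenant's low-rank correction."""
    base_out = matvec(base_weight, x)
    delta = matvec(adapter.B, matvec(adapter.A, x))
    return [b + adapter.scale * d for b, d in zip(base_out, delta)]


def parameter_counts(d_in, d_out, rank, tenants):
    """Compare one full weight copy per tenant with a shared base plus adapters."""
    per_tenant_copies = tenants * d_in * d_out
    shared_with_adapters = d_in * d_out + tenants * rank * (d_in + d_out)
    return per_tenant_copies, shared_with_adapters


if __name__ == "__main__":
    # Hypothetical layer size and rank; 32 tenants mirrors the experiment scale.
    print(parameter_counts(4096, 4096, 8, 32))
```

Each tenant stores only `rank * (d_in + d_out)` extra values per adapted matrix, so the cost of adding a tenant stays small relative to the base weights that everyone shares.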

Why It Matters

Makes reinforcement learning fine-tuning accessible and cost-effective for more teams and complex agentic applications.