Research & Papers

Rocks, Pebbles and Sand: Modality-aware Scheduling for Multimodal Large Language Model Inference

New scheduler reduces time-to-first-token by 78.5% for latency-critical multimodal AI requests.

Deep Dive

Researchers Konstantinos Papaioannou and Thaleia Dimitra Doudali have introduced RPS-Serve, a novel scheduling system designed to solve the performance bottlenecks plaguing Multimodal Large Language Models (MLLMs). MLLMs like GPT-4, Gemini, and Copilot process text, images, and videos, but existing inference systems, optimized for text-only workloads, fail under this mixed load. Large video requests act as 'rocks,' monopolizing GPU memory and compute and causing severe head-of-line blocking that delays all subsequent requests, including simple text queries. The resulting delays undermine the interactive experience users expect from modern AI assistants.
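To see why head-of-line blocking is so damaging, consider a minimal simulation of a first-come, first-served (FCFS) queue. This is purely illustrative, with made-up costs rather than measurements from the paper: one large video prefill occupies the GPU, and every text request behind it inherits that delay in its time-to-first-token (TTFT).

```python
from collections import deque

# (name, prefill_cost_ms) -- illustrative numbers, not measured values
requests = deque([
    ("video_1", 8000),  # a 'rock': a long multimodal prefill
    ("text_1", 50),
    ("text_2", 50),
])

clock_ms = 0
ttft = {}
while requests:
    name, cost = requests.popleft()
    clock_ms += cost        # the GPU is busy for the whole prefill
    ttft[name] = clock_ms   # first token arrives when prefill finishes

# Under FCFS, both tiny text requests inherit the video's 8-second delay.
```

Here each 50 ms text query waits over 8 seconds for its first token, even though serving it first would barely delay the video. This is exactly the asymmetry a modality-aware scheduler exploits.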

RPS-Serve's key innovation is its 'rocks, pebbles, and sand' abstraction, which dynamically classifies incoming requests by their resource intensity. It then applies a modality-aware scheduling policy that allows lightweight 'sand' (text) and 'pebbles' (images) to flow quickly around heavy 'rocks' (videos). The system uses dynamic prioritization and aging techniques to ensure large requests are not starved of resources. In evaluations across state-of-the-art MLLMs, this approach delivered dramatic improvements: reducing the average time-to-first-token (TTFT) by 54% and slashing TTFT for latency-critical requests by 78.5%.
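The policy described above can be sketched as a priority queue where modality sets a base priority and waiting time earns an aging credit. This is a simplified illustration based only on the abstraction in the preprint: the class names, priority values, and aging rate are assumptions, not RPS-Serve's actual implementation.

```python
BASE_PRIORITY = {"sand": 0, "pebble": 1, "rock": 2}  # lower = served sooner
AGING_RATE = 0.001  # priority credit per ms of waiting (assumed value)

class ModalityAwareScheduler:
    def __init__(self):
        self._queue = []  # entries: (base_priority, arrival_ms, request)

    def classify(self, request):
        # Modality as a proxy for resource intensity, per the abstraction.
        if request.get("video"):
            return "rock"
        if request.get("image"):
            return "pebble"
        return "sand"

    def submit(self, request, arrival_ms):
        base = BASE_PRIORITY[self.classify(request)]
        self._queue.append((base, arrival_ms, request))

    def next_request(self, now_ms):
        # Effective priority = base priority minus an aging credit, so
        # light 'sand' flows around 'rocks' but a long-waiting rock is
        # eventually served rather than starved.
        best = min(self._queue,
                   key=lambda e: e[0] - AGING_RATE * (now_ms - e[1]))
        self._queue.remove(best)
        return best[2]
```

With these numbers, a text request submitted just after a video still runs first, while a video that has waited 2000 ms outranks a freshly arrived image (2 - 2.0 = 0 < 1), capturing both the bypass and the anti-starvation behavior in one rule.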

The research, detailed in an arXiv preprint, addresses a critical scaling challenge as AI platforms become increasingly multimodal. By making more efficient use of available hardware resources, RPS-Serve promises to restore LLM-like responsiveness to systems burdened by diverse media inputs. This work provides a foundational scheduling framework that could be integrated into future inference servers, directly impacting the performance and user experience of popular AI platforms that handle millions of mixed-media requests daily.

Key Points
  • RPS-Serve classifies MLLM requests as 'rocks' (videos), 'pebbles' (images), or 'sand' (text) based on resource demand.
  • The scheduler reduces average time-to-first-token (TTFT) by 54% and TTFT for latency-critical requests by 78.5% versus existing systems.
  • It prevents head-of-line blocking by allowing smaller requests to bypass larger ones, ensuring interactive responsiveness for all users.

Why It Matters

Enables faster, more responsive AI assistants like ChatGPT and Copilot when processing images and videos alongside text.