Research & Papers

StrLoRA: New method lets MLLMs learn from mixed, evolving data streams

Continuous learning without forgetting—even when tasks arrive as a chaotic mix.

Deep Dive

Continual learning is crucial for Multimodal Large Language Models (MLLMs), but existing methods assume a clean separation of tasks—each training phase contains exactly one. Real-world data, however, arrives as a messy, interleaved stream of diverse tasks. To address this, researchers introduce Streaming Continual Visual Instruction Tuning (StrCVIT), a setting where models must simultaneously acquire new skills, strengthen previously seen abilities, and avoid catastrophic forgetting—all from a dynamic stream of data chunks that each contain a mixture of tasks. Current CVIT techniques fail here because they cannot reliably distinguish samples belonging to different tasks within a single chunk.

Enter StrLoRA. The framework employs a regularized two-stage expert routing mechanism. First, it selects a sparse set of relevant experts based on the textual instruction, cutting down cross-task interference. Then, it applies token-wise weighting within that subset, using cross-modal attention between local visual tokens and a global instruction representation to assign contribution weights. To keep the model stable on a non-stationary stream, a routing-stability regularization aligns current routing distributions with a historical exponential moving average. Experiments on a newly created StrCVIT benchmark show StrLoRA substantially outperforms existing continuous learning baselines, effectively enabling MLLMs to learn from continuously evolving data streams.

Key Points
  • Introduces StrCVIT, a more realistic continual learning setting where training data arrives as interleaved, dynamic task mixtures.
  • StrLoRA uses task-aware sparse expert selection and token-wise cross-modal weighting to reduce interference and boost recurring skills.
  • A routing-stability regularization with exponential moving average prevents forgetting during non-stationary stream training.

Why It Matters

Brings MLLM training closer to real-world data flows, enabling models that learn continuously without manual task separation.