Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder
A single dense Transformer model processes three modalities, slashing memory use without needing paired data.
A research team led by Kin Wai Lau has introduced Omni-C (Omni-Compress), a novel architecture designed to unify multimodal AI processing. The model addresses a critical inefficiency in current systems: the reliance on separate expert encoders for each data type (like images, audio, and text), which causes computational overhead and complexity to scale linearly with each added modality. While recent unified 'Omni-models' use Mixture-of-Experts (MoE) architectures, they still inflate parameter counts and introduce routing overhead. Omni-C's breakthrough is a single, dense Transformer-based backbone that learns shared representations across all three modalities.
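To make the design concrete, here is a minimal PyTorch sketch of the pattern described above: one dense Transformer backbone shared by every modality, with only small per-modality projection heads on top. The class name, layer sizes, and pooling choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class OmniCStyleEncoder(nn.Module):
    """Hypothetical single dense encoder shared across image, audio, and text."""

    def __init__(self, dim=768, depth=12, heads=12, embed_dim=512):
        super().__init__()
        # One dense Transformer backbone shared by all modalities.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        # Small modality-specific projection heads adapt the shared output;
        # these are the only parameters that are not shared.
        self.proj = nn.ModuleDict(
            {m: nn.Linear(dim, embed_dim) for m in ("image", "audio", "text")}
        )

    def forward(self, tokens, modality):
        # tokens: (batch, seq_len, dim); modality-specific tokenization
        # (image patches, audio frames, text subwords) is assumed upstream.
        h = self.backbone(tokens)
        pooled = h.mean(dim=1)              # simple mean pooling over the sequence
        return self.proj[modality](pooled)  # embedding in a shared space
```

Because only the tiny projection heads differ per modality, adding a new modality costs one extra linear layer rather than a full new encoder.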
Omni-C is trained using unimodal contrastive learning on large-scale, unaligned datasets, meaning it doesn't require perfectly matched image-audio-text pairs for supervision. It maximizes parameter sharing in its core encoder and uses only small, modality-specific projection heads to adapt outputs, effectively mitigating conflicts between different data types. This design allows for sequential processing of modalities, eliminating the need to load multiple experts in parallel. The result is a model that performs competitively with specialist models on unimodal and cross-modal tasks, with only modest zero-shot performance drops on audio and text that can be recovered through lightweight fine-tuning. Crucially, it achieves this while "substantially" reducing inference memory usage compared to multi-encoder baselines, paving the way for more efficient and scalable multimodal AI on standard hardware.
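The training recipe can be illustrated with a standard unimodal contrastive objective: two augmented views of the same sample are pulled together while other samples in the batch are pushed apart, so no image-audio-text pairs are ever required. The symmetric InfoNCE loss below is a hedged sketch of that idea; the paper's exact loss, augmentations, and batch schedule may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.07):
    """Symmetric InfoNCE between two batches of view embeddings, shape (B, D)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                  # (B, B) similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Each view's positive is its own augmented counterpart; all other
    # batch entries serve as negatives.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Sketch of a training step (augment() and the loaders are assumed stubs):
# batches from each modality pass sequentially through the *same* backbone,
# so no second expert ever needs to be resident.
#
# for modality, batch in interleave(image_loader, audio_loader, text_loader):
#     z1 = encoder(augment(batch, modality), modality)
#     z2 = encoder(augment(batch, modality), modality)
#     loss = info_nce(z1, z2)
```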
- Uses a single dense Transformer encoder to process images, audio, and text, avoiding the linear growth in complexity that comes with separate expert encoders.
- Trained via unimodal contrastive pretraining on unaligned data, eliminating the need for costly paired supervision or complex MoE routing.
- Achieves comparable performance to expert models while enabling low-memory, sequential inference, ideal for memory-constrained systems (see the inference sketch after this list).
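As noted above, here is a rough sketch of that sequential, single-encoder inference, reusing the hypothetical OmniCStyleEncoder from the first sketch. Only one backbone is ever resident no matter how many modalities are processed; random tensors stand in for tokenized inputs.

```python
import torch

encoder = OmniCStyleEncoder().eval()

# Stand-ins for tokenized inputs, shape (batch, seq_len, dim); real inputs
# would come from modality-specific tokenizers.
inputs = {m: torch.randn(4, 16, 768) for m in ("image", "audio", "text")}

embeddings = {}
with torch.no_grad():
    for modality, tokens in inputs.items():   # one modality at a time
        embeddings[modality] = encoder(tokens, modality)

# Because all outputs live in one shared space, cross-modal retrieval reduces
# to cosine similarity, e.g. normalized image embeddings @ text embeddings.T.
```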
Why It Matters
It enables powerful multimodal AI to run efficiently on standard hardware, reducing costs and barriers to deployment for real-world applications.