CAMMSR: Category-Guided Attentive Mixture of Experts for Multimodal Sequential Recommendation
New model dynamically weights text, images, and video based on user context and item categories.
A research team led by Jinfeng Xu has introduced CAMMSR, a novel AI architecture designed to change how recommendation systems process multimodal data such as text, images, and video. Accepted for publication at ICDE 2026, the model addresses a core limitation of current systems: their reliance on static, heuristic methods to fuse different data types. CAMMSR is built on the observation that a user's preference for an item's image versus its description isn't fixed; it shifts with the item's category and the user's own evolving interests. Modeling this dynamic enables a more nuanced, user-centric approach to content discovery than single-modality or rigidly fused models can achieve.
The technical breakthrough is the Category-guided Attentive Mixture of Experts (CAMoE) module, which learns specialized representations from multiple perspectives and explicitly models inter-modal synergies. It dynamically allocates importance to different data streams, guided by an auxiliary task that predicts item categories. Additionally, the team employs a modality swap contrastive learning task to improve alignment between different data types through sequence-level augmentation. Extensive testing on four public benchmarks shows CAMMSR consistently outperforms existing state-of-the-art models. This paves the way for the next generation of recommendation engines on streaming, e-commerce, and social platforms that can intelligently adapt which product features—a video trailer, a review snippet, or a product image—to emphasize for each individual user.
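To make the fusion idea concrete, here is a minimal, illustrative sketch of category-guided attentive mixture-of-experts fusion. It is not the authors' implementation: the toy linear experts, the dot-product gating against a category embedding, and all variable names (`category_emb`, `modality_embs`, `experts`) are assumptions introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax for the gating weights.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

d = 8  # toy embedding dimension

# Per-modality embeddings (text, image, video) for one item.
modality_embs = {m: rng.normal(size=d) for m in ("text", "image", "video")}

# Hypothetical category embedding, e.g. shaped by the auxiliary
# category-prediction task described in the article.
category_emb = rng.normal(size=d)

# Each "expert" is a toy linear projection specializing one modality view.
experts = {m: rng.normal(size=(d, d)) * 0.1 for m in modality_embs}

# Gating: score each modality by its compatibility with the category
# context, then normalize into dynamic, category-aware fusion weights.
scores = np.array([category_emb @ modality_embs[m] for m in modality_embs])
weights = softmax(scores)

# Fused item representation: weighted sum of the expert outputs.
fused = sum(w * (experts[m] @ modality_embs[m])
            for w, m in zip(weights, modality_embs))

print(weights.round(3), fused.shape)
```

Because the weights are computed from the category context rather than fixed up front, a clothing item might lean on its image embedding while a book leans on its text, which is the kind of adaptive allocation the article describes.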
- Introduces a Category-guided Attentive Mixture of Experts (CAMoE) module for dynamic, context-aware fusion of text, image, and video data.
- Outperforms existing state-of-the-art models on four public datasets, validating its adaptive and synergistic approach.
- Uses an auxiliary category prediction task and a modality swap contrastive learning task to guide fusion and improve cross-modal alignment.
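The modality-swap contrastive task can be sketched as follows, again as an assumption-laden toy rather than the paper's method: at random positions in a user's interaction sequence, the text and image embeddings are swapped to form an augmented view, and an InfoNCE-style loss pulls each original sequence representation toward its own augmented view while pushing it away from other sequences in the batch. The helper names (`swap_modalities`, `info_nce`) and mean-pooling are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def swap_modalities(seq_text, seq_image, p=0.5):
    """Sequence-level augmentation (illustrative): at random positions,
    swap which modality's embedding stands in for the item."""
    mask = rng.random(len(seq_text)) < p
    aug_text = np.where(mask[:, None], seq_image, seq_text)
    aug_image = np.where(mask[:, None], seq_text, seq_image)
    return aug_text, aug_image

def info_nce(anchors, positives, temp=0.1):
    """Contrastive loss: row i of `positives` matches row i of `anchors`;
    all other rows in the batch serve as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temp
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

B, L, d = 4, 5, 8  # batch of user sequences, sequence length, embedding dim
texts = rng.normal(size=(B, L, d))
images = rng.normal(size=(B, L, d))

# Original view: mean-pooled text sequences; positive view: the same
# sequences after modality swapping.
anchors = texts.mean(axis=1)
positives = np.stack([swap_modalities(t, i)[0].mean(axis=0)
                      for t, i in zip(texts, images)])

loss = info_nce(anchors, positives)
print(float(loss))
```

Minimizing such a loss encourages text and image embeddings of the same item sequence to be interchangeable, which is one plausible reading of the cross-modal alignment objective the article summarizes.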
Why It Matters
Enables more personalized and effective recommendations on major platforms by understanding how users truly engage with multimedia content.