Discrete Preference Learning for Personalized Multimodal Generation
A new two-stage framework turns user interactions into discrete tokens for consistent, personalized multimodal content.
A research team from multiple institutions has introduced DPPMG (Discrete Preference learning for Personalized Multimodal Generation), a framework that addresses two key limitations of current personalized generative models: the mismatch between continuous user-preference representations and the discrete token inputs expected by generators such as GPT-4 and Stable Diffusion, and the risk of inconsistency between the images and text a system generates. Their solution is a two-stage pipeline: a modal-specific graph neural network first learns each user's preferences from their multimodal interactions, and those preferences are then quantized into discrete tokens that can be injected into downstream text and image generators.
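The quantization step in the second stage can be sketched as a nearest-neighbor codebook lookup, as in vector-quantization methods. This is a minimal illustrative sketch, not the paper's actual code: the function name, embedding sizes, and codebook shape are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_preferences(user_embeddings, codebook):
    """Map each continuous preference vector to its nearest codebook entry.

    The returned indices act as discrete 'preference tokens' that a
    token-based generator can consume; the returned embeddings are the
    corresponding codebook vectors. (Illustrative sketch only.)
    """
    # Pairwise squared distances, shape (n_users, n_codes)
    dists = ((user_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    token_ids = dists.argmin(axis=1)   # one discrete token per user
    quantized = codebook[token_ids]    # embeddings injected downstream
    return token_ids, quantized

# Hypothetical sizes: 256 learnable preference tokens of dimension 64,
# and 8 users whose embeddings come from the GNN encoder.
codebook = rng.normal(size=(256, 64))
user_embeddings = rng.normal(size=(8, 64))
tokens, quantized = quantize_preferences(user_embeddings, codebook)
```

In practice the codebook would be learned jointly with the encoder (e.g. with a straight-through gradient estimator), but the lookup above captures the continuous-to-discrete conversion the framework relies on.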
To keep the generated content both personalized and consistent across modalities, the researchers designed a cross-modal consistent and personalized reward that fine-tunes the token-associated parameters during training, preserving individual user preferences while keeping generated images and text semantically aligned. Extensive experiments on two real-world datasets show significant improvements over existing methods in generating personalized, coherent multimodal content. The paper has been accepted for publication at SIGIR 2026, a premier conference in information retrieval, indicating its potential impact on next-generation recommendation systems and personalized content-creation tools.
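One plausible shape for such a reward is a weighted mix of a cross-modal consistency term (image and text embeddings should agree) and a personalization term (outputs should match the user's preference embedding). The sketch below is an assumption about that structure, not the paper's actual reward; the embeddings would in practice come from a pretrained cross-modal encoder such as CLIP.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors, in [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def combined_reward(img_emb, txt_emb, user_pref_emb, alpha=0.5):
    """Hypothetical cross-modal consistent-and-personalized reward.

    alpha weights cross-modal consistency (image vs. text similarity)
    against personalization (text vs. user-preference similarity).
    """
    consistency = cosine(img_emb, txt_emb)
    personalization = cosine(txt_emb, user_pref_emb)
    return alpha * consistency + (1 - alpha) * personalization
```

During fine-tuning, a reward like this would scale the gradient applied to the token-associated parameters (e.g. in a REINFORCE-style update), so that token embeddings drift toward outputs that are both on-preference and mutually consistent.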
- Uses a modal-specific graph neural network to learn user preferences from multimodal interactions and quantizes them into discrete tokens
- Addresses the architecture gap by converting continuous preferences to discrete tokens compatible with generators like GPT and Stable Diffusion
- Implements a cross-modal consistency reward to fine-tune parameters, ensuring personalized yet coherent text-image outputs
Why It Matters
Enables AI systems to generate truly personalized, consistent text and images together, advancing recommendation engines and creative tools.