Known Intents, New Combinations: Clause-Factorized Decoding for Compositional Multi-Intent Detection
A new lightweight decoder trained only on single intents achieves 95.7% exact match on unseen multi-intent pairs.
Researcher Abhilash Nandy has published a paper introducing a new approach to a critical problem in conversational AI: teaching models to correctly understand novel combinations of user intents. Most virtual assistants and chatbots are trained to recognize predefined combinations of requests, like "play music and set a timer." They fail when users combine known intents in new ways not seen in training data. Nandy argues that existing benchmarks weakly test this 'compositional generalization' because training and test data often share the same co-occurrence patterns.
To rigorously test the problem, Nandy created CoMIX-Shift, a controlled benchmark designed to stress compositional generalization through five specific challenges: held-out intent pairs, discourse-pattern shifts, longer/noisier sentence wrappers, held-out clause templates, and zero-shot triples. The proposed solution, ClauseCompose, is a lightweight decoder trained exclusively on singleton intents, not full multi-intent utterances. The results are striking: ClauseCompose reached 95.7% exact match on unseen intent pairs and 91.1% on unseen triples. In contrast, a fine-tuned tiny BERT model scored 0.0% on triples, and a WholeMultiLabel baseline scored 0.0% on triples and only 41.3% on a separate SNIPS-style test set.
The research demonstrates that current multi-intent detection systems are not being evaluated on their ability to generalize, which is crucial for real-world deployment where users will inevitably make novel requests. The success of the simple ClauseCompose model suggests that a factorization approach—breaking down complex utterances into clauses and recombining known intents—is a highly effective and lightweight path forward. This work pushes the field toward more robust evaluation and provides a promising method for building assistants that can truly understand the creative ways users express their needs.
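The factorization idea is easy to illustrate: segment an utterance into clauses, run a single-intent classifier on each clause, and union the per-clause predictions. The sketch below is a minimal toy illustration of that recipe, not the paper's actual model; the keyword "classifier," the intent names, and the clause splitter are all hypothetical stand-ins for learned components.

```python
# Toy illustration of clause-factorized decoding (hypothetical names and
# logic; the paper's ClauseCompose is a learned model, not keyword lookup).
import re

# Stand-in for a classifier trained only on singleton-intent utterances.
SINGLE_INTENT_KEYWORDS = {
    "PlayMusic": ["play", "music", "song"],
    "SetTimer": ["timer", "remind"],
    "GetWeather": ["weather", "forecast"],
}

def classify_clause(clause):
    """Return the single intent whose keywords appear in the clause, if any."""
    words = set(re.findall(r"[a-z]+", clause.lower()))
    for intent, keys in SINGLE_INTENT_KEYWORDS.items():
        if words & set(keys):
            return intent
    return None

def clause_factorized_decode(utterance):
    """Split on coordination cues, classify each clause, union the intents."""
    clauses = re.split(r"\band\b|\bthen\b|,|;", utterance)
    intents = []
    for clause in clauses:
        intent = classify_clause(clause)
        if intent and intent not in intents:
            intents.append(intent)
    return intents

# A pair, and a triple never seen together, both decode from singleton knowledge:
print(clause_factorized_decode("play some music and set a timer"))
# → ['PlayMusic', 'SetTimer']
print(clause_factorized_decode("check the weather, play a song, then set a timer"))
# → ['GetWeather', 'PlayMusic', 'SetTimer']
```

Because the classifier only ever sees one clause at a time, novel combinations of known intents cost nothing extra at inference, which is the intuition behind training on singletons alone.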
- ClauseCompose, a lightweight decoder, achieved 95.7% exact match accuracy on unseen combinations of intents in the CoMIX-Shift benchmark.
- It dramatically outperformed baselines, scoring 91.1% on zero-shot triples where a fine-tuned BERT model scored 0.0%.
- The research introduces a new benchmark, CoMIX-Shift, designed to rigorously test compositional generalization, a weakness in current AI assistants.
Why It Matters
Enables virtual assistants to correctly understand novel user requests, moving beyond memorized intent combinations toward compositional language understanding.