New AI retrieves sound effects from onomatopoeia images
Researchers built a bidirectional model to match comic-style text art with audio clips.
A team of researchers led by Keisuke Imoto from Ritsumeikan University has developed a bidirectional retrieval system that connects onomatopoeic images (stylized visual renderings of sound words like 'BANG' or 'SIZZLE') with corresponding audio clips. In multimedia production, especially comics and animation, creators manually search for sound effects that match the visual impression of an onomatopoeia. The new framework aims to automate that matching: it first extracts embeddings from pretrained image (CLIP) and audio (CLAP) encoders, then trains lightweight projection heads to re-align the two sets of representations in a shared space. Retrieval then works in both directions: from onomatopoeic images to sounds and from sounds to onomatopoeic images.
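To make the pipeline concrete, here is a minimal sketch of the retrieval step, assuming linear projection heads and cosine similarity in the shared space. The article does not describe the head architecture, the training loss, or the similarity measure, so all of these are assumptions, and the random vectors and weights below are stand-ins for real CLIP and CLAP embeddings and trained heads. The function names are illustrative, not taken from the researchers' code.

```python
import math
import random

def linear_head(weights, vec):
    """Apply a linear projection (a matrix stored as a list of rows) to one embedding."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    norm_a = math.sqrt(sum(x * x for x in a)) or 1.0
    norm_b = math.sqrt(sum(x * x for x in b)) or 1.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def rank(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
IMG_DIM, AUD_DIM, SHARED_DIM = 6, 4, 3  # toy sizes, not the real encoder dimensions

# Stand-ins for frozen encoder outputs (assumption: one vector per item).
image_embs = random_matrix(5, IMG_DIM)
audio_embs = random_matrix(5, AUD_DIM)

# Stand-ins for trained, modality-specific projection heads.
image_head = random_matrix(SHARED_DIM, IMG_DIM)
audio_head = random_matrix(SHARED_DIM, AUD_DIM)

# Project both modalities into the shared space.
proj_images = [linear_head(image_head, e) for e in image_embs]
proj_audio = [linear_head(audio_head, e) for e in audio_embs]

# Retrieval in both directions uses the same similarity, with the roles swapped.
image_to_sound = rank(proj_images[0], proj_audio)
sound_to_image = rank(proj_audio[0], proj_images)
print(image_to_sound, sound_to_image)
```

Because the weights here are random, the rankings are meaningless; the point is only that once both modalities live in one space, the same ranking function serves image-to-sound and sound-to-image queries.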
The team also introduced the Multimodal Image-Audio Onomatopoeia (MIAO) dataset, comprising paired onomatopoeic images and sound clips across 50 distinct sound event classes. Experimental results show that their method substantially outperforms a zero-shot baseline that uses raw CLIP and CLAP embeddings, suggesting that domain-specific re-alignment of pretrained representations is key to bridging visual and auditory onomatopoeia. The work opens up new possibilities for automated sound effect matching in comics, film, and game production, potentially cutting hours of manual search time for creators.
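The article does not say which metric the comparison uses. Recall@K is a common choice for cross-modal retrieval, and the sketch below shows how it could be computed in both directions from a similarity matrix. It assumes that each query has exactly one correct match stored at the same index, which may not reflect how MIAO pairs are defined, and the similarity values are made up for illustration.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose paired item (same index) is among the top-k candidates."""
    hits = 0
    for i, row in enumerate(similarity):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(similarity)

def transpose(matrix):
    """Swap query and candidate roles to evaluate the opposite retrieval direction."""
    return [list(col) for col in zip(*matrix)]

# Hypothetical scores: rows are image queries, columns are sound candidates.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.5],
]

for k in (1, 2):
    print(f"image->sound R@{k}: {recall_at_k(sim, k):.2f}")
    print(f"sound->image R@{k}: {recall_at_k(transpose(sim), k):.2f}")
```

Running the same function on similarity matrices from the raw encoder embeddings and from the projected embeddings would give a like-for-like comparison against a zero-shot baseline.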
- MIAO dataset contains paired onomatopoeic images and sound clips across 50 sound event classes.
- Modality-specific projection heads re-align pretrained CLIP and CLAP embeddings for cross-modal retrieval.
- The method outperforms a zero-shot baseline that uses raw CLIP and CLAP embeddings, and supports retrieval in both directions.
Why It Matters
Could automate sound effect matching for comic and animation creators, potentially saving hours of manual search.