New AI retrieves sound effects from onomatopoeia images

arXiv eess.AS May 19, 2026

⚡Researchers built a bidirectional model to match comic-style text art with audio clips.

Deep Dive

A team of researchers led by Keisuke Imoto from Ritsumeikan University has developed a bidirectional retrieval system that connects onomatopoeic images (stylized visual representations of sound words like 'BANG' or 'SIZZLE') with corresponding audio clips. In multimedia production, especially comics and animation, creators currently search by hand for sound effects that match the visual impression of an onomatopoeia. The framework automates that search: it first extracts embeddings from pretrained image (CLIP) and audio (CLAP) encoders, then trains lightweight projection heads that re-align the two representations in a shared space. Retrieval then works in both directions, from onomatopoeic images to sounds and from sounds to onomatopoeic images.
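The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: random arrays stand in for the frozen CLIP and CLAP embeddings, the projection heads are plain untrained linear maps, and the dimensions, the names `W_img`, `W_aud`, `project`, and `retrieve`, and the use of cosine similarity are all assumptions made for illustration. The article does not state the training objective, so training is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen encoder outputs (all sizes are illustrative assumptions).
n_items, d_img, d_aud, d_shared = 6, 512, 512, 128
image_emb = rng.normal(size=(n_items, d_img))  # would come from a CLIP image encoder
audio_emb = rng.normal(size=(n_items, d_aud))  # would come from a CLAP audio encoder

# Hypothetical linear projection heads, one per modality (untrained here).
W_img = rng.normal(scale=d_img ** -0.5, size=(d_img, d_shared))
W_aud = rng.normal(scale=d_aud ** -0.5, size=(d_aud, d_shared))


def project(x, w):
    """Map embeddings into the shared space and L2-normalise each row."""
    z = x @ w
    return z / np.linalg.norm(z, axis=1, keepdims=True)


def retrieve(queries, gallery, k=3):
    """Return indices of the k gallery items most cosine-similar to each query."""
    sim = queries @ gallery.T  # rows are unit-norm, so this is cosine similarity
    return np.argsort(-sim, axis=1)[:, :k]


z_img = project(image_emb, W_img)
z_aud = project(audio_emb, W_aud)

image_to_audio = retrieve(z_img, z_aud)  # for each image, ranked audio indices
audio_to_image = retrieve(z_aud, z_img)  # for each audio clip, ranked image indices
```

Because both modalities end up in the same normalised space, the same `retrieve` function serves both directions; only the roles of query and gallery are swapped.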

The team also introduced the Multimodal Image-Audio Onomatopoeia (MIAO) dataset, comprising paired onomatopoeic images and sound clips across 50 distinct sound event classes. Experimental results show that their method substantially outperforms a zero-shot baseline using raw CLIP and CLAP embeddings, indicating that domain-specific re-alignment of pretrained representations is key to bridging visual and auditory onomatopoeia. The work points toward automated sound effect matching in comics, film, and game production, potentially cutting hours of manual search time for creators.
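One common way to compare a trained retrieval model against a baseline on a class-labelled dataset is Recall@K: the fraction of queries whose top-K results contain at least one item of the same class. The article does not say which metric the authors report, so the sketch below is only an assumed example of such an evaluation; the function name `recall_at_k` and the toy similarity scores are hypothetical.

```python
import numpy as np


def recall_at_k(sim, query_labels, gallery_labels, k):
    """Fraction of queries with at least one same-class item in the top-k results."""
    top_k = np.argsort(-sim, axis=1)[:, :k]                # ranked gallery indices
    hits = gallery_labels[top_k] == query_labels[:, None]  # same-class matches
    return hits.any(axis=1).mean()


# Toy example: 3 queries, 4 gallery items, hand-written similarity scores.
sim = np.array([
    [0.9, 0.1, 0.2, 0.0],
    [0.2, 0.3, 0.8, 0.1],
    [0.1, 0.7, 0.2, 0.6],
])
query_labels = np.array([0, 1, 2])
gallery_labels = np.array([0, 1, 2, 2])

r1 = recall_at_k(sim, query_labels, gallery_labels, k=1)  # 1 of 3 queries hits
r2 = recall_at_k(sim, query_labels, gallery_labels, k=2)  # all 3 queries hit
```

Running the same function on similarity matrices from the raw embeddings and from the projected embeddings would give directly comparable scores for each retrieval direction.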

Key Points

MIAO dataset contains paired onomatopoeic images and sound clips across 50 sound event classes.
Modality-specific projection heads re-align pretrained CLIP and CLAP embeddings for cross-modal retrieval.
The method outperforms zero-shot baselines, enabling bidirectional retrieval with practical accuracy.

Why It Matters

Automates sound effect matching for comic and animation creators, saving hours of manual search.
