Multimodal embeddings at scale: AI data lake for media and entertainment workloads
AWS built a massive video search system using Nova models, processing 792K videos in 41 hours for under $30K.
AWS has demonstrated a production-ready, scalable architecture for semantic search across massive video libraries. The solution, built on Amazon Bedrock's Nova Multimodal Embeddings and Nova Pro models, processed 792,270 videos totaling 8,480 hours of content in just 41 hours. Using four c7i.48xlarge EC2 instances with 600 parallel workers, the system generated audio-visual embeddings for 15-second video chunks and added 10-15 descriptive tags per video. The total first-year cost was $27,328, with $17,096 spent on Nova embeddings and $571 on Nova Pro tagging.
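The 15-second chunking step can be sketched as a simple boundary computation. A minimal sketch, assuming the function name and return shape (only the 15-second window comes from the article):

```python
def chunk_video(duration_seconds: float, chunk_seconds: int = 15) -> list[tuple[float, float]]:
    """Split a video's timeline into fixed-length (start, end) windows.

    The article describes generating one audio-visual embedding per
    15-second chunk; the final chunk may be shorter than the window.
    """
    chunks = []
    start = 0.0
    while start < duration_seconds:
        end = min(start + chunk_seconds, duration_seconds)
        chunks.append((start, end))
        start = end
    return chunks

# A 38.5-second clip yields two full 15 s chunks plus an 8.5 s tail.
print(chunk_video(38.5))  # [(0.0, 15.0), (15.0, 30.0), (30.0, 38.5)]
```

Each resulting window would then be sent to the embedding model, so an 8,480-hour library at 15 seconds per chunk works out to roughly 2 million embedding calls.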
The architecture enables three powerful search modes: converting natural language queries to embeddings for text-to-video search, comparing video embeddings directly for video-to-video search, and hybrid search that combines vector similarity (70% weight) with keyword matching (30% weight). The embeddings are stored in an OpenSearch k-NN index, while metadata tags reside in a separate text index. This approach lets media companies move beyond manual tagging and keyword-based systems to capture the full semantic richness of video content at unprecedented scale and cost efficiency.
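The 70/30 hybrid scoring can be illustrated with a small score-fusion function. Only the weights come from the article; the min-max normalization step and the function names are assumptions, included because vector-similarity and keyword (BM25-style) scores live on different scales and are typically normalized before being combined:

```python
def min_max_normalize(scores: list[float]) -> list[float]:
    """Rescale a score list to [0, 1] so differently scaled
    score types can be combined fairly."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(vector_scores: list[float], keyword_scores: list[float],
                  vector_weight: float = 0.7, keyword_weight: float = 0.3) -> list[float]:
    """Weighted fusion of per-document vector-similarity and keyword
    scores, using the article's 70/30 split."""
    v = min_max_normalize(vector_scores)
    k = min_max_normalize(keyword_scores)
    return [vector_weight * vs + keyword_weight * ks for vs, ks in zip(v, k)]

# Doc 0 wins on vector similarity, doc 2 on keyword match; with the
# 70/30 weighting the vector-similar document ranks first.
print(hybrid_scores([0.95, 0.80, 0.60], [1.2, 3.4, 7.9]))
```

The same fusion shape is what OpenSearch's hybrid search applies server-side when a query pipeline normalizes and weights sub-query scores.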
- Processed 792,270 videos (8,480 hours) in 41 hours using 4x c7i.48xlarge EC2 instances with 600 parallel workers
- First-year total cost of $27,328, with $17,096 for Nova Multimodal Embeddings and $571 for Nova Pro tagging
- Enables text-to-video, video-to-video, and hybrid semantic search using 1024-dimensional embeddings stored in OpenSearch
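A text-to-video lookup against the k-NN index described above can be sketched as an OpenSearch query body. The 1024-dimensional vectors and the k-NN index come from the summary; the field name `video_embedding` is an assumed placeholder for whatever the index mapping actually uses:

```python
def knn_query(query_embedding: list[float], k: int = 10) -> dict:
    """Build an OpenSearch k-NN query body retrieving the k nearest
    video chunks to a query embedding.

    The field name 'video_embedding' is illustrative, not from the article.
    """
    assert len(query_embedding) == 1024, "Nova embeddings are 1024-dimensional"
    return {
        "size": k,
        "query": {
            "knn": {
                "video_embedding": {
                    "vector": query_embedding,
                    "k": k,
                }
            }
        },
    }
```

For text-to-video search, the natural language query is first embedded with the same Nova model, and the resulting vector is passed here; video-to-video search passes a stored video embedding instead.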
Why It Matters
Enables media companies to implement semantic video search at scale, replacing manual tagging with AI-powered discovery at predictable costs.