Developer Tools

Cost-effective deployment of vision-language models for pet behavior detection on AWS Inferentia2

The Furbo pet camera now runs BLIP vision models 2–3x cheaper on Amazon’s custom AI silicon.

Deep Dive

Tomofun, the Taiwan-based pet-tech startup behind the Furbo Pet Camera, faced a cost challenge: its always-on AI inference for real-time pet behavior detection, which uses the BLIP (Bootstrapping Language-Image Pre-Training) vision-language model, was running on expensive GPU-based EC2 instances. To cut costs without sacrificing accuracy, Tomofun migrated to Amazon EC2 Inf2 instances powered by AWS Inferentia2, Amazon’s custom AI chip designed for high-throughput, low-cost inference.

The new architecture uses two layers of Auto Scaling groups: the first layer hosts API servers that receive images from Furbo cameras via CloudFront and Elastic Load Balancing, and the second layer runs the BLIP model in containers on Inf2 instances. CloudWatch monitors latency, throughput, and error rates to trigger scaling. Critically, the system can switch between GPU and Inferentia2 backends in real time without changing the upstream API, preserving availability and deployment flexibility. By compiling BLIP with the AWS Neuron SDK and leveraging Inferentia2’s efficiency, Tomofun achieved significant cost savings while maintaining the throughput needed to serve hundreds of thousands of devices.
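
The post describes the compilation step at a high level; in practice it maps onto the standard torch-neuronx tracing workflow. The following is a minimal sketch, assuming the Hugging Face Salesforce/blip-image-captioning-base checkpoint and a fixed 224×224 input; the article does not specify which BLIP variant, input shape, or model wrapping Tomofun actually uses.

```python
# Minimal sketch: tracing a BLIP vision encoder for Inferentia2 with
# torch-neuronx. Checkpoint, input shape, and wrapper are assumptions;
# the article does not say which BLIP variant Tomofun deploys.
import torch
import torch_neuronx
from transformers import BlipForConditionalGeneration


class VisionEncoder(torch.nn.Module):
    """Wrap the vision tower so tracing sees plain tensors in and out."""

    def __init__(self, blip):
        super().__init__()
        self.vision_model = blip.vision_model

    def forward(self, pixel_values):
        # return_dict=False yields a tuple; [0] is the last hidden state
        return self.vision_model(pixel_values, return_dict=False)[0]


blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"  # assumed checkpoint
).eval()

# Neuron compiles a static graph, so the example input fixes the shape;
# variable batch sizes are usually handled by padding or bucketing.
example = torch.rand(1, 3, 224, 224)
neuron_encoder = torch_neuronx.trace(VisionEncoder(blip), example)

# The traced artifact is a TorchScript module that runs on NeuronCores.
torch.jit.save(neuron_encoder, "blip_vision_neuron.pt")
```

The CloudWatch-driven scaling can likewise be pictured as a target-tracking policy on the inference-layer Auto Scaling group. The group name, metric, namespace, and latency target below are illustrative, not values from the article.

```python
# Sketch: target-tracking scaling for the inference-layer Auto Scaling
# group, keyed to a custom CloudWatch latency metric published by the
# serving containers. All names and the target value are assumptions.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="furbo-inf2-inference",  # hypothetical ASG name
    PolicyName="p95-latency-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "CustomizedMetricSpecification": {
            "MetricName": "InferenceLatencyP95",  # hypothetical metric
            "Namespace": "Furbo/Inference",       # hypothetical namespace
            "Statistic": "Average",
        },
        "TargetValue": 200.0,  # assumed latency target, in milliseconds
    },
)
```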

Key Points
  • Tomofun replaced GPU-based EC2 instances with AWS Inferentia2 (Inf2) instances for BLIP vision-language model inference
  • Two-layer Auto Scaling architecture isolates API servers from inference containers, with CloudWatch-driven scaling
  • System retains ability to fall back to GPU backends without API changes, ensuring no disruption to Furbo users (a sketch follows this list)
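
The GPU fallback in the last point can be pictured as a backend selector inside the serving container: one entry point loads either the Neuron artifact or the stock GPU model, so nothing upstream of the API changes. This is a hypothetical sketch, not Tomofun’s serving code; the environment variable, artifact path, and checkpoint are assumptions.

```python
# Hypothetical backend selector: one serving entry point, two backends.
# Env var, artifact path, and checkpoint are assumed for illustration.
import os
import torch

BACKEND = os.environ.get("INFERENCE_BACKEND", "neuron")

if BACKEND == "neuron":
    # Traced Neuron artifact produced at build time (see compile sketch);
    # inputs stay on CPU and the traced graph dispatches to NeuronCores.
    _model = torch.jit.load("blip_vision_neuron.pt")

    def _forward(pixel_values: torch.Tensor) -> torch.Tensor:
        return _model(pixel_values)
else:
    # GPU fallback path: load the same encoder on CUDA.
    from transformers import BlipForConditionalGeneration

    _model = BlipForConditionalGeneration.from_pretrained(
        "Salesforce/blip-image-captioning-base"  # assumed checkpoint
    ).vision_model.to("cuda").eval()

    def _forward(pixel_values: torch.Tensor) -> torch.Tensor:
        return _model(pixel_values.to("cuda"), return_dict=False)[0]


def embed(pixel_values: torch.Tensor) -> torch.Tensor:
    """Backend-agnostic entry point the API layer calls."""
    with torch.no_grad():
        return _forward(pixel_values)
```

Flipping INFERENCE_BACKEND (or, equivalently, routing traffic to a GPU-backed target group at the load balancer) swaps the backend without touching the request/response contract the cameras see.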

Why It Matters

A real-world example of a pet-tech startup achieving cost-efficient, real-time AI inference at scale by using specialized hardware instead of GPUs.