Research & Papers

[P] PhAIL (phail.ai) – an open benchmark for robot AI on real hardware. Best model: 5% of human throughput, needs help every 4 minutes.

A new real-hardware benchmark shows leading models such as OpenPI and GR00T need human intervention every 3.5 to 4 minutes during a warehouse picking task.

Deep Dive

Positronic Robotics has launched PhAIL (phail.ai), an open benchmark that reports performance metrics for robot AI models running on real hardware. Unlike controlled demos or simulations, PhAIL tests Vision-Language-Action (VLA) models on physical DROID platforms performing bin-to-bin order picking, one of the most common warehouse operations. The benchmark measures what matters to operations teams: Units Per Hour (UPH) and Mean Time Between Failures (MTBF). In blind evaluations, the top-performing models, OpenPI (pi0.5) and NVIDIA's GR00T, achieved only 65 and 60 UPH respectively, with robots failing and requiring human intervention every 3.5 to 4 minutes.
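To make the two metrics concrete, here is a minimal Python sketch of how UPH and MTBF could be derived from a single run. The `Run` record, its field names, and the sample numbers are hypothetical and chosen only for illustration; PhAIL's actual telemetry format and exact metric definitions may differ. MTBF is taken here in its usual sense of operating time divided by the number of failures.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical summary of one evaluation run (not PhAIL's real schema)."""
    duration_s: float   # wall-clock length of the run, in seconds
    units_moved: int    # items successfully moved from bin to bin
    failures: int       # number of times a human had to intervene


def units_per_hour(run: Run) -> float:
    """Throughput: units moved, scaled to a one-hour rate."""
    return run.units_moved * 3600.0 / run.duration_s


def mtbf_minutes(run: Run) -> float:
    """Mean time between failures: operating minutes per failure."""
    if run.failures == 0:
        return float("inf")  # no failure observed during the run
    return run.duration_s / 60.0 / run.failures


# Made-up example: a 20-minute run with 20 units moved and 5 interventions.
run = Run(duration_s=1200, units_moved=20, failures=5)
print(units_per_hour(run), mtbf_minutes(run))
```

Both numbers are rates over the same run, so a model can post a respectable UPH and still be impractical if its MTBF is short.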

These numbers fall well short of human performance. A human teleoperating the same robot achieved 330 UPH, a gap of roughly 5x (330 / 65 ≈ 5.1) that is attributed largely to policy quality rather than hardware limitations. The most telling metric is MTBF: at 4 minutes between failures, a robot needs help about 15 times per hour, so nominally autonomous operation still requires a full-time human supervisor, which undermines the economic case for deployment. Every test run is publicly available with synced video and telemetry, and the team has open-sourced their fine-tuning dataset, training scripts, and submission pathway, inviting others to prove their models can do better. PhAIL is a reality check for the robotics field, shifting focus from flashy demos to measurable economic viability.
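The arithmetic behind the throughput gap and the supervision burden can be checked in a few lines of Python. This is only a sketch using the figures quoted in this summary; the function and constant names are illustrative, and the MTBF range is not assigned to a specific model because the summary does not say which value belongs to which.

```python
# Figures as quoted in this summary.
HUMAN_TELEOP_UPH = 330
MODEL_UPH = {"OpenPI (pi0.5)": 65, "GR00T": 60}
MTBF_RANGE_MIN = (3.5, 4.0)  # minutes between failures, low and high end


def throughput_gap(model_uph: float, human_uph: float) -> float:
    """How many times faster the human teleoperator is than the model."""
    return human_uph / model_uph


def interventions_per_hour(mtbf_min: float) -> float:
    """Expected human interventions per hour for a given MTBF in minutes."""
    return 60.0 / mtbf_min


for name, uph in MODEL_UPH.items():
    print(f"{name}: {throughput_gap(uph, HUMAN_TELEOP_UPH):.1f}x slower than teleop")

for mtbf in MTBF_RANGE_MIN:
    print(f"MTBF {mtbf} min: {interventions_per_hour(mtbf):.1f} interventions/hour")
```

With these inputs the gap comes out between about 5.1x and 5.5x, and the intervention rate between 15 and roughly 17 per hour, which is why a supervisor cannot realistically be shared across many robots.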

Key Points
  • Best AI models (OpenPI and GR00T) achieve only 65 and 60 Units Per Hour respectively, vs. 330 for a human teleoperating the same hardware
  • Mean Time Between Failures is just 3.5 to 4 minutes, meaning robots need human intervention roughly 15 to 17 times per hour
  • Performance gap is roughly 5x and attributed primarily to AI policy quality, not physical hardware limitations

Why It Matters

PhAIL puts robot AI through an economic viability test on real hardware, showing that current models are far from ready for warehouse deployment without human supervision.