Gemma 4 31B beats several frontier models on the FoodTruck Bench
A 31B parameter model outperforms giants like the 397B Qwen 3.5 on a complex, long-horizon planning task.
Google's latest open-weight model, Gemma 4 31B, has delivered a surprising result by securing third place on the FoodTruck Bench, a benchmark designed to test AI on complex, long-horizon planning tasks. With just 31 billion parameters, the model outperformed significantly larger competitors, including the 397 billion parameter Qwen 3.5, the GLM 5 series, and all variants of Anthropic's Claude Sonnet. The result is notable because it challenges the conventional wisdom that capability scales directly with parameter count, highlighting the role of architectural efficiency instead.
The FoodTruck Bench simulates running a food truck business over multiple simulated days, requiring the AI to manage inventory, plan menus, set prices, and adapt to customer feedback—a test of sequential reasoning and plan adherence. Analysis suggests Gemma 4's success stems from its ability to handle these extended task horizons and, crucially, to listen to and act upon its own advice from previous planning steps. Unlike some models that fail to complete the run or deviate from their strategy, Gemma 4 demonstrated consistent, goal-directed behavior.
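The benchmark's details are not public beyond the description above, but the core loop it describes can be sketched in miniature. The toy environment, demand curve, and plan-adjustment rules below are all hypothetical illustrations, not the actual benchmark: the point is the structure, where each day's "plan notes" are fed forward and acted on the next day, which is exactly the behavior Gemma 4 is credited with sustaining.

```python
from dataclasses import dataclass


@dataclass
class FoodTruckDay:
    """Toy single-day market (hypothetical dynamics, not the real benchmark)."""
    base_demand: int = 50

    def run(self, price: float, stock: int) -> tuple[int, float]:
        # Linear demand curve: each $1 above $5 loses 5 customers.
        demand = max(0, self.base_demand - int(5 * (price - 5.0)))
        sold = min(demand, stock)
        return sold, sold * price


def run_episode(days: int = 3) -> float:
    """Long-horizon loop: the agent's own plan notes from one day
    become the inputs to the next day's decisions."""
    env = FoodTruckDay()
    notes = {"price": 7.0, "stock": 40}  # initial plan
    revenue = 0.0
    for _ in range(days):
        sold, earned = env.run(notes["price"], notes["stock"])
        revenue += earned
        # "Act on its own advice": revise tomorrow's plan from today's outcome.
        if sold < notes["stock"]:
            # Leftovers: cut price, order less.
            notes = {"price": notes["price"] - 0.5, "stock": sold + 5}
        else:
            # Sold out: raise price, order more.
            notes = {"price": notes["price"] + 0.5, "stock": notes["stock"] + 10}
    return revenue
```

A model that "deviates from its strategy" corresponds here to discarding `notes` between iterations; over many simulated days that compounds into lost revenue, which is what the benchmark's scoring captures.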
This benchmark result is more than just a leaderboard upset; it signals a shift in how we evaluate AI models. It moves beyond simple question-answering to assess practical, multi-step problem-solving in a dynamic environment. For developers and researchers, it underscores that smaller, well-architected models can punch far above their weight class in specific, complex domains, potentially offering a more efficient path to capable AI agents.
- Gemma 4 31B, a smaller open model, beat the massive 397B Qwen 3.5 and Claude Sonnet on FoodTruck Bench.
- The benchmark tests long-horizon planning for a simulated business, requiring multi-day strategy and adaptation.
- Key to its success was the model's ability to follow its own internal advice across sequential planning steps.
Why It Matters
It shows that smaller, efficient models can rival giants at complex agentic tasks, shifting the cost-benefit calculus for AI deployment.