AI models are terrible at betting on soccer—especially xAI Grok

A new benchmark reveals top AI models like GPT-5.4 and Claude Opus 4.6 can't beat human sports bettors.

Deep Dive

A new study from AI startup General Reasoning reveals a significant gap in AI capabilities when it comes to long-term, real-world reasoning. The company's "KellyBench" report tested eight top AI models—including Anthropic's Claude Opus 4.6, OpenAI's GPT-5.4, Google's Gemini models, and xAI's Grok 4.20—in a simulated betting environment based on the 2023-24 Premier League soccer season. Each AI agent was given a £100,000 bankroll, detailed historical data, and three attempts to build a profitable betting model as the season progressed. The results were stark: every frontier model lost money on average, with many experiencing total ruin. Claude Opus 4.6 performed best but still averaged an 11% loss, while Grok 4.20 went bankrupt.
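
The benchmark's name is a nod to the Kelly criterion, the classic formula for sizing a bet as a fraction of bankroll given an estimated edge. The report doesn't reproduce the agents' staking logic, so the following is only a minimal sketch of the idea in Python; the kelly_fraction helper and the example odds are illustrative, not taken from the study.

    def kelly_fraction(p_win: float, decimal_odds: float) -> float:
        """Kelly criterion: fraction of bankroll to stake, given an
        estimated win probability and the bookmaker's decimal odds."""
        b = decimal_odds - 1.0        # net winnings per unit staked
        if b <= 0:
            return 0.0                # odds of 1.0 or less pay nothing
        q = 1.0 - p_win               # probability of losing
        f = (b * p_win - q) / b       # f* = (bp - q) / b
        return max(f, 0.0)            # never stake on a negative edge

    # Example: an agent estimates a 50% chance of a home win priced at 2.30.
    bankroll = 100_000  # the study's starting bankroll (£)
    stake = bankroll * kelly_fraction(0.50, 2.30)
    print(f"Suggested stake: £{stake:,.0f}")  # about £11,538

Consistently staking much more than this fraction makes eventual ruin almost certain, which is one plausible mechanism behind the bankruptcies the study reports.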

The findings challenge the prevailing narrative of imminent, widespread AI automation in complex professional fields. According to Ross Taylor, CEO of General Reasoning and a former Meta AI researcher, typical AI benchmarks are flawed because they exist in "very static environments," unlike the chaotic real world. While AI has shown remarkable progress in tasks like software engineering, this study demonstrates its current shortcomings in scenarios that require adapting to new events, managing risk, and reasoning over extended time horizons. The report serves as a counterweight to Silicon Valley hype: white-collar professionals in finance, analytics, and strategy may have more job security than feared, because AI still struggles with the nuanced, long-term decision-making that defines many high-value business activities.

Key Points
  • Every AI model tested lost money on average in the simulated Premier League betting season, with Anthropic's Claude Opus 4.6 posting the smallest average loss, at 11%.
  • xAI's Grok 4.20 performed worst, going bankrupt and failing to complete all of its attempts, while Google's Gemini 3.1 Pro showed high volatility, posting a 34% profit on one run but going bankrupt on another.
  • The study's authors conclude AI "systematically underperforms humans" in this long-horizon scenario, highlighting a critical gap between AI's coding prowess and its real-world reasoning abilities.

Why It Matters

The study reveals a major limitation of current AI: it struggles with the long-term, adaptive reasoning that is crucial to finance, strategy, and analytics roles.