Open Source

We gave 12 LLMs a startup to run for a year. GLM-5 nearly matched Claude Opus 4.6 at 11× lower cost.

A year-long startup simulation highlights GLM-5's cost-efficiency: it performed within 5% of Opus for 1/11th the API cost.

Deep Dive

Collinear AI's new YC-Bench benchmark puts large language models through a grueling, year-long simulation in which they act as the CEO of a startup. The environment features hundreds of decision points, delayed and sparse feedback, and a market where 35% of clients secretly inflate work requirements. The results, from testing 12 models with three random seeds each, reveal a stark performance hierarchy. Claude Opus 4.6 topped the leaderboard with an average of $1.27 million in final capital, but the real story is GLM-5, which came within 5% of Opus's result at just $7.62 per API run, an 11x cost reduction compared to Opus's $86. GPT-5.4 placed third with $1.00 million, while every other model, including several that went bankrupt, failed even to preserve the $200,000 in starting capital.
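The scoring protocol is straightforward even if the environment is not: each model starts with $200,000, faces hundreds of sequential decisions whose payoffs arrive late, and is ranked by mean final capital across three seeds. Here is a minimal Python sketch of what such an outer loop could look like; the environment dynamics, function names, and payoff numbers are illustrative assumptions, not the benchmark's actual code.

```python
import random
from statistics import mean

STARTING_CAPITAL = 200_000.0
DECEPTION_RATE = 0.35   # share of clients that secretly inflate requirements
N_TURNS = 300           # "hundreds of decision points" (assumed count)
SEEDS = (0, 1, 2)       # three random seeds per model

def run_simulated_year(decide, seed):
    """One hypothetical run: the agent commits to fixed-price deals up
    front and only learns the true (possibly inflated) cost after a delay."""
    rng = random.Random(seed)
    capital = STARTING_CAPITAL
    pending = []  # (settle_turn, payoff): feedback is delayed and sparse
    for turn in range(N_TURNS):
        # Settle any deals whose delayed feedback has arrived.
        capital += sum(p for t, p in pending if t == turn)
        pending = [(t, p) for t, p in pending if t > turn]

        stated_cost = rng.uniform(1_000, 20_000)
        deceptive = rng.random() < DECEPTION_RATE  # quietly needs ~2x the work
        true_cost = stated_cost * (2.0 if deceptive else 1.0)
        if decide(turn, capital, stated_cost):     # the CEO decision
            revenue = stated_cost * 1.4            # fixed-price contract
            pending.append((turn + rng.randint(5, 30), revenue - true_cost))
        if capital <= 0:
            return 0.0                             # bankruptcy ends the run
    return capital + sum(p for _, p in pending)

def score(decide):
    """Benchmark-style score: mean final capital across the three seeds."""
    return mean(run_simulated_year(decide, s) for s in SEEDS)

# A naive baseline CEO that takes every deal it could afford to lose.
print(f"${score(lambda turn, capital, cost: capital > cost):,.0f}")
```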

Beyond raw scores, the benchmark exposes a critical capability most evaluations miss: long-horizon coherence under delayed feedback. Success wasn't tied to model size or standard benchmark scores, but to whether a model consistently used a persistent scratchpad to record what it had learned. Top performers like Opus and GLM-5 actively maintained and rewrote their internal notes roughly 34 times per simulation run. Lower-performing models, by contrast, averaged between 0 and 2 scratchpad entries, often collapsing into repetitive strategy loops or abandoning their plans. For developers building production agentic pipelines, the cost-efficiency curve is dramatic: Kimi-K2.5 leads in revenue per API dollar, at 2.5x the next-best model. The fully open-source benchmark provides a new, rigorous test of real-world agentic reasoning.
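The scratchpad behavior the results reward maps onto a familiar agent pattern: carry a mutable notes buffer across turns, inject it into every prompt, and ask the model to rewrite it. Below is a hypothetical sketch of that loop; the prompt wording and the `call_llm` interface are assumptions, not the benchmark's actual harness.

```python
def agent_turn(call_llm, observation, scratchpad):
    """One turn with a persistent scratchpad: the model sees its own prior
    notes, acts, and may rewrite the notes for future turns."""
    prompt = (
        "You are the CEO of a startup.\n"
        f"Your notes from previous turns:\n{scratchpad}\n\n"
        f"Current situation:\n{observation}\n\n"
        "Reply with two sections:\n"
        "ACTION: the decision to take now.\n"
        "NOTES: your notes, rewritten to keep only what still matters."
    )
    reply = call_llm(prompt)  # call_llm: str -> str, any chat-completion wrapper
    action, _, notes = reply.partition("NOTES:")
    return action.replace("ACTION:", "").strip(), (notes.strip() or scratchpad)

def run(call_llm, observations):
    """The scratchpad persists across the whole run; observations do not."""
    scratchpad, actions = "(empty)", []
    for obs in observations:
        action, scratchpad = agent_turn(call_llm, obs, scratchpad)
        actions.append(action)
    return actions

# Smoke test with a dummy model that always records the same learning.
dummy = lambda _: "ACTION: ship it\nNOTES: clients understate work; pad estimates."
print(run(dummy, ["turn 1", "turn 2"]))
```

The key design choice is that the notes are rewritten rather than appended: the model is forced to re-prioritize what it keeps, which is the compaction behavior the benchmark found separating the top models from the rest.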

Key Points
  • GLM-5 achieved $1.21M in final funds, within 5% of Claude Opus 4.6's $1.27M, but at an 11x lower API cost ($7.62 vs. $86 per run; see the quick arithmetic check after this list).
  • Long-horizon coherence was the key differentiator; top models actively used a persistent scratchpad, rewriting notes ~34 times per simulation.
  • The YC-Bench simulation is fully open-source and tests models over hundreds of turns with delayed feedback and deceptive clients, 35% of whom secretly inflate work requirements.
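The headline ratios follow directly from the reported figures. A quick arithmetic check, using the numbers from the article; the capital-per-API-dollar framing at the end is my own analogue of the benchmark's revenue-per-API-dollar metric, not its exact definition:

```python
# Reported figures: mean final capital and API cost per run.
opus = {"final": 1_270_000, "cost": 86.00}
glm5 = {"final": 1_210_000, "cost": 7.62}

gap = 1 - glm5["final"] / opus["final"]   # performance gap vs. Opus
ratio = opus["cost"] / glm5["cost"]       # cost advantage of GLM-5
print(f"GLM-5 trails Opus by {gap:.1%} at {ratio:.1f}x lower API cost")
# -> GLM-5 trails Opus by 4.7% at 11.3x lower API cost

# Final capital earned per API dollar spent, per model:
for name, m in (("Opus 4.6", opus), ("GLM-5", glm5)):
    print(f"{name}: ${m['final'] / m['cost']:,.0f} per API dollar")
```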

Why It Matters

This benchmark provides a realistic, cost-focused evaluation for deploying AI agents in complex, long-term business scenarios, shifting focus from raw performance to operational efficiency.