Extended NYT Connections Benchmark: Model Introduction Date vs. Performance by Lab since 2024
Open-source benchmark reveals which labs' models are improving fastest at complex reasoning tasks.
A new open-source benchmarking tool offers a clear view of how quickly large language models are evolving. Created by independent researcher Lech Mazur, the extended NYT Connections benchmark tracks how models from major AI labs, including OpenAI, Anthropic, Google, and Meta, have performed on the popular word association puzzle since January 2024. Unlike traditional benchmarks that measure capability at a single point in time, it visualizes how each lab's models have improved at reasoning and pattern recognition month over month.
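The chart named in the title can be reproduced from published scores with a few lines of plotting code. The sketch below assumes a simple CSV with one row per model listing its lab, introduction date, and benchmark score; the file name and column names are illustrative assumptions, not the repository's actual data layout.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Assumed layout: one row per model with its lab, introduction date, and
# extended-Connections score. Column names are illustrative, not the repo's.
df = pd.read_csv("connections_scores.csv", parse_dates=["introduced"])

fig, ax = plt.subplots(figsize=(8, 5))
for lab, group in df.groupby("lab"):
    group = group.sort_values("introduced")
    ax.plot(group["introduced"], group["score"], marker="o", label=lab)

ax.set_xlabel("Model introduction date")
ax.set_ylabel("Extended Connections score")
ax.set_title("Model introduction date vs. performance by lab")
ax.legend(title="Lab")
fig.autofmt_xdate()
plt.show()
```

Plotting each lab's models in order of introduction date is what turns point-in-time scores into the month-over-month improvement curves the tool is designed to show.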
The benchmark requires models to identify connections between seemingly unrelated words, testing capabilities such as abstract reasoning, common-sense knowledge, and lateral thinking. Early data shows significant performance gaps between model families, with some labs iterating faster than others. The tool is already surfacing patterns in development velocity and in which architectural approaches are yielding the most consistent gains on complex reasoning tasks.
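To make the task concrete: a Connections puzzle presents 16 words that must be sorted into four groups of four. The repository's actual scoring code is not reproduced here, but a minimal sketch of how a model's answer to one puzzle could be scored might look like the following; the puzzle words, group themes, and function name are hypothetical examples, not taken from the benchmark.

```python
# Hypothetical example puzzle: 16 words forming 4 gold groups of 4.
GOLD_GROUPS = [
    frozenset({"bass", "flounder", "salmon", "trout"}),   # kinds of fish
    frozenset({"ant", "drill", "island", "opal"}),        # fire ___
    frozenset({"bow", "card", "ribbon", "wrapping"}),     # gift-related
    frozenset({"second", "minute", "degree", "ounce"}),   # small amounts
]

def score_response(predicted_groups):
    """Return the fraction of gold groups the model reproduced exactly.

    `predicted_groups` is a list of four 4-word groups parsed from the
    model's answer. Order of groups and of words within a group is ignored.
    """
    predicted = {frozenset(w.strip().lower() for w in group)
                 for group in predicted_groups}
    correct = sum(1 for gold in GOLD_GROUPS if gold in predicted)
    return correct / len(GOLD_GROUPS)

# An answer that swaps one word between two groups spoils both of them.
answer = [
    ["bass", "flounder", "salmon", "trout"],
    ["ant", "drill", "island", "ounce"],    # "ounce" belongs elsewhere
    ["bow", "card", "ribbon", "wrapping"],
    ["second", "minute", "degree", "opal"],
]
print(score_response(answer))  # 0.5
```

Because every word belongs to exactly one group, a single misplaced word costs two groups at once, which is part of what makes the puzzle a demanding test of lateral thinking for language models.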
Available on GitHub, the benchmark provides researchers and developers with a standardized way to compare model progress across different companies and time periods. This represents a shift toward more transparent, longitudinal evaluation of AI capabilities, moving beyond the snapshot comparisons that have dominated AI benchmarking until now.
- Tracks performance of models from OpenAI, Anthropic, Google, and Meta on NYT Connections puzzles since January 2024
- Measures improvements in complex reasoning and pattern recognition capabilities over time
- Open-source tool available on GitHub provides standardized comparison across labs and development timelines
Why It Matters
Provides transparent, longitudinal data on which AI labs are making the fastest progress in complex reasoning capabilities.