We benchmarked MiniMax M2.7 on two benchmarks. Here's how it did
The new model shows a 3.7-point benchmark jump and solves unique tasks other frontier models miss.
MiniMax's new M2.7 model has made a significant leap in performance, according to independent benchmarking by the Kilo Code team. The model scored 86.2% on the PinchBench OpenClaw agent benchmark, placing it 5th out of 50 models and just 1.2 points behind the highly regarded Claude Opus 4.6. This represents a 3.7-point improvement over its predecessor, M2.5, moving MiniMax from the middle of the pack into the top tier, where it now competes directly with models like GLM-5 and GPT-5.4.
On the comprehensive Kilo Bench—an 89-task evaluation testing autonomous coding across domains from git operations to cryptanalysis—M2.7 achieved a 47% pass rate, coming in second behind Qwen3.5-plus. The analysis revealed a distinct behavioral profile: M2.7 tends to 'over-explore' by extensively reading surrounding files and analyzing dependencies before writing code. This approach allows it to solve unique reasoning tasks that other models miss, such as a complex SPARQL task requiring nuanced understanding of filter criteria. However, this thoroughness can sometimes lead to timeouts on time-sensitive problems.
The benchmarking data suggests these advanced models are not interchangeable but complementary. A hypothetical 'oracle' that could select the best model for each specific task would solve 67% of the Kilo Bench tasks, a 36% relative improvement over the best single model's performance. This highlights that while M2.7 may not lead in raw pass rate, its unique problem-solving capabilities and affordable pricing make it a valuable tool in a developer's arsenal, particularly for complex reasoning challenges where other frontier models fall short.
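The oracle calculation above can be sketched in a few lines: a task counts as solved if any model in the pool solves it, so the oracle rate is the size of the union of each model's solved set over the total task count. The task IDs and solve sets below are hypothetical, chosen only to illustrate the arithmetic, not Kilo Bench's actual data.

```python
# Hypothetical per-model results: model name -> set of solved task IDs.
solved = {
    "M2.7":         {1, 2, 5, 8, 9},
    "Qwen3.5-plus": {1, 3, 5, 7, 9},
    "GLM-5":        {2, 3, 4, 5},
}
total_tasks = 10

# Best single model: the highest individual pass rate.
best_single = max(len(s) for s in solved.values()) / total_tasks

# Oracle: a task counts as solved if ANY model solves it,
# so take the union of all solved sets.
oracle_solved = set().union(*solved.values())
oracle_rate = len(oracle_solved) / total_tasks

print(f"best single: {best_single:.0%}, oracle: {oracle_rate:.0%}")
# → best single: 50%, oracle: 80%
```

The gap between the two numbers is exactly the value of model complementarity: tasks that only one model in the pool can solve, like M2.7's SPARQL win, widen it.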
- Scored 86.2% on PinchBench, placing 5th overall and within 1.2 points of Claude Opus 4.6
- Achieved 47% pass rate on Kilo Bench's 89 coding tasks with unique 'over-explore' problem-solving behavior
- Solves specific reasoning tasks other models miss, demonstrating models are complementary rather than interchangeable
Why It Matters
For developers, M2.7 offers a cost-effective model with unique reasoning strengths that can solve problems even frontier models cannot.