Devstral Small 2 24B Severely Underrated
A 24B parameter model outperformed 30B+ competitors on a unique reinforcement learning task, running on a 16GB GPU.
A viral Reddit post from an academic developer is challenging the conventional wisdom on local AI coding assistants. Working on a personal 16GB NVIDIA RTX 4060 Ti, the user tested several popular models on a highly specific task: understanding and modifying an unpublished, novel reinforcement learning algorithm written in NumPy and accelerated with Numba's @jit decorator. The goal was to have the AI explain the code's function and expand it from a 5-element to a 7-element transitive inference task. Despite testing larger models such as GLM 4.7 Flash 30B and Qwen3 Coder 30B, some requiring overnight 4-bit quantization runs, only Devstral Small 2, a 24B-parameter model, delivered a usable, partially correct response.
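The actual algorithm from the post is unpublished, so the sketch below is only an illustrative stand-in for the *kind* of code involved: a NumPy function under Numba's @jit decorator that builds the adjacent premise pairs (A>B, B>C, ...) of a transitive inference hierarchy, where "expanding from 5 to 7 elements" amounts to growing that premise set. The function name and structure here are hypothetical.

```python
import numpy as np

# Numba is an optional accelerator here; fall back to a no-op
# decorator if it is not installed, so the sketch still runs.
try:
    from numba import jit
except ImportError:
    def jit(*args, **kwargs):
        def wrap(fn):
            return fn
        return wrap

@jit(nopython=True)
def premise_pairs(n_items):
    # Adjacent premise pairs for an n-item transitive inference
    # hierarchy: (0, 1), (1, 2), ..., (n-2, n-1), where the first
    # index of each pair outranks the second.
    pairs = np.empty((n_items - 1, 2), dtype=np.int64)
    for i in range(n_items - 1):
        pairs[i, 0] = i      # higher-ranked item
        pairs[i, 1] = i + 1  # lower-ranked item
    return pairs

# Expanding the task from 5 to 7 elements just enlarges the premise set:
print(premise_pairs(5).shape)  # (4, 2)
print(premise_pairs(7).shape)  # (6, 2)
```

The point of such a test is that nothing like this exact setup appears verbatim in training corpora, so a model must reason from the code itself rather than pattern-match a familiar snippet.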
The results underscore a critical divide in model capabilities. While many larger models excel at 'vibe coding' and common programming patterns seen in their training data, Devstral Small 2 demonstrated superior analytical reasoning on a novel, out-of-distribution problem. The model managed this even with a constrained 20k-token context window, which forced 10% of processing onto the CPU, yet it maintained usable inference speed. This practical test case suggests that for developers and researchers working on cutting-edge or non-standard code, smaller, reasoning-capable models like Devstral Small 2 may offer more intelligent assistance than larger, more generalized alternatives, all while remaining accessible on consumer-grade hardware.
- Devstral Small 2 24B was the only model to provide a partially correct analysis of novel, unpublished reinforcement learning code, beating models with 30B+ parameters.
- The test was conducted on consumer hardware—a 16GB NVIDIA 4060 Ti GPU—with the model using a 20k context window (10% on CPU) at usable speeds.
- The finding challenges the 'bigger is better' assumption for coding AIs, highlighting a niche for models that reason well about out-of-distribution code rather than relying on 'vibe coding' patterns.
Why It Matters
For researchers and developers working on novel problems, smaller, reasoning-focused models can provide more intelligent assistance than larger generalists, democratizing advanced AI tooling.