Prompts you use to test/trip up your LLMs
A viral Reddit post reveals simple, common-sense questions that stump even 26B-parameter reasoning models.
A viral Reddit post by user FenderMoon is sparking discussion among AI developers by revealing a simple method for exposing fundamental reasoning flaws in local large language models (LLMs). The user compiled a list of novel, common-sense prompts absent from training data, designed to test an AI's grasp of physical constraints and basic logic. Examples include asking whether you should start writing when your pen is across the room, or start watching a video when your phone is in another room. The creator tested these on models such as Google's Gemma 2 9B (with reasoning enabled) and a 26B-parameter Mixture-of-Experts (MoE) model, finding that while larger models handle the 'easy' prompts, they still fail on nuanced variants.
The investigation highlights a critical gap between linguistic proficiency and practical, embodied reasoning, a key hurdle for developing reliable AI agents. For instance, a prompt about sending a message without a phone passes if the word 'immediately' is included but fails hilariously without it, showing the model's reliance on superficial keyword triggers. Other failures involve models launching into bizarre ethical tangents instead of answering straightforwardly whether a sound can be heard from another room. This grassroots benchmarking provides a crucial real-world complement to standard academic benchmarks, giving developers a quick way to stress-test the 'common sense' of new models before deployment.
- Novel prompts like 'Should I start writing if my pen is across the room?' reliably stump smaller LLMs and even trip up 26B-parameter models.
- The creator intentionally used prompts not found in training data to test genuine reasoning, not memorization.
- Failure cases reveal models often miss basic physical logic and over-rely on keyword presence, like the word 'immediately'.
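A suite like this is easy to automate as a pre-deployment smoke test. Below is a minimal sketch: the prompts, the "sane answer" cue words, and the `query_model` stub are all illustrative placeholders (they are not FenderMoon's actual list), and the stub should be swapped for a real client call to whatever local model you run.

```python
# Minimal sketch of a common-sense smoke test for a local LLM.
# All prompts and cue words here are illustrative, not the original
# Reddit list; `query_model` is a stand-in for a real model client.

PROMPTS = [
    # (prompt, substrings whose presence suggests a physically sane answer)
    ("My pen is across the room. Should I start writing right now?",
     ("no", "get the pen", "pick up")),
    ("My phone is in another room. Should I start watching the video on it?",
     ("no", "get", "retrieve")),
]

def query_model(prompt: str) -> str:
    """Placeholder model. Replace with a real call to your local
    inference server (e.g. an HTTP request to its generate endpoint)."""
    return "No - you need the item in hand before you can use it."

def passes(answer: str, cues: tuple[str, ...]) -> bool:
    """Crude keyword check: did the answer mention any sane cue?
    (Substring matching is deliberately rough; 'no' also matches 'know'.)"""
    lowered = answer.lower()
    return any(cue in lowered for cue in cues)

def run_suite() -> dict[str, bool]:
    """Run every prompt through the model and record pass/fail."""
    return {p: passes(query_model(p), cues) for p, cues in PROMPTS}

if __name__ == "__main__":
    for prompt, ok in run_suite().items():
        print(("PASS" if ok else "FAIL"), prompt)
```

A keyword check like this is only a first filter, mirroring the post's observation that models themselves latch onto keywords; borderline answers still need a human read.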
Why It Matters
These simple tests expose a core weakness in AI reasoning that must be solved before LLMs can reliably act as autonomous agents in the real world.