Developer Tools

NeuroFlake boosts flaky test classification with neuro-symbolic LLM fusion

Raises F1-score to 69.34% and roughly halves the performance drop under code perturbations

Deep Dive

NeuroFlake is a new neuro-symbolic framework for classifying flaky tests, that is, tests that non-deterministically pass or fail on the same code version. Developed by Khondaker Tasnia Hoque and Toukir Ahammed, NeuroFlake addresses the semantic fragility of standard LLMs, which often overfit to superficial patterns like variable names. The framework introduces a Discriminative Token Mining (DTM) module that automatically discovers high-fidelity tokens (e.g., concurrency primitives, async waits) and injects them directly into the LLM's attention mechanism, combining neural intuition with symbolic precision.
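To make the idea of mining discriminative tokens concrete, here is a minimal sketch. It is not the authors' DTM implementation: the scoring rule (a smoothed log-odds ratio over per-test token presence), the function names, the toy tests, and the `async_wait` label are all assumptions chosen for illustration, and the step that injects tokens into attention is not shown.

```python
import math
import re
from collections import Counter

# Identifier-like tokens only; a real pipeline would use a proper lexer.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(source):
    """Return the set of identifier-like tokens in one test body."""
    return set(TOKEN_RE.findall(source))


def mine_discriminative_tokens(tests, target_label, top_k=2):
    """Rank tokens by how strongly their presence signals target_label.

    tests: list of (source_code, label) pairs.
    The score is a smoothed log-odds ratio (an assumed stand-in for
    whatever statistic the real DTM module uses).
    """
    in_class, out_class = Counter(), Counter()
    n_in = n_out = 0
    for source, label in tests:
        tokens = tokenize(source)
        if label == target_label:
            in_class.update(tokens)
            n_in += 1
        else:
            out_class.update(tokens)
            n_out += 1

    scores = {}
    for token in set(in_class) | set(out_class):
        # Add-one smoothing keeps both probabilities strictly inside (0, 1).
        p_in = (in_class[token] + 1) / (n_in + 2)
        p_out = (out_class[token] + 1) / (n_out + 2)
        scores[token] = (math.log(p_in / (1 - p_in))
                         - math.log(p_out / (1 - p_out)))

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Hypothetical toy corpus: two timing-dependent tests and two ordinary ones.
toy_tests = [
    ("Thread t = new Thread(r); t.start(); Thread.sleep(100); assertTrue(done);", "async_wait"),
    ("executor.submit(task); Thread.sleep(50); assertEquals(1, count);", "async_wait"),
    ("int x = add(1, 2); assertEquals(3, x);", "other"),
    ("String s = fmt(1); assertEquals('1', s);", "other"),
]

top_tokens = mine_discriminative_tokens(toy_tests, "async_wait")
print([token for token, _ in top_tokens])
```

On this toy corpus the top-ranked tokens are the ones that appear in every timing-dependent test and in none of the others, while incidental names such as `t` or `count` score lower because each occurs in only one test.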

Evaluated on the heavily imbalanced FlakeBench dataset, NeuroFlake achieves an F1-score of 69.34%, surpassing the prior best of 65.79%. To test robustness, the authors applied semantic-preserving perturbations such as dead code injection and variable renaming. While baseline models degraded by 8–18 percentage points (pp), NeuroFlake remained comparatively stable, dropping only 4–7 pp. This work indicates that neuro-symbolic fusion can improve both accuracy and generalization for flaky test classification in real-world software engineering.
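The two perturbations named above can be illustrated with a short sketch. This is not the authors' perturbation tooling: it operates on a Python snippet through the standard `ast` module purely for illustration, and the helper names (`RenameVariables`, `inject_dead_code`, `perturb`) and the sample function are hypothetical. It requires Python 3.9 or later for `ast.unparse`.

```python
import ast
import copy


class RenameVariables(ast.NodeTransformer):
    """Rename selected variables and parameters without changing behavior."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def inject_dead_code(tree):
    """Prepend a never-executed statement to every function body."""
    dead = ast.parse("if False:\n    _unused = 0").body[0]
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            node.body.insert(0, copy.deepcopy(dead))
    return tree


def perturb(source, mapping):
    """Apply both perturbations and return the rewritten source text."""
    tree = ast.parse(source)
    tree = RenameVariables(mapping).visit(tree)
    tree = inject_dead_code(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


original = """
def total(values):
    result = 0
    for v in values:
        result += v
    return result
"""

perturbed = perturb(original, {"values": "a1", "result": "a2", "v": "a3"})


def run(source):
    """Execute the source and call total() on a fixed input."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"]([1, 2, 3])


print(perturbed)
print(run(original) == run(perturbed))
```

The perturbed source differs textually from the original but computes the same result, which is why a classifier that leans on surface features such as variable names can change its prediction even though the test's behavior has not changed.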

Key Points
  • NeuroFlake uses a novel Discriminative Token Mining (DTM) module to extract statistically significant source code tokens and inject them into LLM attention
  • Achieves 69.34% F1-score on FlakeBench, outperforming the prior state of the art (65.79%) by 3.55 percentage points
  • Under semantic-preserving perturbations (dead code injection, variable renaming), performance drops only 4–7 pp vs. 8–18 pp for baseline LLMs

Why It Matters

More reliable flaky test detection means fewer CI/CD false alarms and faster debugging for developers.