Developer Tools

NeuroFlake boosts flaky test classification with neuro-symbolic LLM fusion

arXiv cs.SE May 13, 2026

⚡ Raises F1-score to 69.34% and roughly halves the performance drop under code perturbations

Deep Dive

NeuroFlake is a new neuro-symbolic framework for classifying flaky tests, that is, tests that non-deterministically pass or fail on the same code version. Developed by Khondaker Tasnia Hoque and Toukir Ahammed, it addresses the semantic fragility of standard LLMs, which often overfit to superficial patterns such as variable names. The framework introduces a Discriminative Token Mining (DTM) module that automatically discovers statistically significant, high-fidelity tokens (e.g., concurrency primitives, async waits) and injects them directly into the LLM attention mechanism, combining neural pattern recognition with symbolic precision.
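To make the token-mining idea concrete, here is a minimal sketch of one way discriminative tokens could be scored. The article does not say which statistic DTM uses, so the smoothed log-odds ratio below is an assumption, as are the function names, the regex tokenizer, and the toy samples. The attention-injection step is left out because the article gives no detail on it.

```python
import math
import re
from collections import Counter


def tokenize(source):
    # Crude identifier-level tokenizer (assumption); a real system would
    # use a language-aware lexer.
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source))


def mine_discriminative_tokens(samples, top_k=4, min_count=2):
    """samples: list of (source_code, is_flaky) pairs.

    Returns the top_k (token, score) pairs, where score is a smoothed
    log-odds ratio of the token appearing in flaky vs. stable tests.
    This statistic is illustrative, not the one from the paper.
    """
    flaky_docs = [tokenize(src) for src, flaky in samples if flaky]
    stable_docs = [tokenize(src) for src, flaky in samples if not flaky]
    flaky_df = Counter(tok for doc in flaky_docs for tok in doc)
    stable_df = Counter(tok for doc in stable_docs for tok in doc)

    scores = {}
    for tok in set(flaky_df) | set(stable_df):
        if flaky_df[tok] + stable_df[tok] < min_count:
            continue  # too rare to say anything about
        # Add-one smoothed document-frequency rates per class.
        p_flaky = (flaky_df[tok] + 1) / (len(flaky_docs) + 2)
        p_stable = (stable_df[tok] + 1) / (len(stable_docs) + 2)
        scores[tok] = (math.log(p_flaky / (1 - p_flaky))
                       - math.log(p_stable / (1 - p_stable)))
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical toy corpus: three flaky tests, three stable tests.
TOY_SAMPLES = [
    ("def test_a(): Thread(target=work).start(); sleep(1); assert done", True),
    ("def test_b(): sleep(2); assert poll() == 1", True),
    ("def test_c(): Thread(target=job).start(); assert flag", True),
    ("def test_d(): assert add(1, 2) == 3", False),
    ("def test_e(): assert parse('x') == 'x'", False),
    ("def test_f(): assert add(2, 2) == 4", False),
]

mined = mine_discriminative_tokens(TOY_SAMPLES)
print(mined)
```

On this toy corpus the concurrency- and timing-related identifiers rank highest, while tokens shared by both classes (`def`, `assert`) score zero and fall outside the top results.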

Evaluated on the heavily imbalanced FlakeBench dataset, NeuroFlake achieves an F1-score of 69.34%, surpassing the prior best of 65.79% by 3.55 percentage points (pp). To test robustness, the authors applied semantics-preserving perturbations such as dead code injection and variable renaming. Baseline models degraded by 8–18 pp, while NeuroFlake dropped by only 4–7 pp. The results suggest that neuro-symbolic fusion can improve both accuracy and generalization for flaky test classification.
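The two perturbations named above can be illustrated with Python's standard `ast` module. This is a sketch only: the article does not describe the authors' perturbation tooling or the languages covered by FlakeBench, and the helper names and sample test are hypothetical.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Consistently rename the given variables and arguments."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

    def visit_arg(self, node):
        self.generic_visit(node)
        if node.arg in self.mapping:
            node.arg = self.mapping[node.arg]
        return node


def rename_variables(source, mapping):
    # Variable renaming: behavior is unchanged as long as the new names
    # do not collide with existing ones.
    tree = RenameIdentifiers(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


def inject_dead_code(source):
    # Dead code injection: prepend an unreachable branch to each function.
    tree = ast.parse(source)
    functions = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    for func in functions:
        dead_branch = ast.parse("if False:\n    _unused = 0").body[0]
        func.body.insert(0, dead_branch)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


# Hypothetical test function used only for illustration.
ORIGINAL = """
def test_total():
    items = [1, 2, 3]
    total = sum(items)
    assert total == 6
"""

renamed = rename_variables(ORIGINAL, {"items": "v0", "total": "v1"})
padded = inject_dead_code(ORIGINAL)
print(renamed)
print(padded)
```

Both outputs behave exactly like the original test, so a classifier that changes its prediction on them is reacting to surface form rather than semantics.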

Key Points

NeuroFlake uses a novel Discriminative Token Mining (DTM) module to extract statistically significant source code tokens and inject them into LLM attention
Achieves 69.34% F1-score on FlakeBench, outperforming the prior state of the art (65.79%) by 3.55 percentage points
Under adversarial perturbations (dead code injection, variable renaming), performance drops only 4–7 pp vs. 8–18 pp for baseline LLMs
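As a quick worked example of why F1 is the reported metric on an imbalanced dataset, the sketch below compares accuracy and F1 on a made-up confusion matrix and recomputes the 3.55 pp gain. The counts are hypothetical and are not taken from FlakeBench.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive (flaky) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced set: 1000 tests, of which 50 are flaky.
tp, fp, fn, tn = 30, 20, 20, 930

accuracy = (tp + tn) / (tp + fp + fn + tn)  # dominated by the stable majority
f1 = f1_score(tp, fp, fn)                   # reflects flaky-class quality

# Gain over the prior best, in percentage points (figures from the article).
gain_pp = round(69.34 - 65.79, 2)

print(f"accuracy={accuracy:.2f} f1={f1:.2f} gain={gain_pp} pp")
```

Here accuracy looks high even though the classifier misses 40% of the flaky tests, which is why F1 gives a more honest picture on data like this.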

Why It Matters

More reliable flaky test detection means fewer CI/CD false alarms and faster debugging for developers.
