Study finds TCS techniques for vision DNNs don't fully generalize to LLM code models
Uncertainty-based features excel at catching failures early in LLMs for code.
A new replication study from Asgari et al. (accepted at ISSTA 2026) investigates whether test case selection (TCS) techniques—proven effective for vision-based deep neural networks—generalize to LLMs for code. The researchers evaluated 13 selection strategies (12 feature-aware plus simple random sampling) across 17 task-specific fine-tuned models on three code classification tasks: clone detection, vulnerability detection, and technical debt prediction. They measured performance along two dimensions: accuracy estimation and early failure discovery.
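To make the accuracy-estimation setting concrete, here is a minimal sketch of the simple random sampling baseline: label a small random subset of inputs and use the fraction the model gets right as the accuracy estimate. The function name `estimate_accuracy_srs`, the budget, and the data are illustrative assumptions, not the study's code or results.

```python
import random


def estimate_accuracy_srs(correct_flags, budget, seed=0):
    """Estimate accuracy from a simple random sample of `budget` inputs.

    `correct_flags[i]` is 1 if the model's prediction on input i is correct,
    else 0. In practice only the sampled inputs would need to be labeled;
    the full list is used here only to simulate the labeling step.
    """
    rng = random.Random(seed)                    # fixed seed for reproducibility
    sample = rng.sample(correct_flags, budget)   # sample without replacement
    return sum(sample) / budget


# Hypothetical pool: 80 correct predictions and 20 failures (true accuracy 0.80).
correct_flags = [1] * 80 + [0] * 20

estimate = estimate_accuracy_srs(correct_flags, budget=25)
full = estimate_accuracy_srs(correct_flags, budget=100)  # labeling everything
print(estimate, full)
```

With the full pool labeled the estimate equals the true accuracy; with a smaller budget it is only an approximation. Feature-aware strategies aim to get closer to the true accuracy with the same number of labels.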
The results reveal that only a subset of findings from vision DNNs hold for LLM code models. Uncertainty-based features (e.g., predictive entropy) are particularly effective for early failure discovery, which is crucial when labeling budgets are tight and catching bugs early is a priority. Conversely, representation-based features (e.g., latent space distances) prove more robust for estimating overall model accuracy. However, performance varies significantly across tasks and individual models, indicating that TCS effectiveness is highly context-dependent. This work provides empirical evidence on how far TCS findings replicate beyond vision and offers actionable insights for operational evaluation of LLMs for code.
- Tested 13 selection strategies (12 feature-aware + random sampling) on 17 fine-tuned LLM code models across 3 tasks
- Uncertainty-based features outperform for early failure discovery; representation-based features better for accuracy estimation
- Performance varies widely by task and model—TCS effectiveness is context-dependent
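As an illustration of uncertainty-based selection, the sketch below ranks inputs by the predictive entropy of the model's class probabilities and picks the most uncertain ones for labeling first. It is a simplified example under assumed inputs; the helper names and the probabilities are hypothetical and do not reproduce the study's strategies.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(prob_rows, budget):
    """Return the indices of the `budget` inputs with the highest entropy."""
    ranked = sorted(
        range(len(prob_rows)),
        key=lambda i: predictive_entropy(prob_rows[i]),
        reverse=True,  # most uncertain first
    )
    return ranked[:budget]


# Hypothetical softmax outputs of a binary classifier
# (e.g. "vulnerable" vs. "not vulnerable").
probs = [
    [0.98, 0.02],  # confident prediction, low entropy
    [0.55, 0.45],
    [0.70, 0.30],
    [0.51, 0.49],  # near coin-flip, highest entropy
]

selected = select_most_uncertain(probs, budget=2)
print(selected)  # [3, 1]
```

The intuition is that inputs on which the model is least certain are more likely to be misclassified, so labeling them first tends to surface failures earlier than random sampling would.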
Why It Matters
Guides practitioners on which test selection strategies actually work for LLM code models, improving reliability and cost efficiency.