Exploring Data Augmentation and Resampling Strategies for Transformer-Based Models to Address Class Imbalance in AI Scoring of Scientific Explanations in NGSS Classroom
GPT-4 synthetic data and ALP augmentation tame class imbalance, with ALP reaching 100% precision and recall on the most imbalanced rubric categories
A team led by Prudence Djagba from Michigan State University tackled class imbalance in automated scoring of student scientific explanations using transformer-based models. They fine-tuned SciBERT on 1,466 high school responses to a physical science assessment aligned with the Next Generation Science Standards (NGSS). The rubric included 11 binary-coded analytic categories covering six scientific ideas and five common misconceptions.
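The paper's training code isn't reproduced in this summary, but the setup it describes maps onto a standard multi-label fine-tuning recipe. Below is a minimal sketch using the Hugging Face transformers API, assuming the public allenai/scibert_scivocab_uncased checkpoint; the example response and its rubric codes are hypothetical.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Public SciBERT checkpoint; the paper's exact variant is an assumption.
MODEL = "allenai/scibert_scivocab_uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# 11 independent binary rubric categories -> multi-label head with one
# sigmoid per category (BCEWithLogitsLoss is selected automatically).
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=11, problem_type="multi_label_classification"
)

# Hypothetical student response and rubric codes
# (1 = scientific idea or misconception present).
texts = ["The ball speeds up because the force of gravity keeps acting on it."]
labels = torch.tensor([[1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=torch.float)

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
out = model(**enc, labels=labels)
out.loss.backward()  # a full fine-tuning loop would follow with an optimizer

# Scoring: threshold each category's sigmoid independently.
predicted_codes = (torch.sigmoid(out.logits) > 0.5).int()
```

Treating the 11 rubric codes as independent sigmoid outputs lets a single response receive credit for several scientific ideas and misconceptions at once.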
The researchers compared three augmentation strategies: GPT-4-generated synthetic responses, EASE (a word-level extraction and filtering method), and ALP (Augmentation using Lexicalized Probabilistic context-free grammars). While fine-tuning SciBERT alone improved recall over the baseline, augmentation boosted performance dramatically. ALP achieved perfect precision, recall, and F1 scores on the most severely imbalanced categories (5, 6, 7, and 9); GPT-4 augmentation improved both precision and recall; and EASE substantially increased alignment with human scoring across all rubric categories. The study also compared these methods to traditional SMOTE oversampling, finding that targeted text-level augmentation avoids overfitting while preserving the novice-level language critical for learning progression alignment. This work was published as a conference paper at NARST 2026.
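The paper's augmentation prompts aren't given here; the sketch below shows one plausible way to generate GPT-4 synthetic minority-class responses with the OpenAI Python SDK. The prompt wording and the generate_synthetic helper are illustrative assumptions, not the authors' code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_synthetic(seed_response: str, rubric_idea: str, n: int = 5) -> list[str]:
    """Hypothetical helper: ask GPT-4 for novice-style responses that
    express a rare rubric category, to rebalance its training split."""
    prompt = (
        "You are simulating a high school student explaining a physical "
        f"science phenomenon. Each explanation must express this idea: {rubric_idea}. "
        f"Write {n} short responses in novice student language, one per line, "
        f"similar in style to: '{seed_response}'"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,  # higher temperature encourages lexical diversity
    )
    return [line for line in resp.choices[0].message.content.splitlines() if line.strip()]
```

By contrast, the SMOTE baseline operates on numeric feature vectors rather than text. A minimal sketch with imbalanced-learn, using random vectors as stand-ins for response embeddings of one rubric category:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
# Stand-ins for 768-dim sentence embeddings of scored responses:
# 180 majority-class and 20 minority-class examples for one category.
X = rng.normal(size=(200, 768))
y = np.array([0] * 180 + [1] * 20)

# SMOTE interpolates between minority-class neighbors in feature space.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(X_res.shape, np.bincount(y_res))  # (360, 768) [180 180]
```

Because SMOTE's interpolated vectors correspond to no actual student sentence, it cannot contribute new novice phrasing the way the text-level methods can, which fits the study's finding that targeted augmentation avoids overfitting while preserving novice-level language.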
- ALP phrase-level augmentation achieved perfect precision, recall, and F1 on the most imbalanced rubric categories (5, 6, 7, and 9); see the PCFG sketch after this list
- GPT-4-generated synthetic responses boosted both precision and recall for SciBERT fine-tuning
- EASE word-level extraction improved alignment with human scoring across all 11 analytic categories
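ALP itself induces lexicalized PCFG rules and their probabilities from the minority-class responses. As a toy illustration of the core mechanism only (a hand-written grammar, not the authors' learned one), here is probability-weighted sampling from a PCFG with NLTK:

```python
import random
from nltk.grammar import PCFG, Nonterminal

# Hypothetical toy grammar over fragments of student explanations;
# ALP would learn lexicalized rules like these from real responses.
grammar = PCFG.fromstring("""
S -> NP VP [1.0]
NP -> 'the ball' [0.5] | 'the cart' [0.5]
VP -> V ADV [0.6] | V [0.4]
V -> 'speeds up' [0.7] | 'slows down' [0.3]
ADV -> 'because gravity pulls it' [1.0]
""")

def sample(symbol=Nonterminal('S')) -> str:
    """Sample a sentence by walking the grammar top-down, choosing each
    production according to its probability."""
    prods = grammar.productions(lhs=symbol)
    prod = random.choices(prods, weights=[p.prob() for p in prods])[0]
    parts = []
    for sym in prod.rhs():
        parts.append(sample(sym) if isinstance(sym, Nonterminal) else sym)
    return ' '.join(parts)

print(sample())  # e.g. "the ball speeds up because gravity pulls it"
```

Each sample is a new, grammatically coherent phrase assembled from attested student fragments, which plausibly explains why phrase-level augmentation can rebalance rare categories without drifting away from novice language.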
Why It Matters
Enables scalable, accurate AI grading of student reasoning, crucial for NGSS-aligned science education.