PyTorch reproduction of TensorFlow paper underperforms by 4 pp on DermaMNIST, what cross-framework issues should I check? [R]
A 4-percentage-point gap stumps this reproduction effort despite extensive tuning.
A Reddit user (u/Plane_Stick8394) is struggling to reproduce a TensorFlow paper in PyTorch—specifically, Ahmed et al.'s "A Lightweight Hybrid Gabor Deep Learning Approach" (IJCV 2026) on DermaMNIST. The original TensorFlow implementation achieved 77.01% test accuracy, but the PyTorch reproduction consistently lands at 73-74%, a gap of roughly 4 percentage points. The architecture combines a fixed Gabor filter bank front-end (scales 0.10, 0.20, 0.40 with 4 orientations each) with a small CNN featuring one SE block, one residual block, and three fully connected layers, totaling about 340k parameters.
The user has already attempted several fixes: varying sigma_factor (1.0 vs 1.2), testing multiple random seeds (42, 0, 123), and adjusting the sigma values for the low-pass and high-pass filter channels. None of these closed the gap. Additionally, the user noticed that per-epoch validation accuracy (e.g., 76.07%) is higher than final test accuracy (73.82%), suggesting possible overfitting to the validation split or mismatched test preprocessing.
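One mechanical way to rule out the validation/test mismatch is to fit normalization statistics once, on the training split only, and route every split through the same closure so the pipelines cannot silently diverge. A minimal sketch with toy arrays standing in for the DermaMNIST splits (the `make_preprocessor` helper is illustrative, not from the thread):

```python
import numpy as np

def make_preprocessor(train_images):
    """Fit normalization stats on the training split only, then reuse
    the same closure for validation AND test so they cannot diverge."""
    mean = train_images.mean()
    std = train_images.std()

    def preprocess(images):
        # identical scaling for every split; no test-time augmentation
        return (images.astype(np.float32) - mean) / std

    return preprocess

# toy data standing in for DermaMNIST splits (28x28 RGB, uint8)
rng = np.random.default_rng(0)
train = rng.integers(0, 256, size=(8, 28, 28, 3))
test = rng.integers(0, 256, size=(4, 28, 28, 3))

preprocess = make_preprocessor(train)
x_train, x_test = preprocess(train), preprocess(test)
print(round(float(x_train.mean()), 4))  # ~0.0 by construction
```

If validation accuracy was logged through a different path than the final test evaluation (different resize call, different stats, augmentation left on), a shared function like this makes the discrepancy impossible rather than merely unlikely.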
The most likely causes are cross-framework differences. TensorFlow's 'SAME' padding computes the total pad from input size, stride, and kernel size and splits it asymmetrically when the total is odd, while PyTorch's explicit `padding=` argument is symmetric; the mismatch can alter Gabor filter responses at the borders. Default weight initialization also differs (Keras uses Glorot uniform, PyTorch uses Kaiming uniform), as do batch norm momentum conventions (Keras's default 0.99 weights the old running statistic, so it corresponds to PyTorch's momentum=0.01). Data augmentation pipelines, especially normalization statistics and resizing interpolation, could shift input distributions. The user should verify that the Gabor kernel generation matches exactly (zero-mean, L2 normalization), check padding calculations for convolution and pooling layers, and ensure the test-set evaluation uses exactly the same preprocessing as the paper. Matching these details could recover the missing 4 points.
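A concrete starting point for the kernel check is to build the Gabor bank once in NumPy, with zero-mean and L2 normalization applied explicitly, and load the same array into both frameworks so the front-ends consume bit-identical weights. A sketch of that idea (the formula is the standard real Gabor; `sigma`, `lam`, `gamma`, and the kernel size are placeholders to be replaced with the paper's exact values; how the paper's 0.10/0.20/0.40 scales map to sigma is not specified in the thread):

```python
import numpy as np

def gabor_kernel(ksize, sigma, theta, lam, gamma=0.5, psi=0.0):
    """Real Gabor kernel, made zero-mean (no DC response) and
    L2-normalized, so TF and PyTorch can share one weight array."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    k = np.exp(-(xr**2 + gamma**2 * yr**2) / (2 * sigma**2)) \
        * np.cos(2 * np.pi * xr / lam + psi)
    k -= k.mean()              # zero-mean: kill the DC component
    k /= np.linalg.norm(k)     # L2 normalization
    return k

# 4 orientations at one (placeholder) scale, mirroring the bank layout
bank = np.stack([gabor_kernel(7, sigma=2.0, theta=t, lam=4.0)
                 for t in np.linspace(0, np.pi, 4, endpoint=False)])
print(bank.shape)  # (4, 7, 7)
```

Because the array is generated outside both frameworks, any remaining response difference between the TF and PyTorch models must come from the convolution itself (padding, stride, channel ordering), which narrows the search considerably.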
- Gabor filter parameters (sigma, kernel sizes, padding) must be replicated exactly between TensorFlow and PyTorch to avoid response mismatch.
- Framework-specific defaults for weight initialization, batch norm momentum, and dropout behavior can cause systematic accuracy drops of 2-4%.
- Data preprocessing pipeline (resizing interpolation, normalization, test augmentation) differences are a common hidden source of reproducibility gaps.
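Two of the framework defaults above can be checked mechanically: TF derives the total 'SAME' pad from input size, stride, and kernel size and puts the extra pixel on the bottom/right when the total is odd, and Keras's batch norm momentum maps to PyTorch's as 1 - momentum (Keras weights the old running statistic, PyTorch weights the new batch statistic). A sketch of both rules in plain Python, no framework required:

```python
import math

def tf_same_padding(in_size, kernel, stride):
    """Total padding TF adds for 'SAME', returned as (before, after);
    the extra pixel goes after, unlike PyTorch's symmetric `padding=`."""
    out = math.ceil(in_size / stride)
    total = max((out - 1) * stride + kernel - in_size, 0)
    return total // 2, total - total // 2

def keras_to_torch_bn_momentum(keras_momentum):
    """Keras: new = m * old + (1 - m) * batch.
    PyTorch: new = (1 - m) * old + m * batch. Hence convert with 1 - m."""
    return 1.0 - keras_momentum

print(tf_same_padding(28, kernel=3, stride=2))          # (0, 1): asymmetric
print(round(keras_to_torch_bn_momentum(0.99), 2))       # 0.01
```

For stride-1 odd kernels the pad is symmetric and PyTorch's `padding=(k - 1) // 2` matches exactly; the strided layers are where the two frameworks quietly disagree, so those are the ones worth auditing first.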
Why It Matters
Reproducibility across frameworks is essential for scientific progress; even small implementation details can significantly impact reported results.