Researcher's GPT-like model on non-language series fails to learn auto-regression
Trained on 750M tokens, the model still gets stuck on a single token.
A Reddit researcher (u/gartin336) is training GPT-like decoder-only transformers in 100M, 250M, and 500M parameter variants on 750M tokens, with a vocabulary of roughly 15k–100k tokens. The setup uses AdamW (lr=1e-3, betas=[0.9, 0.95]), an effective batch size of 4M tokens, 16 epochs, 200 warmup steps, 16 layers (up to 48 tested), 16 attention heads, and a context window of 1000. Despite this, the model often gets stuck generating a single token repeatedly. The post asks whether training GPT-like models is still black magic and requests tricks. The post itself does not mention that the data is non-language.
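To put the reported numbers in perspective, the following minimal sketch works out the training budget they imply. It assumes "4M tokens" and "750M tokens" mean exactly 4,000,000 and 750,000,000, and that a final partial batch counts as one optimizer step; the function names are hypothetical and not taken from the post. It is plain arithmetic, not a diagnosis of the failure.

```python
import math


def training_budget(total_tokens, batch_tokens, epochs, warmup_steps):
    """Derive step counts from the hyperparameters quoted in the post."""
    # Assumption: a trailing partial batch still counts as one step.
    steps_per_epoch = math.ceil(total_tokens / batch_tokens)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_fraction": warmup_steps / total_steps,
    }


def tokens_per_parameter(total_tokens, n_params):
    """Unique training tokens available per model parameter."""
    return total_tokens / n_params


budget = training_budget(
    total_tokens=750_000_000,
    batch_tokens=4_000_000,
    epochs=16,
    warmup_steps=200,
)
print(budget)

for n_params in (100_000_000, 250_000_000, 500_000_000):
    ratio = tokens_per_parameter(750_000_000, n_params)
    print(f"{n_params:,} params: {ratio} unique tokens per parameter")
```

Under these assumptions the run amounts to 188 steps per epoch and about 3,000 optimizer steps in total, with between 1.5 and 7.5 unique tokens per parameter depending on the model size.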
- GPT-like models (100M–500M params) trained on 750M tokens of non-language series fail to learn next-token prediction.
- Key hyperparameters: AdamW with lr=1e-3, effective batch size of 4M tokens, context window of 1000, up to 48 layers.
- Model repeatedly generates a single token, indicating a fundamental failure in autoregressive learning.
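The single-token symptom described above can be quantified on a generated sample. The sketch below is one hedged way to do so: it measures the share of the output taken by the most frequent token. The helper names and the 0.9 threshold are hypothetical choices for illustration, not something the post specifies.

```python
from collections import Counter


def top_token_share(token_ids):
    """Return (most frequent token, its fraction of the sequence)."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    token, count = Counter(token_ids).most_common(1)[0]
    return token, count / len(token_ids)


def looks_collapsed(token_ids, threshold=0.9):
    """Flag a sample as collapsed if one token dominates it.

    The threshold is an illustrative assumption; a sensible value
    depends on how skewed the real data distribution is.
    """
    _, share = top_token_share(token_ids)
    return share >= threshold


# Example with made-up token IDs: token 7 fills 4 of 5 positions.
print(top_token_share([7, 7, 7, 7, 3]))
print(looks_collapsed([5] * 10))
```

Comparing this share against the frequency of the same token in the training data helps separate a genuinely collapsed model from data that is itself dominated by one token.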
Why It Matters
The case highlights how difficult it can be to apply language-model architectures to non-language sequences, a practical concern for professionals who train such models.