DeepMind's BYOL and Meta's JEPA face a hyperparameter selection challenge: non-monotonic loss
Evaluating self-supervised learning when loss doesn't decrease is a researcher's nightmare.
A Reddit discussion highlights a fundamental pain point in self-supervised representation learning: how do you pick hyperparameters and architectures when the training loss doesn't steadily decrease? Non-contrastive methods like BYOL (DeepMind), JEPA (Meta), and data2vec are popular yet notoriously opaque. Their training loss curve is non-monotonic, so a lower loss does not reliably indicate a better representation, and heuristics that depend on the loss, such as traditional early stopping or loss-driven learning rate schedules, become unreliable. Researchers often turn to linear probing or KNN accuracy on downstream tasks, but checking those numbers repeatedly during training risks 'p-hacking' by peeking at test data, a classic researcher-degrees-of-freedom problem.
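One common way to limit the peeking problem is to run the KNN probe on a validation split carved out of the training data, and to touch the test set only once at the end. The snippet below is a minimal sketch of that idea under stated assumptions, not a method from the discussion: `split_train_val` and `knn_accuracy` are hypothetical helper names, embeddings are assumed to be precomputed NumPy arrays, labels are assumed to be non-negative integers, and cosine similarity is an illustrative choice.

```python
import numpy as np


def split_train_val(n_samples, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) from a seeded random permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_val = int(round(n_samples * val_fraction))
    return perm[n_val:], perm[:n_val]


def knn_accuracy(bank_x, bank_y, query_x, query_y, k=5):
    """Majority-vote KNN accuracy using cosine similarity.

    bank_x / query_x: (n, d) embedding arrays.
    bank_y / query_y: non-negative integer class labels.
    Vote ties resolve to the smallest label (np.bincount + argmax).
    """
    bank_x = np.asarray(bank_x, dtype=float)
    query_x = np.asarray(query_x, dtype=float)
    bank_y = np.asarray(bank_y, dtype=int)
    query_y = np.asarray(query_y, dtype=int)

    # L2-normalize so that a dot product equals cosine similarity.
    bank_n = bank_x / np.linalg.norm(bank_x, axis=1, keepdims=True)
    query_n = query_x / np.linalg.norm(query_x, axis=1, keepdims=True)
    sims = query_n @ bank_n.T  # shape (n_query, n_bank)

    # Indices of the k most similar bank items for each query.
    nearest = np.argsort(-sims, axis=1)[:, :k]
    votes = bank_y[nearest]  # shape (n_query, k)
    preds = np.array([np.bincount(row).argmax() for row in votes])
    return float((preds == query_y).mean())
```

In this setup, hyperparameters would be compared using `knn_accuracy` on the validation indices only. This reduces, but does not remove, the degrees-of-freedom problem: many comparisons against the same validation split can still overfit to it.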
RankMe, a metric that computes the effective rank of the embedding matrix (the exponential of the entropy of its normalized singular values, obtained via SVD), was proposed as a label-free proxy for representation quality. However, methods like JEPA can already incorporate anti-collapse regularization terms (of the kind used in Barlow Twins, VICReg, or SIGReg) that directly penalize rank collapse. RankMe is then effectively absorbed into the loss and is no longer an independent signal: increasing the penalty weight can artificially inflate rank without improving real transfer. The community is now asking which metrics truly generalize. Options like contrastive accuracy on held-out views or invariance to augmentations are being debated, but no silver bullet exists yet.
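The effective-rank computation behind RankMe can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function name `rankme` and the default `eps` are assumptions made here, and the input is assumed to be an (n_samples, dim) NumPy array of embeddings.

```python
import numpy as np


def rankme(embeddings, eps=1e-7):
    """Effective rank: exp of the entropy of normalized singular values."""
    z = np.asarray(embeddings, dtype=float)
    # Only the singular values are needed, not the singular vectors.
    s = np.linalg.svd(z, compute_uv=False)
    # Normalize to a distribution; eps keeps log() finite for zero values.
    p = s / s.sum() + eps
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))
```

A collapsed embedding matrix (every row identical) scores close to 1, while a matrix whose singular values are all equal scores close to its full dimension. That also shows why the score stops being independent once the training loss itself rewards a flat singular-value spectrum.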
- Non-contrastive SSL methods (BYOL, JEPA, data2vec) lack monotonic loss, making hyperparameter selection a guessing game.
- Monitoring linear probing or KNN accuracy during training risks exploiting researcher degrees of freedom (peeking at test data).
- The RankMe metric gets 'absorbed' by existing anti-collapse regularization, which undermines its value as an independent criterion.
Why It Matters
Without robust evaluation criteria, progress in self-supervised learning stalls—impacting every downstream task from image understanding to robotics.