CCT Achieves 12% Perplexity Reduction with Category-Theoretic Inductive Biases
A 306M-parameter model beats a matched, fine-tuned GPT-2 Small baseline by 2.92 PPL on WikiText-103, with most of the gain traced to simplicial message passing.
Al Kari's new paper introduces the Cognitive Categorical Transformer (CCT), a 306M-parameter architecture that augments a pretrained GPT-2 Small backbone with components drawn from category theory and cognitive science. Under a matched-step protocol on WikiText-103 (215,000 optimizer steps, with identical data, optimizer, and schedule), CCT reaches 21.27 validation perplexity, compared with 24.19 for an identically fine-tuned GPT-2 Small baseline. That is a 2.92 PPL (12% relative) reduction beyond what in-domain fine-tuning alone provides. A retrain-from-scratch ablation that bypasses GT-Full simplicial message passing in all seven phases reaches 23.72 PPL, which localizes 84% of the architectural improvement (2.45 of 2.92 PPL) to GT-Full. The paper presents this as the first ablation-validated evidence that simplicial message passing improves language-model perplexity at this scale.
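The headline percentages follow directly from the three perplexity figures quoted above. The short Python check below reproduces them; the variable names are illustrative and not taken from the paper, and the attribution assumes the simple additive reading used in the summary (gain lost when GT-Full is bypassed, divided by total gain).

```python
# WikiText-103 validation perplexities as quoted in the summary.
baseline_ppl = 24.19   # identically fine-tuned GPT-2 Small
cct_ppl = 21.27        # full CCT
ablation_ppl = 23.72   # CCT retrained from scratch with GT-Full bypassed

# Total gain of CCT over the matched baseline.
total_gain = baseline_ppl - cct_ppl
# Portion of that gain that disappears when GT-Full is bypassed.
gtfull_gain = ablation_ppl - cct_ppl

relative_reduction = total_gain / baseline_ppl
gtfull_share = gtfull_gain / total_gain

print(f"total gain: {total_gain:.2f} PPL")
print(f"relative reduction: {relative_reduction:.1%}")
print(f"GT-Full share of gain: {gtfull_share:.1%}")
```

Under the same reading, the remaining 0.47 PPL (24.19 minus 23.72, roughly 16% of the gain) is what the other CCT components contribute without GT-Full.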
Three negative results on consistency-style categorical priors (sheaf smoothing, adjunction round-trip, and curvature regularization), together with the joint structural-prior result for GT-Full and PrecisionWeightedPP, support an empirical pattern the paper calls the structure/consistency distinction: categorical priors that add new topology improve language modeling, while those that enforce a consistency identity do not. The published GPT-2 Large (762M parameters) achieves 22.05 zero-shot PPL on WikiText-103, a figure CCT surpasses with about 2.5x fewer parameters, though the paper treats it as an external reference rather than a matched comparison. Taken together, the findings suggest that category theory can yield practical efficiency gains in transformer architectures.
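The "2.5x fewer parameters" figure can be checked the same way. This is only a sanity check on the numbers quoted above, with illustrative variable names; it does not make the zero-shot GPT-2 Large figure a matched comparison.

```python
# Figures as quoted in the summary (parameter counts in millions).
gpt2_large_params_m = 762
cct_params_m = 306
gpt2_large_zero_shot_ppl = 22.05   # published zero-shot result, external reference
cct_ppl = 21.27                    # CCT after matched-step fine-tuning

param_ratio = gpt2_large_params_m / cct_params_m
ppl_gap = gpt2_large_zero_shot_ppl - cct_ppl

print(f"parameter ratio: {param_ratio:.2f}x")
print(f"perplexity gap: {ppl_gap:.2f} PPL")
```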
- CCT (306M parameters) achieves 21.27 perplexity on WikiText-103, a 12% relative improvement over fine-tuned GPT-2 Small (24.19).
- A retrain-from-scratch ablation attributes 84% of the gain (2.45 of 2.92 PPL) to simplicial message passing (GT-Full).
- Consistency-style priors (sheaf smoothing, adjunction round-trip, curvature) showed no improvement, supporting a structure/consistency distinction.
Why It Matters
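Category-theoretic inductive biases that add structure cut perplexity by 12% against a matched fine-tuned baseline, and CCT edges past the published GPT-2 Large zero-shot result (an unmatched external reference) with about 2.5x fewer parameters, pointing toward more parameter-efficient language models.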