CCT Achieves 12% Perplexity Reduction with Category-Theoretic Inductive Biases

arXiv cs.AI May 29, 2026

⚡ A 306M-parameter model beats an identically fine-tuned GPT-2 Small by 2.92 PPL on WikiText-103, with most of the gain traced to simplicial message passing.

Deep Dive

Al Kari's new paper introduces the Cognitive Categorical Transformer (CCT), a 306M-parameter architecture that augments a pretrained GPT-2 Small backbone with components drawn from category theory and cognitive science. Under a matched-step protocol on WikiText-103 (215,000 optimizer steps, with identical data, optimizer, and schedule), CCT reaches 21.27 validation perplexity, compared with 24.19 for an identically fine-tuned GPT-2 Small baseline. That is a 2.92 PPL (12% relative) reduction beyond what in-domain fine-tuning alone provides. A retrain-from-scratch ablation that bypasses GT-Full simplicial message passing across all seven phases reaches 23.72 PPL, only 0.47 PPL better than the baseline. The remaining 2.45 of the 2.92 PPL improvement, or 84%, is therefore attributed to GT-Full. This is presented as the first ablation-validated evidence that simplicial message passing improves language-model perplexity at this scale.
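The percentages above follow directly from the three reported perplexities. The short Python check below reproduces that arithmetic; the variable names are illustrative and not taken from the paper.

```python
# Perplexities quoted in the article (WikiText-103 validation, matched-step protocol).
baseline_ppl = 24.19  # identically fine-tuned GPT-2 Small
cct_ppl = 21.27       # full CCT
ablated_ppl = 23.72   # CCT retrained from scratch with GT-Full bypassed

# Total improvement of the full model over the baseline.
total_gain = baseline_ppl - cct_ppl
relative_gain = total_gain / baseline_ppl

# Improvement that disappears when GT-Full is bypassed.
gt_full_gain = ablated_ppl - cct_ppl
gt_full_share = gt_full_gain / total_gain

print(f"total gain: {total_gain:.2f} PPL ({relative_gain:.1%} relative)")
print(f"GT-Full:    {gt_full_gain:.2f} PPL ({gt_full_share:.0%} of the total)")
```

Note that the 84% figure is a share of the architectural gain over the fine-tuned baseline, not a share of the absolute perplexity.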

Three negative results on consistency-style categorical priors (sheaf smoothing, adjunction round-trip, curvature regularization), together with the joint structural-prior result for GT-Full and PrecisionWeightedPP, support an empirical pattern called the structure/consistency distinction: categorical priors that add new topology improve language modeling, while those that enforce a consistency identity do not. For context, the published GPT-2 Large (762M parameters) reaches 22.05 zero-shot PPL on WikiText-103, a figure CCT beats with roughly 2.5x fewer parameters. The paper treats that number as an external reference rather than a matched baseline: the GPT-2 Large score is zero-shot, whereas CCT is trained in-domain. These findings suggest that category-theoretic structure can yield practical efficiency gains in transformer architectures.
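To give a feel for what a prior that "adds new topology" can look like, here is a toy sketch of one round of message passing on a simplicial complex (nodes, edges, triangles). It is a generic illustration only, not the paper's GT-Full: the function names, the scalar features, and the plain averaging (no learned weights) are all assumptions made for this example.

```python
from itertools import combinations


def build_complex(triangles):
    """Return (vertices, edges, triangles) of the complex generated by the given triangles."""
    tris = sorted({tuple(sorted(t)) for t in triangles})
    edges = sorted({e for t in tris for e in combinations(t, 2)})
    verts = sorted({v for t in tris for v in t})
    return verts, edges, tris


def mean(values):
    return sum(values) / len(values)


def simplicial_round(node_x, edges, tris):
    """One hypothetical round: lift features node -> edge -> triangle, then send them back down."""
    # Lift: an edge averages its two endpoints; a triangle averages its three edges.
    edge_x = {e: mean([node_x[v] for v in e]) for e in edges}
    tri_x = {t: mean([edge_x[e] for e in combinations(t, 2)]) for t in tris}

    # Lower: each edge mixes its own value with the triangles that contain it.
    edge_out = {}
    for e in edges:
        cofaces = [tri_x[t] for t in tris if set(e) <= set(t)]
        edge_out[e] = mean([edge_x[e]] + cofaces)

    # Each node mixes its own value with the updated edges it belongs to.
    node_out = {}
    for v, x in node_x.items():
        incident = [edge_out[e] for e in edges if v in e]
        node_out[v] = mean([x] + incident)
    return node_out


# Usage: three tokens joined by a single triangle, with scalar features.
verts, edges, tris = build_complex([(0, 1, 2)])
print(simplicial_round({0: 0.0, 1: 3.0, 2: 6.0}, edges, tris))
```

The point of the sketch is that triangles let information flow between groups of three tokens at once, a route that pairwise attention or edge-only message passing does not provide. A consistency-style prior, by contrast, would add a penalty term on existing representations without introducing any new routes.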

Key Points

CCT (306M parameters) achieves 21.27 perplexity on WikiText-103, a 12% relative improvement over fine-tuned GPT-2 Small (24.19).
Ablation reveals 84% of the gain (2.45 PPL) comes from simplicial message passing (GT-Full).
Consistency-style priors (sheaf smoothing, adjunction round-trip, curvature) showed no improvement, supporting a structure/consistency distinction.

Why It Matters

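In this study, category-theoretic inductive biases cut WikiText-103 perplexity by 12% relative to a matched fine-tuned GPT-2 Small baseline, and the 306M-parameter model scores below the published zero-shot figure for the 762M-parameter GPT-2 Large. If the result holds up, structural priors of this kind could make language models more parameter-efficient.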
