Fun fact: Anthropic has never open-sourced any LLMs
Unlike OpenAI, Meta, and Google, Anthropic has never open-sourced any LLM components for analysis.
A viral technical discussion has highlighted a significant gap in AI transparency: Anthropic has never open-sourced any component of its Claude large language models. The realization emerged from a researcher's side project comparing tokenizer efficiency for multilingual encoding: there was simply no way to analyze Claude's tokenizer, the algorithm that converts text into the numerical tokens the model processes.

This stands in stark contrast to practices at other leading AI labs. OpenAI has open-sourced the tokenizers for its GPT models (via the tiktoken library) and has released open-weight models such as GPT-2. Meta's entire Llama series (Llama 2, Llama 3) is openly available under license. Google's Gemma models are open-weight, and Google has confirmed they share a tokenizer with the closed Gemini.

Because Claude's tokenizer cannot be inspected, researchers cannot benchmark its efficiency, understand its multilingual behavior, or replicate its encoding strategies. This closed approach limits the scientific community's ability to audit, improve upon, or even fully understand the model's behavior, particularly for non-English languages. While Anthropic emphasizes safety and controlled deployment, the policy turns one of the industry's top-performing models into a black box, limiting external innovation and scrutiny.
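To make concrete what researchers are locked out of: most modern LLM tokenizers are variants of byte-pair encoding (BPE), which repeatedly merges the most frequent adjacent symbol pair into a new vocabulary entry. The sketch below is a minimal, self-contained BPE trainer and encoder (not Anthropic's or OpenAI's actual implementation; the toy corpus and merge count are illustrative assumptions). With an open tokenizer, exactly this kind of inspection — which merges exist, how many tokens a word costs — is possible; with Claude's, it is not.

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    """Learn BPE merge rules from a whitespace-split toy corpus."""
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word with the new merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

def encode(word: str, merges) -> list[str]:
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

merges = train_bpe("low low low lower lowest", num_merges=3)
print(encode("low", merges))     # a frequent word collapses to few tokens
print(encode("lowest", merges))  # rarer suffixes stay split into more tokens
```

"Tokenizer efficiency" in the researcher's sense is then just tokens emitted per character of input: a vocabulary trained mostly on English will split non-English text into many more tokens, which is precisely the property that cannot be measured for Claude.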
- Anthropic maintains a fully closed-source policy for Claude models, including core components like tokenizers.
- OpenAI, Meta (Llama), and Google (Gemma) have all open-sourced model components, enabling research and analysis.
- Researchers cannot study Claude's tokenizer efficiency, especially for multilingual tasks, hindering comparative benchmarks and reproducibility.
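A rough intuition for why multilingual efficiency varies, sketched with stdlib-only code (a simplification: real tokenizers operate on learned subwords, but byte-level representation is the floor a BPE vocabulary improves on). UTF-8 spends one byte per character on ASCII but two or three on most other scripts, so a vocabulary skewed toward English leaves non-Latin text closer to that expensive baseline:

```python
def bytes_per_char(text: str) -> float:
    """UTF-8 bytes per character: a crude lower-bound proxy for how many
    byte-level tokens a script costs before any BPE merges help."""
    return len(text.encode("utf-8")) / len(text)

print(bytes_per_char("hello"))   # ASCII: 1.0 byte per character
print(bytes_per_char("नमस्ते"))   # Devanagari: 3.0 bytes per character
```

A comparative benchmark would run the same parallel sentences through each lab's tokenizer and report tokens per character by language — feasible for GPT, Llama, and Gemma tokenizers, impossible for Claude's.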
Why It Matters
Closed models limit research reproducibility, benchmarking, and innovation, concentrating architectural knowledge within a single company.