Azercell and AWS build Azerbaijani LLM on SageMaker AI with 2x token efficiency
Custom tokenizer doubles the effective context window for the morphologically rich Azerbaijani language.
Deep Dive
Azercell Telecom, working with the AWS Generative AI Innovation Center, built an Azerbaijani LLM on Amazon SageMaker AI in six weeks. Using Liger Kernels on an ml.p5.48xlarge instance, the team achieved 23% higher training throughput and 58% lower peak GPU memory usage. A custom BBPE tokenizer halved the number of tokens per word compared with the baseline. The model is based on Llama 3.2 1B and underwent continued pre-training followed by LoRA fine-tuning for telecom use cases and a chatbot.
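To make the tokens-per-word figure concrete, here is a minimal sketch of how the metric can be computed and what halving it means for a fixed token budget. The two tokenizer functions are hypothetical stand-ins that cut words into fixed-size pieces, and the 4096-token budget is an arbitrary illustration; neither reflects the actual tokenizers or context length used in the project.

```python
def tokens_per_word(texts, tokenize):
    """Average number of tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


def words_in_context(context_tokens, fertility):
    """Approximate number of words that fit in a fixed token budget."""
    return int(context_tokens / fertility)


# Hypothetical stand-in tokenizers, for illustration only.
def baseline_tokenize(text):
    # Pretend the baseline splits every word into 2-character pieces.
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]


def custom_tokenize(text):
    # Pretend the custom tokenizer splits every word into 4-character pieces.
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]


sample = ["kitablar evlərimizdən"]  # "books", "from our houses"
baseline = tokens_per_word(sample, baseline_tokenize)
custom = tokens_per_word(sample, custom_tokenize)

budget = 4096  # arbitrary token budget, not the model's real context length
print("tokens per word:", baseline, "vs", custom)
print("words per context:", words_in_context(budget, baseline),
      "vs", words_in_context(budget, custom))
```

Because the context window is measured in tokens, halving the tokens per word roughly doubles the amount of text that fits in the same window.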
Key Points
- A custom monolingual BBPE tokenizer achieved a 2× improvement in tokens per word over baseline English-optimized tokenizers, doubling the effective context window.
- Liger Kernels optimizations on an ml.p5.48xlarge instance provided 23% higher training throughput and 58% lower peak GPU memory usage.
- Three-stage framework: tokenizer development, continued pre-training on Llama 3.2 1B, and LoRA fine-tuning for conversational AI in telecom use cases.
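The tokenizer stage can be illustrated with a toy byte-level BPE (BBPE) trainer. This is a simplified sketch in pure Python, not the project's implementation: the summary does not say which tooling was used, and the function names `train_bbpe`, `merge_pair`, and `encode` are made up for this example. It shows the core idea: start from UTF-8 bytes and repeatedly merge the most frequent adjacent pair, so that frequent words in the training language collapse into few tokens.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with the joined symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bbpe(corpus, num_merges):
    """Learn byte-level BPE merges from a list of strings (toy version)."""
    # Start from raw UTF-8 bytes, word by word, so merges never cross words.
    words = Counter()
    for line in corpus:
        for word in line.split():
            words[tuple(bytes([b]) for b in word.encode("utf-8"))] += 1

    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by byte order.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def encode(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(bytes([b]) for b in word.encode("utf-8"))
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bbpe(["evlər evlər evlər"], num_merges=10)
print(len(encode("evlər", [])), "byte tokens before training")
print(len(encode("evlər", merges)), "token(s) after training")
```

A tokenizer whose merges were learned mostly from English text has few merges covering Azerbaijani byte sequences such as the two-byte `ə`, which is one reason a monolingual vocabulary can need fewer tokens per word.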
Why It Matters
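The project shows that a language-specific tokenizer combined with kernel-level training optimizations can make LLM training for under-resourced languages more efficient, lowering costs and fitting more text into the same context window, which supports broader AI inclusion.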