Research & Papers

DiffusionPen generates Ukrainian handwriting from 308 writer styles

New dataset of 126K Ukrainian handwritten words trains AI for Cyrillic style transfer

Deep Dive

Handwritten text generation (HTG) has mostly focused on Latin scripts, leaving low-resource writing systems like Cyrillic underserved. To close this gap, Andrii Ahitoliev and Pavlo Berezin built a Ukrainian handwritten word dataset of 126,177 images from 308 distinct writers, using connected-component segmentation, quality filtering, and targeted oversampling of underrepresented Ukrainian characters. They then retrained DiffusionPen, a state-of-the-art latent diffusion model originally designed for Latin HTG, on the new dataset without any architectural modifications. DiffusionPen uses a MobileNetV2-based style encoder trained with a triplet loss to capture writer-specific traits, and a CANINE-conditioned U-Net for character-level text guidance.
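To make the triplet-loss idea concrete, here is a minimal sketch in plain Python. It is not the DiffusionPen implementation: the function names, the margin of 1.0, and the 2-D toy embeddings are all assumptions chosen for illustration. The loss is zero when a sample from the same writer already sits closer to the anchor than a sample from a different writer by at least the margin, and positive otherwise.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    anchor and positive are meant to come from the same writer,
    negative from a different writer. The margin is an assumed value.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D "style embeddings" (made-up values, not real encoder outputs).
anchor = [0.0, 0.0]
same_writer = [0.0, 1.0]    # distance 1 from the anchor
other_writer = [3.0, 4.0]   # distance 5 from the anchor

# Well-separated triplet: 1 - 5 + 1 is negative, so the loss clamps to 0.
loss_easy = triplet_loss(anchor, same_writer, other_writer)

# Swapped roles: 5 - 1 + 1 = 5, so the encoder would be penalized.
loss_hard = triplet_loss(anchor, other_writer, same_writer)
```

In a real style encoder the embeddings would come from the network and the loss would be averaged over many triplets per batch. The sketch only shows the quantity being minimized.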

The results are promising. The model generates legible, style-consistent word images in three cross-domain settings: cross-lingual transfer from English IAM samples, zero-shot imitation of an early 20th-century Ukrainian manuscript, and few-shot adaptation to contemporary writers. The work demonstrates that few-shot latent diffusion models can generalize beyond the Latin-script domain, and it offers a reproducible benchmark for writer-aware Cyrillic HTG. According to the paper on arXiv, the authors have released the dataset, trained models, and evaluation protocol, providing a foundation for extending stylized text generation to other underrepresented writing systems.

Key Points
  • Dataset: 126,177 images from 308 Ukrainian writers, with targeted oversampling for rare characters.
  • Model: DiffusionPen with MobileNetV2 triplet-loss encoder + CANINE-conditioned latent diffusion U-Net, no architecture changes.
  • Transfer: Achieves legible output in cross-lingual (English→Ukrainian), zero-shot (historical manuscript), and few-shot (contemporary) settings.
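The targeted oversampling mentioned in the Dataset point can be sketched as follows. This summary does not describe the authors' exact procedure, so the sketch is one plausible scheme: it duplicates words containing a rare character until that character reaches a minimum count. The function name, the `min_count` threshold, and the toy word list are hypothetical.

```python
from collections import Counter
from itertools import cycle


def oversample_rare_chars(words, rare_chars, min_count):
    """Return a copy of `words` with extra duplicates of words that contain
    rare characters, so each rare character occurs at least `min_count`
    times (when at least one word contains it).

    Hypothetical scheme for illustration, not the paper's procedure.
    """
    balanced = list(words)
    counts = Counter(ch for word in balanced for ch in word)
    for ch in rare_chars:
        carriers = [word for word in words if ch in word]
        if not carriers:
            continue  # no word contains this character, nothing to duplicate
        # Cycle through the carrier words until the target count is reached.
        for word in cycle(carriers):
            if counts[ch] >= min_count:
                break
            balanced.append(word)
            counts.update(word)  # count every character of the added word
    return balanced


# Toy vocabulary: "ґ" and "ї" are letters specific to the Ukrainian alphabet.
words = ["хата", "ґанок", "їжак", "мама"]
balanced = oversample_rare_chars(words, rare_chars="ґї", min_count=3)
char_counts = Counter("".join(balanced))
```

In a real pipeline, duplicating a word would typically mean sampling additional handwritten images of it, or of other words containing the same character, rather than repeating a string.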

Why It Matters

The work shows that diffusion-based HTG can work for low-resource scripts, which could enable digital restoration and personalized handwriting for Cyrillic languages.