Qwen3.5-4B handwriting recognition is really good
A 4-billion-parameter model running locally on an RTX 3070 laptop accurately transcribed a complex handwritten diagram.
A viral demonstration has shown that Alibaba's relatively small Qwen3.5-4B language model has surprisingly robust optical character recognition (OCR) capabilities, particularly for handwritten text. A user tested the 'Qwen3.5-4B-UD-Q4_K_XL' quantized version of the model, running it locally on a laptop with an RTX 3070 GPU via the llama.cpp inference engine. The task was to transcribe a complex handwritten diagram detailing a knowledge management system, complete with interconnected loops, arrows, color-coded sections, and mixed print and cursive writing.
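For readers who want to attempt something similar, here is a minimal sketch using llama.cpp's multimodal CLI (llama-mtmd-cli). The model, projector, and image filenames are assumptions, not details from the demonstration, and the exact tool name can differ across llama.cpp versions:

```python
# A sketch of running a quantized vision-language model locally with
# llama.cpp's multimodal CLI. Filenames below are hypothetical; the GGUF
# weights and the matching mmproj (vision projector) file would come from
# the model's published repository.
import subprocess

result = subprocess.run(
    [
        "llama-mtmd-cli",
        "-m", "Qwen3.5-4B-UD-Q4_K_XL.gguf",    # quantized weights (assumed filename)
        "--mmproj", "mmproj-Qwen3.5-4B.gguf",  # vision projector (assumed filename)
        "--image", "diagram.jpg",              # the handwritten diagram to transcribe
        "-p", "Transcribe all handwritten text in this image, "
              "preserving layout and noting ink colors.",
        "-ngl", "99",                          # offload all layers to the GPU
    ],
    capture_output=True,
    text=True,
)
print(result.stdout)
```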
The model transcribed the entire diagram with high accuracy, identifying text from the 'Top Right Corner' down to flow arrows and specific elements like 'Summarize (written in red ink).' It processed the image and generated the detailed text output in 2 minutes and 25 seconds of 'thinking time,' consuming 6,795 tokens at a generation speed of 46 tokens per second. This performance marks a significant step in making advanced multimodal AI (models that understand both images and text) accessible and practical. It demonstrates that a 4-billion-parameter model, properly quantized, can execute sophisticated vision-language tasks on consumer-grade hardware without cloud API calls to much larger models like GPT-4V or Claude 3.5 Sonnet.
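The reported figures are internally consistent, assuming the 2 minutes 25 seconds covers generation of all 6,795 tokens:

```python
# Sanity-check the reported throughput: 6,795 tokens over 2 min 25 s.
tokens = 6795
seconds = 2 * 60 + 25                      # 145 s of "thinking time"
print(f"{tokens / seconds:.1f} tokens/s")  # ~46.9, matching the reported 46 tok/s
```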
- Alibaba's 4-billion parameter Qwen3.5-4B model accurately transcribed a complex handwritten diagram with colored text and arrows.
- The test ran locally on a laptop RTX 3070 GPU using llama.cpp, achieving 46 tokens/second generation speed.
- The model's strong OCR performance on consumer hardware makes advanced multimodal AI more accessible and private.
Why It Matters
Strong local OCR enables accurate, offline document digitization and diagram understanding, reducing reliance on cloud APIs for sensitive data.