Falcon-OCR and Falcon-Perception
The new open-source multimodal models read text from images and reason about visual scenes, extending the Falcon family beyond language-only tasks.
The Technology Innovation Institute (TII) in Abu Dhabi has expanded its influential Falcon family of open-source AI models with two new multimodal entrants: Falcon-OCR and Falcon-Perception. These models mark a significant step in bringing advanced vision-and-language capabilities to the open-source community, directly challenging proprietary offerings. Falcon-OCR is fine-tuned specifically for Optical Character Recognition (OCR), enabling it to accurately read and interpret text across a wide variety of images, from documents to real-world scenes. Falcon-Perception, meanwhile, is a more general-purpose Vision Language Model (VLM) designed to understand and reason about visual content, answering questions about images and describing scenes.
Both models are built on TII's established Falcon architecture, known for its strong performance on language tasks, and are now available for download and experimentation on the Hugging Face platform. In a move that underscores the collaborative nature of the open-source AI ecosystem, community developers have already submitted a pull request to integrate the new models into the popular llama.cpp project. That integration would let users run the vision-capable Falcons efficiently on consumer hardware, on CPU or with GPU acceleration, for local, private inference, significantly lowering the barrier to entry for advanced multimodal AI.
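For developers who want to experiment once the weights are downloaded, a natural starting point is the Hugging Face transformers image-text-to-text pipeline. The sketch below is illustrative only: the model ID, image URL, and prompt are placeholders, and the exact processor and chat format will depend on what TII publishes in the model cards.

```python
# Minimal sketch of prompting a Falcon multimodal checkpoint with the
# Hugging Face transformers "image-text-to-text" pipeline.
# NOTE: "tiiuae/falcon-ocr" is a placeholder model ID, not a confirmed
# repository name -- check TII's Hugging Face organisation for the real one.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="tiiuae/falcon-ocr")

# Chat-style request: one image plus a text instruction. This assumes the
# model's chat template follows the standard multimodal message format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/scanned_invoice.png"},
            {"type": "text", "text": "Extract all text visible in this document."},
        ],
    }
]

outputs = pipe(text=messages, max_new_tokens=512, return_full_text=False)
print(outputs[0]["generated_text"])
```

The same pattern applies to Falcon-Perception; only the model ID and the instruction (for example, "Describe this scene") would change.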
- TII UAE released two open-source multimodal models: Falcon-OCR for text extraction from images and Falcon-Perception for general visual understanding.
- The models are built on the proven Falcon architecture and are immediately available for download on Hugging Face.
- Community developers are working to add support to llama.cpp, enabling efficient local inference on standard hardware.
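Once the llama.cpp support lands and GGUF conversions of the models appear, local inference could look roughly like the following. This is a speculative sketch using the llama-cpp-python bindings: the file names are placeholders, and the LLaVA-style chat handler is an existing class used purely as a stand-in for whatever handler the merged Falcon integration ends up requiring.

```python
# Hypothetical local-inference sketch with the llama-cpp-python bindings.
# ASSUMPTIONS: the pending llama.cpp PR is merged, GGUF conversions of the
# Falcon multimodal weights exist, and a compatible chat handler is available.
# Llava15ChatHandler is used here only as a stand-in.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="falcon-perception-mmproj.gguf")
llm = Llama(
    model_path="falcon-perception-q4_k_m.gguf",  # placeholder file name
    chat_handler=chat_handler,
    n_ctx=4096,        # room for image tokens plus the answer
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                # Local images can also be passed as base64 data URIs.
                {"type": "image_url", "image_url": {"url": "https://example.com/street_scene.jpg"}},
                {"type": "text", "text": "Describe what is happening in this scene."},
            ],
        }
    ],
)
print(response["choices"][0]["message"]["content"])
```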
Why It Matters
This democratizes advanced multimodal AI, giving developers and researchers powerful, free alternatives to closed-source vision models from major labs.