Media & Culture

Grok for you

The multimodal model processes documents, diagrams, and photos, challenging GPT-4V and Claude 3.

Deep Dive

Elon Musk's xAI has released Grok 1.5 Vision, a significant multimodal upgrade to its Grok AI assistant. The model can now process and understand visual information, including uploaded images, documents, diagrams, and screenshots, bringing it into direct competition with leading multimodal models like OpenAI's GPT-4V and Anthropic's Claude 3. The release was announced on X, where users can access these capabilities directly within the platform.

On visual reasoning benchmarks, Grok 1.5 Vision performs strongly. It scored 65.2% on the challenging RealWorldQA benchmark, which tests understanding of real-world images, surpassing Claude 3 Opus's 60.7%. xAI also highlighted its proficiency with complex documents, demonstrating, for example, the conversion of a hand-drawn flowchart into working Python code. The model's context window remains at 128K tokens, matching the text-only Grok 1.5.
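To make the flowchart-to-code claim concrete: xAI's demo involved a sketched number-guessing game. The exact generated code is not reproduced here, but the output would be in the spirit of this sketch (the function name, signature, and structure below are illustrative assumptions, not xAI's actual output):

```python
import random

def guessing_game(guesses, secret=None):
    """Play out a list of guesses against a secret number, returning hints.

    A hypothetical reconstruction of the kind of code a model might
    produce from a hand-drawn guessing-game flowchart.
    """
    if secret is None:
        secret = random.randint(1, 100)  # pick a number if none supplied
    hints = []
    for guess in guesses:
        if guess < secret:
            hints.append("higher")   # guess too low
        elif guess > secret:
            hints.append("lower")    # guess too high
        else:
            hints.append("correct")  # found it, stop the loop
            break
    return hints
```

The point of the demo is not the code itself but that the model inferred the loop and branching logic from a rough drawing.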

The integration is currently rolling out to early testers and premium subscribers on X. This positions Grok as a more comprehensive AI tool, letting users ask questions about photos, analyze data in charts, or extract information from PDFs without leaving the social media app. It marks xAI's continued push to make advanced AI more accessible and embedded in daily digital workflows.

Key Points
  • Adds multimodal vision capabilities to understand images, docs, and screenshots.
  • Scored 65.2% on RealWorldQA benchmark, beating Claude 3 Opus (60.7%).
  • Integrated directly into X platform for early testers and premium subscribers.

Why It Matters

Turns Grok into a versatile visual assistant, enabling document analysis and image reasoning directly within social media.