Enterprise & Industry

I used Gemini Nano Banana 2 to create sketchnotes - here's what it got right (and hilariously wrong)

The AI nails hand-drawn aesthetics but hilariously mixes Roman/Arabic numerals and duplicates content.

Deep Dive

ZDNET's David Gewirtz put Google's Gemini Nano Banana 2 image model to the test by having it generate sketchnotes, a popular visual note-taking style combining informal sketches with formal data representation. Using a $20/month Google AI Pro subscription, Gewirtz prompted the model to create a sketchnote of the US Bill of Rights. The initial results were visually impressive, featuring a perfect hand-drawn aesthetic with pastel highlighter-like colors and an appropriate font. However, the content was flawed: the AI duplicated the Fifth Amendment and switched between Arabic and Roman numerals after the fifth entry.

Subsequent, more detailed prompts specified layout and numbering to correct these issues, but Nano Banana 2 continued to struggle with textual accuracy. It produced outputs with rights out of order and inconsistent numbering (like placing a '7' next to 'Article three'), and it did not consistently use the requested Arabic numerals. The experiment highlights a persistent gap in multimodal AI: while image generation for style and composition has advanced significantly, reliable text generation within those images remains a notable weakness. For professionals, this means AI-generated visual summaries can be a great starting point but require careful human review and iterative prompting to achieve accuracy.

Key Points
  • Gemini Nano Banana 2 excelled at visual style, generating perfect hand-drawn aesthetics and pastel color palettes for sketchnotes.
  • The model repeatedly stumbled on text, mixing Roman and Arabic numerals and duplicating content like the Fifth Amendment.
  • Iterative prompting improved results, but textual accuracy within AI-generated graphics remains a significant challenge.

Why It Matters

Shows the current limits of multimodal AI for professional content creation, where visual style is ahead of reliable text generation.