The finding
Take the image away and GPT-4o's chart accuracy collapses from 86% to 30%. For a visual question, the answer is in the picture.
More research
- Audio and perception: Let the user narrate the frame. Speech can be nearly 3x faster than mobile typing in controlled text-entry tests. The useful handoff is not just the frame; it is the frame plus why it matters. (2026)
- Audio and perception: A transcript is not the same as hearing. A transcript is the right first audio layer, but native audio models exist because timing, speakers, and non-speech sounds can carry context text drops. (2026)
- Audio and perception: Perceptual context needs a boundary. The more perceptual the payload gets, the more scoped it has to be. Screen and audio capture are powerful precisely because they can expose the wrong thing. (2026)
- The formatting layer: Why the text layer earns its place. Image and text are complementary: add the on-screen text to a capped-resolution image and document accuracy jumps from 84.4% to 91.2%. (2026)
- Right-sizing: There's a right size for a screenshot. More detail helps the model read, up to a point. Reading accuracy climbs from 29% to 77% with resolution, but fine-OCR peaks and then declines. (2026)
- Structure: Label it, and put it in order. On real web tasks a text-only agent succeeds 7% of the time; add the screenshot and it doubles to 15%; label the elements and it climbs again. (2026)