Noru Flow Research

Featured · The case for vision

Does the model actually need to see it?

The finding
Take the image away and GPT-4o's chart accuracy collapses from 86% to 30%. For a visual question, the answer is in the picture.

More research

Audio and perception · Let the user narrate the frame
Speech can be nearly 3x faster than mobile typing in controlled text-entry tests. The useful handoff is not just the frame; it is the frame plus why it matters.
2026

Audio and perception · A transcript is not the same as hearing
A transcript is the right first audio layer, but native audio models exist because timing, speakers, and non-speech sounds can carry context text drops.
2026

Audio and perception · Perceptual context needs a boundary
The more perceptual the payload gets, the more scoped it has to be. Screen and audio capture are powerful precisely because they can expose the wrong thing.
2026

The formatting layer · Why the text layer earns its place
Image and text are complementary: add the on-screen text to a capped-resolution image and document accuracy jumps from 84.4% to 91.2%.
2026

Right-sizing · There's a right size for a screenshot
More detail helps the model read, up to a point. Reading accuracy climbs from 29% to 77% with resolution, but fine-OCR peaks and then declines.
2026

Structure · Label it, and put it in order
On real web tasks a text-only agent succeeds 7% of the time; add the screenshot and it doubles to 15%; label the elements and it climbs again.
2026