← Research · The case for vision

Does the model actually need to see it?

Take the image away and GPT-4o's chart accuracy collapses from 86% to 30%. For a visual question, the answer is in the picture.

It's tempting to think a good enough description can stand in for the screenshot. Paste the error, summarize the chart, transcribe the table, then let the model reason over text. The published evidence points the other way: for anything visual, the image carries information that words reconstruct poorly, if at all.

The clearest demonstration: take the image away

He et al. measured exactly this. They ran GPT-4o on ChartQA twice: once normally, and once as a "blind" baseline where the model answers chart questions without seeing the chart, guessing from priors alone¹¹Distill Visual Chart Reasoning Ability from LLMs to MLLMs (ReachQA) He et al., 2024. The difference is large:

GPT-4o on ChartQA	Accuracy
Guessing without the image	30.0%
Shown the actual chart	85.7%

That is a 56-point swing from a single change: letting the model see. The blind score isn't zero, since some answer formats are guessable, which is why the authors built a harder benchmark. The conclusion holds either way: when the answer is in the pixels, textual framing does not recover it.

The study names GPT-4o because that is the model it tested, but the model generation isn't the point. This is a fact about inputs, not capability: a newer, smarter model still cannot reason about a chart it was only told about. The detail has to reach the model before any amount of intelligence can use it.

This holds at the frontier, too. A 2025 evaluation ran GPT-5 on the hardest chart questions in the CHART-6 benchmark and found two things at once: the newer model was 20 to 40 points more accurate than GPT-4o, and replacing the question with a generated text description of the chart gave no reliable gain over simply asking over the image, and for GPT-5 sometimes hurt⁴⁴GPT-5 Model Corrected GPT-4V's Chart Reading Errors, Not Prompting Yang & Chen, 2025. Better models raise the ceiling; a verbal stand-in for looking still does not.

Why a screenshot beats a transcript

A screenshot preserves what prose throws away: layout, spatial relationships, emphasis, the chart's actual shape, which value sits next to which label. Shpigel Nacson et al., building DocVLM, found that OCR text in the prompt "partially addresses" document understanding but "underperforms compared to full-resolution counterpart, as it lacks the complete visual context needed for optimal performance."²²DocVLM: Make Your VLM an Efficient Reader Shpigel Nacson et al. (AWS AI Labs), 2024 In their measurements, giving the model the image lifts document QA well above a text-token-limited setup: DocVQA accuracy climbs from 56.0% to 86.6% on InternVL2 and from 84.4% to 91.2% on Qwen2-VL once the visual reading is restored.

The point isn't "vision instead of text." It's that the image carries information the words leave out. Describing your screen to a model discards detail the model could otherwise read straight off the pixels.

It's not just charts: fine visual perception is still hard

Even frontier models stumble on basic visual perception when they can't look closely. On BLINK, a suite of perception tasks humans solve almost trivially (~96%), the best models hover near 50%, barely above chance³³BLINK: Multimodal Large Language Models Can See but Not Perceive Fu et al., 2024. The implication for a tool like Noru is direct: give the model the real frame instead of a vague description.

The honest boundary

Vision isn't a free win on every input. On dense, text-only documents, a clean text layer can match or beat pure vision, which is why the next piece in this series is about the text layer, not against it. The rule is simple: send the image and its text, because each covers what the other misses.

When the information is visual (a chart, a UI, an error in context), the screenshot is the highest-fidelity thing you can hand a model. Noru's job is to make sending it effortless.

Sources

Distill Visual Chart Reasoning Ability from LLMs to MLLMs (ReachQA), He et al., 2024
DocVLM: Make Your VLM an Efficient Reader, Shpigel Nacson et al. (AWS AI Labs), 2024
BLINK: Multimodal Large Language Models Can See but Not Perceive, Fu et al., 2024
GPT-5 Model Corrected GPT-4V's Chart Reading Errors, Not Prompting, Yang & Chen, 2025