← Research · The case for vision
Does the model actually need to see it?
Take the image away and GPT-4o's chart accuracy collapses from 86% to 30%. For a visual question, the answer is in the picture.
It's tempting to think a good enough description can stand in for the screenshot. Paste the error, summarize the chart, transcribe the table, then let the model reason over text. The published evidence points the other way: for anything visual, the image carries information that words reconstruct poorly, if at all.
The clearest demonstration: take the image away
He et al. measured exactly this. They ran GPT-4o on ChartQA twice: once normally, and once as a "blind" baseline where the model answers chart questions without seeing the chart, guessing from priors alone11Distill Visual Chart Reasoning Ability from LLMs to MLLMs (ReachQA) . The difference is large:
| GPT-4o on ChartQA | Accuracy |
|---|---|
| Guessing without the image | 30.0% |
| Shown the actual chart | 85.7% |
That is a 56-point swing from a single change: letting the model see. The blind score isn't zero, since some answer formats are guessable, which is why the authors built a harder benchmark. The conclusion holds either way: when the answer is in the pixels, textual framing does not recover it.
The study names GPT-4o because that is the model it tested, but the model generation isn't the point. This is a fact about inputs, not capability: a newer, smarter model still cannot reason about a chart it was only told about. The detail has to reach the model before any amount of intelligence can use it.
This holds at the frontier, too. A 2025 evaluation ran GPT-5 on the hardest chart questions in the CHART-6 benchmark and found two things at once: the newer model was 20 to 40 points more accurate than GPT-4o, and replacing the question with a generated text description of the chart gave no reliable gain over simply asking over the image, and for GPT-5 sometimes hurt44GPT-5 Model Corrected GPT-4V's Chart Reading Errors, Not Prompting . Better models raise the ceiling; a verbal stand-in for looking still does not.
Why a screenshot beats a transcript
A screenshot preserves what prose throws away: layout, spatial relationships, emphasis, the chart's actual shape, which value sits next to which label. Shpigel Nacson et al., building DocVLM, found that OCR text in the prompt "partially addresses" document understanding but "underperforms compared to full-resolution counterpart, as it lacks the complete visual context needed for optimal performance."22DocVLM: Make Your VLM an Efficient Reader In their measurements, giving the model the image lifts document QA well above a text-token-limited setup: DocVQA accuracy climbs from 56.0% to 86.6% on InternVL2 and from 84.4% to 91.2% on Qwen2-VL once the visual reading is restored.
The point isn't "vision instead of text." It's that the image carries information the words leave out. Describing your screen to a model discards detail the model could otherwise read straight off the pixels.
It's not just charts: fine visual perception is still hard
Even frontier models stumble on basic visual perception when they can't look closely. On BLINK, a suite of perception tasks humans solve almost trivially (~96%), the best models hover near 50%, barely above chance33BLINK: Multimodal Large Language Models Can See but Not Perceive . The implication for a tool like Noru is direct: give the model the real frame instead of a vague description.
The honest boundary
Vision isn't a free win on every input. On dense, text-only documents, a clean text layer can match or beat pure vision, which is why the next piece in this series is about the text layer, not against it. The rule is simple: send the image and its text, because each covers what the other misses.
When the information is visual (a chart, a UI, an error in context), the screenshot is the highest-fidelity thing you can hand a model. Noru's job is to make sending it effortless.
Sources
- Distill Visual Chart Reasoning Ability from LLMs to MLLMs (ReachQA), He et al., 2024
- DocVLM: Make Your VLM an Efficient Reader, Shpigel Nacson et al. (AWS AI Labs), 2024
- BLINK: Multimodal Large Language Models Can See but Not Perceive, Fu et al., 2024
- GPT-5 Model Corrected GPT-4V's Chart Reading Errors, Not Prompting, Yang & Chen, 2025