Let the user narrate the frame

Speech can be nearly 3x faster than mobile typing in controlled text-entry tests. The useful handoff is not just the frame; it is the frame plus why it matters.

A screenshot answers one question: what is visible? It does not convey the user's intent. If the user is staring at a UI and says "this keeps failing" or "look at the number in the corner," those few words suffice only because the visual scene grounds them.

That is the audio case for Noru: not a voice assistant bolted onto the product, and not always-on listening. A short mic transcript can act like the missing caption on a screen capture.

Speech is cheap context

Ruan et al. compared speech recognition with touch keyboard entry for short messages in English and Mandarin under controlled conditions. Speech was 2.93x faster than keyboard entry in English (153 vs. 52 WPM) and 2.87x faster in Mandarin (123 vs. 43 WPM).¹

The caveat matters: this was a lab study on short message transcription, not proof that every technical prompt is better dictated. But it supports the practical point. If the user only needs to add one sentence of intent to a screenshot, speaking it can be lower-friction than turning the capture into a carefully typed prompt.
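To make "one sentence of intent" concrete, here is a rough back-of-the-envelope sketch using the English rates reported above. The 12-word note length is an assumption chosen for illustration, and the calculation ignores mic start-up time and any extra correction effort that technical vocabulary might add outside the lab.

```python
# English entry rates reported by Ruan et al. (words per minute).
WPM = {"speech": 153, "keyboard": 52}

# Assumed length of a one-sentence note; not a figure from the study.
NOTE_WORDS = 12


def seconds_to_enter(words: int, wpm: float) -> float:
    """Time in seconds to enter `words` words at a steady `wpm` rate."""
    return words / wpm * 60


for method, rate in WPM.items():
    print(f"{method}: {seconds_to_enter(NOTE_WORDS, rate):.1f} s")
# speech: 4.7 s
# keyboard: 13.8 s
```

Under those assumptions the gap is roughly nine seconds per note: small in absolute terms, but it is paid on every capture.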

Layer	What it contributes	What it misses
Screenshot	The visible state, layout, emphasis	Why the user captured it
Ordered text	Exact on-screen words	The user's current intent
Mic transcript	Intent, uncertainty, deictic cues	The pixels it points at

The transcript is most useful when it is attached to the frame. Alone, "what's wrong with this?" is under-specified. Next to the screenshot, it is enough.

Speech points at things

HCI has studied this pattern for years: people naturally anchor speech in context. Lee et al.'s GazePointAR work describes voice assistants that fail because they lack spatiotemporal context, then adds gaze, pointing, and conversation history so users can ask pronoun-heavy questions like "what's over there?"² That is AR, not desktop software, so it serves as an analogy, not a direct claim about Noru. The shared lesson is still useful: speech often depends on a visible referent.

On a Mac, the visible referent is usually the screen. The user says "this button," "that red line," "the thing it just printed." The capture supplies the referent; the transcript supplies the intent.

Models are moving toward the same packet

The model side is converging on this too. OpenAI describes GPT-4o as accepting combinations of text, audio, image, and video, trained across text, vision, and audio.³ Qwen2-Audio describes a model that accepts audio signals, supports voice chat and audio analysis, and can respond to audio containing sounds, multi-speaker conversation, and a voice command.⁴

That does not mean Noru should claim full audio perception today. The honest shipping surface is narrower: a screen capture with ordered text, plus mic speech as narration where available. System audio is a future surface and should stay named that way until it is implemented.

The useful unit is not "a screenshot" or "a voice note." It is the screen, the readable text, and the user's short narration in one model-ready handoff.
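One way to picture that handoff is as a single record with three layers. The sketch below is a minimal illustration, not Noru's actual schema: the `CaptureHandoff` class, its field names, and the generic "parts" layout are all hypothetical, and a real model client would need its own request format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CaptureHandoff:
    """Hypothetical container for one capture (illustrative only)."""

    screenshot_png: bytes            # the visible state
    ordered_text: List[str]          # on-screen words, in reading order
    narration: Optional[str] = None  # short mic transcript, where available

    def to_parts(self) -> List[Dict[str, object]]:
        """Flatten the layers into an ordered list of generic parts."""
        parts: List[Dict[str, object]] = [
            {"kind": "image", "data": self.screenshot_png}
        ]
        if self.ordered_text:
            parts.append(
                {"kind": "screen_text", "text": "\n".join(self.ordered_text)}
            )
        # Narration is optional: skip it when missing or blank.
        if self.narration and self.narration.strip():
            parts.append({"kind": "narration", "text": self.narration.strip()})
        return parts


handoff = CaptureHandoff(
    screenshot_png=b"\x89PNG...",  # placeholder bytes, not a real image
    ordered_text=["Build failed", "exit code 1"],
    narration="this keeps failing",
)
print([part["kind"] for part in handoff.to_parts()])
# ['image', 'screen_text', 'narration']
```

Keeping narration optional mirrors the narrower shipping surface described above: the capture and ordered text are always present, and speech is attached only when the user provides it.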

Sources

1. Comparing Speech and Keyboard Text Entry for Short Messages in Two Languages on Touchscreen Phones, Ruan et al., 2017
2. GazePointAR: A Context-Aware Multimodal Voice Assistant for Pronoun Disambiguation in Wearable Augmented Reality, Lee et al., 2024
3. GPT-4o System Card, OpenAI, 2024
4. Qwen2-Audio Technical Report, Chu et al., 2024