Let the user narrate the frame
Speech can be nearly 3x faster than mobile typing in controlled text-entry tests. The useful handoff is not just the frame; it is the frame plus why it matters.
A screenshot answers one question: what is visible? It does not answer the user's intent. If the user is staring at a UI and says "this keeps failing" or "look at the number in the corner," the words are cheap, but only because the visual scene does the grounding.
That is the audio case for Noru: not a voice assistant bolted onto the product, and not always-on listening. A short mic transcript can act like the missing caption on a screen capture.
Speech is cheap context
Ruan et al. (2017) compared speech recognition with touch keyboard entry for short messages in English and Mandarin under controlled conditions. Speech was 2.93x faster than keyboard entry in English (153 vs. 52 WPM) and 2.87x faster in Mandarin (123 vs. 43 WPM).
The caveat matters: this was a lab study on short message transcription, not proof that every technical prompt is better dictated. But it supports the practical point. If the user only needs to add one sentence of intent to a screenshot, speaking it can be lower-friction than turning the capture into a carefully typed prompt.
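As a rough illustration of what those rates mean for a single sentence of intent, the sketch below converts the reported English figures into seconds. The 12-word note length is an assumption chosen for the example, and the result inherits the lab-study caveat: it describes transcription speed, not the time to compose a technical prompt.

```python
# English rates reported by Ruan et al., in words per minute.
SPEECH_WPM = 153
KEYBOARD_WPM = 52


def seconds_to_enter(word_count: int, wpm: float) -> float:
    """Time in seconds to enter `word_count` words at a steady `wpm` rate."""
    return word_count / wpm * 60


# Hypothetical one-sentence note; the 12-word length is an assumption.
note_words = 12
spoken = seconds_to_enter(note_words, SPEECH_WPM)
typed = seconds_to_enter(note_words, KEYBOARD_WPM)

print(f"spoken: {spoken:.1f} s, typed: {typed:.1f} s, ratio: {typed / spoken:.2f}x")
# spoken: 4.7 s, typed: 13.8 s, ratio: 2.94x
```

The ratio computed from the rounded WPM figures comes out at about 2.94x, slightly different from the 2.93x the paper reports.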
| Layer | What it contributes | What it misses |
|---|---|---|
| Screenshot | The visible state, layout, emphasis | Why the user captured it |
| Ordered text | Exact on-screen words | The user's current intent |
| Mic transcript | Intent, uncertainty, deictic cues | The pixels it points at |
The transcript is most useful when it is attached to the frame. Alone, "what's wrong with this?" is under-specified. Next to the screenshot, it is enough.
Speech points at things
HCI has studied this shape for years: people naturally use speech with context. Lee et al.'s GazePointAR work (2024) describes voice assistants that fail because they lack spatiotemporal context, then adds gaze, pointing, and conversation history so users can ask pronoun-heavy questions like "what's over there?" That is AR, not desktop software, so it should be used as an analogy, not a direct claim about Noru. The shared lesson is still useful: speech often depends on a visible referent.
On a Mac, the visible referent is usually the screen. The user says "this button," "that red line," "the thing it just printed." The capture supplies the referent; the transcript supplies the intent.
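The dependency between pointing words and a visible referent can be sketched as a simple check. This is a hypothetical heuristic for illustration only, not a description of how Noru works: the word list, `deictic_cues`, and `needs_frame` are all assumed names, and a real system would need more than keyword matching.

```python
import re

# Illustrative list of pointing words; an assumption, not an exhaustive lexicon.
DEICTIC_WORDS = {"this", "that", "these", "those", "here", "there", "it"}


def deictic_cues(transcript: str) -> list[str]:
    """Return the pointing words found in the transcript, in spoken order."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    return [t for t in tokens if t in DEICTIC_WORDS]


def needs_frame(transcript: str, has_screenshot: bool) -> bool:
    """True when the transcript points at something but no capture is attached."""
    return bool(deictic_cues(transcript)) and not has_screenshot


print(deictic_cues("the thing it just printed"))   # ['it']
print(needs_frame("that red line", False))         # True
print(needs_frame("that red line", True))          # False
```

With the capture attached, the same three words stop being ambiguous, which is the point of pairing the two.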
Models are moving toward the same packet
The model side is converging on this too. OpenAI describes GPT-4o as accepting combinations of text, audio, image, and video, trained across text, vision, and audio (GPT-4o System Card, 2024). The Qwen2-Audio Technical Report (Chu et al., 2024) describes a model that accepts audio signals, supports voice chat and audio analysis, and can respond to audio containing sounds, multi-speaker conversation, and a voice command.
That does not mean Noru should claim full audio perception today. The honest shipping surface is narrower: a screen capture with ordered text, plus mic speech as narration where available. System audio is a future surface and should stay named that way until it is implemented.
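A minimal sketch of that narrower surface as a data structure, assuming nothing about Noru's actual implementation: the `Handoff` class, its field names, and the prompt layout are all hypothetical. The transcript is optional to reflect "where available", and system audio is deliberately absent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Handoff:
    """Hypothetical container for one capture; field names are assumptions."""

    screenshot_png: bytes                  # the visible state
    ordered_text: list[str]                # on-screen words in reading order
    mic_transcript: Optional[str] = None   # user narration, where available

    def to_prompt_text(self) -> str:
        """Render the text parts; the image would travel alongside as an attachment."""
        parts = ["On-screen text:", *self.ordered_text]
        if self.mic_transcript:
            parts += ["User narration:", self.mic_transcript]
        return "\n".join(parts)


packet = Handoff(
    screenshot_png=b"",  # placeholder bytes; a real capture would go here
    ordered_text=["Build failed", "exit code 1"],
    mic_transcript="this keeps failing",
)
print(packet.to_prompt_text())
```

Keeping the narration in the same object as the frame means the transcript never travels without the pixels it refers to.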
The useful unit is not "a screenshot" or "a voice note." It is the screen, the readable text, and the user's short narration in one model-ready handoff.
Sources
- Comparing Speech and Keyboard Text Entry for Short Messages in Two Languages on Touchscreen Phones, Ruan et al., 2017
- GazePointAR: A Context-Aware Multimodal Voice Assistant for Pronoun Disambiguation in Wearable Augmented Reality, Lee et al., 2024
- GPT-4o System Card, OpenAI, 2024
- Qwen2-Audio Technical Report, Chu et al., 2024