Speak / See
Why the future of the human-machine interface is already wired into our biology, and why we keep installing it backwards
I. The Typing Tax
A transformative morning in 2025 started in the rain.
Mind racing, ideas stacking faster than I could hold them — one hand gripped the leash while Winston, my husky-malamute, investigated a storm drain. Portland rain streaking the phone screen. No chance of typing anything.
So I fired up dictation and rambled for 30 seconds.
Those 30 seconds replaced minutes of thumb-typing. A 5× speed-up delivered by the oldest I/O protocol we own: human speech.
Yet almost every AI product still greets me with a blinking cursor, quietly billing me the difference.
II. The 10-Bit Funnel
Cognitive scientists call it the "conscious bandwidth wall."
Roughly one billion bits of sensory data arrive every second. About ten bits per second make it through to deliberate thought.
Speech comes in at roughly 39 bits per second, about four times the typing rate, because evolution spent a million years automating tongue, jaw, and breath while our forebrain stayed free to scan for leopards.
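Where the "four times" comes from: treat the ten-bit conscious lane above as the ceiling on deliberate, typed output (an assumption the paragraph implies rather than states) and the ratio is just two numbers divided:

```latex
\frac{R_{\text{speech}}}{R_{\text{typing}}} \approx \frac{39~\text{bits/s}}{10~\text{bits/s}} \approx 3.9 \approx 4
```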
Vision slips past the wall in parallel. Color, motion, depth, and pattern run down separate neural highways that merge into meaning long before the bottleneck.
Text does neither. It forces serial entry (one thumb-letter at a time) and serial exit (one word at a time), politely waiting in line for the same ten-bit cashier.
Every interface that couples "type-in / read-out" is a perfect waste of a billion-dollar cortex.
III. Preattentive Pop-Out and the Red-Dot Economy
Before you can form the word "alert," your visual system has already flagged the red pixel among gray ones.
Thirteen milliseconds—faster than an eye-blink—is all it takes to categorize "different."
No meeting required. No OKR. No Jira ticket.
Stripe's dashboard used to email you: "Payment dispute created."
Now it shows a single red dot on a graph. You see the dot, you know the story. The ten-bit lane stays clear for the actual work.
That is not a cosmetic upgrade. It is a payroll saving.
IV. Tufte's Sparkline and the Vanishing Paragraph
Edward Tufte spent two decades making the case, from The Visual Display of Quantitative Information in 1983 to the sparkline he finally named in 2006, that a word-sized graphic could replace a paragraph of prose without losing a single datum.
We still publish paragraphs.
Inside most SaaS apps the "weekly summary" arrives as five sentences you pretend to read while scrolling for the bold number.
Replace those sentences with a sparkline — one-inch trend, zero annotation — and comprehension time drops from 18 seconds to 0.8 seconds.
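The mechanics are almost embarrassingly small. A minimal sketch in TypeScript; the glyph ramp is the standard block set, and the weekly numbers are invented for illustration:

```ts
// A sparkline in its cheapest form: map each value onto eight block glyphs.
const BARS = ["▁", "▂", "▃", "▄", "▅", "▆", "▇", "█"];

function sparkline(values: number[]): string {
  const min = Math.min(...values);
  const span = Math.max(...values) - min || 1; // avoid divide-by-zero on flat data
  return values
    .map((v) => BARS[Math.round(((v - min) / span) * (BARS.length - 1))])
    .join("");
}

// "Signups, last 8 weeks" as one word-sized trend:
console.log(sparkline([120, 132, 101, 134, 90, 230, 210, 260])); // "▂▃▁▃▁▇▆█"
```

One run of glyphs, and the bold number can go back to being just a number.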
The user does not feel "informed." The user feels psychic.
Psychic scales.
V. Norman's Button and the Zero-Word Instruction
Don Norman's great insight: a well-designed object removes the need for instructions altogether.
A flat plate on a door says "push" at 0 bits per second. A graspable handle says "pull."
Every time your product ships with a tooltip reading "Click here to continue," you are admitting the affordance failed — and billing the user the difference in cognitive coin.
The best AI features of 2025 — Granola's meeting recaps, Cursor's tab-complete, Linear's cmd-k everything — win because they removed the sentence, not because they added more model.
VI. Real-Time Simulation, or the End of "What If"
The human brain learns by varying a parameter and watching the world tilt.
Text interrupts that loop: change, read, parse, imagine, decide. Each step a 10-bit toll.
Visual simulation collapses the stack: change, see.
Give a shopper a color picker that repaints the couch in real time, and they will try eighteen shades before committing to one.
The gain is not "faster browsing." The gain is fewer returns and the confidence to click Buy Now instead of Save for Later.
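A minimal sketch of that change-and-see loop, assuming a color input, a preview canvas, and a cheap multiply tint standing in for true re-texturing; the element ids and image path are invented:

```ts
// Every input event repaints immediately, so the user's motor action
// and the visual result fuse into one step.
const picker = document.querySelector<HTMLInputElement>("#couch-color")!;
const canvas = document.querySelector<HTMLCanvasElement>("#couch-preview")!;
const ctx = canvas.getContext("2d")!;

const base = new Image();
base.src = "/img/couch-gray.png"; // hypothetical base photo

function repaint(hex: string): void {
  ctx.globalCompositeOperation = "source-over";
  ctx.drawImage(base, 0, 0, canvas.width, canvas.height);
  ctx.globalCompositeOperation = "multiply"; // cheap tint, not true re-texturing
  ctx.fillStyle = hex;
  ctx.fillRect(0, 0, canvas.width, canvas.height);
}

// No "Apply" button: the loop is change, see, nothing in between.
picker.addEventListener("input", () => repaint(picker.value));
base.onload = () => repaint(picker.value);
```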
Every AI wrapper that answers with a paragraph instead of a moving picture is charging compound interest against its own roadmap.
VII. The Ambiguity Dividend
Speech is sloppy — homophones, ums, half-finished clauses — yet we tolerate the mess because repair is free.
"Set alarm six... no, seven."
Typing forces precision up-front. Speech lets precision emerge.
Product teams keep trying to "fix" speech by making it as exact as typing. The winning products do the opposite: they keep the slush, surface the top three interpretations as chips, and let one tap finish the sentence.
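A sketch of that pattern, assuming any recognizer that exposes an n-best list; the Hypothesis shape and the onConfirm callback are invented for illustration:

```ts
interface Hypothesis {
  text: string;       // candidate transcript
  confidence: number; // engine score, 0..1
}

// Surface the top three guesses as tappable chips instead of forcing
// the recognizer to commit to one transcript.
function renderChips(
  nBest: Hypothesis[],
  onConfirm: (text: string) => void,
  container: HTMLElement,
): void {
  container.replaceChildren(); // clear the previous guesses
  const top3 = [...nBest].sort((a, b) => b.confidence - a.confidence).slice(0, 3);
  for (const h of top3) {
    const chip = document.createElement("button");
    chip.className = "chip";
    chip.textContent = h.text;
    chip.addEventListener("click", () => onConfirm(h.text)); // one tap finishes the sentence
    container.append(chip);
  }
}

// "Set alarm six... no, seven" might yield:
// ["set alarm for 7:00", "set alarm for 6:07", "set alarm for 6:00"]
```

The design choice is the absence of a rejection path: a wrong guess costs one glance, the right one costs one tap.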
Ambiguity becomes throughput instead of error.
VIII. Boundary Lines
There are places where Speak / See fails.
Passwords. Libraries. Planes. Trading floors. Bedrooms at 2 a.m.
Text input still owns the narrow band where privacy, precision, or silence outweigh speed. Voice output still owns the windshield and the running trail.
Everywhere else, the hierarchy is merciless: voice in, visuals out.
Ignore it and you are not "offering options." You are taxing neurons your competitor has already freed.
IX. Metrics That Notice the Wall
Conscious-bandwidth KPIs feel weird in a quarterly deck, but they are the only numbers that move with the physics instead of against it. Four are worth instrumenting; a sketch in code follows the list.
Time-to-insight: seconds from user intent to correct action.
Cognitive offload: percent of tasks completed without scrolling or re-reading.
Iteration velocity: parameter tweaks per minute during exploration.
Error-recovery latency: milliseconds from speech ambiguity to user-confirmed resolution.
When these four lines bend, NPS and ARR follow like obedient dogs.
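For concreteness, here is a rough sketch of how those four might fall out of a flat event log. The event names and fields are assumptions, not a spec; map them onto whatever your analytics pipeline actually emits.

```ts
type EventType =
  | "intent" | "action"         // time-to-insight
  | "scroll" | "reread"         // cognitive offload
  | "tweak"                     // iteration velocity
  | "ambiguity" | "resolution"; // error-recovery latency

interface LogEvent {
  type: EventType;
  taskId: string;
  t: number; // epoch ms
}

const first = (log: LogEvent[], taskId: string, type: EventType) =>
  log.find((e) => e.taskId === taskId && e.type === type);

// Seconds from stated intent to the correct action.
function timeToInsight(log: LogEvent[], taskId: string): number {
  return (first(log, taskId, "action")!.t - first(log, taskId, "intent")!.t) / 1000;
}

// Percent of tasks completed without scrolling or re-reading.
function cognitiveOffload(log: LogEvent[], taskIds: string[]): number {
  const clean = taskIds.filter(
    (id) => !log.some((e) => e.taskId === id && (e.type === "scroll" || e.type === "reread")),
  );
  return (100 * clean.length) / taskIds.length;
}

// Parameter tweaks per minute during an exploration session.
function iterationVelocity(log: LogEvent[], sessionMs: number): number {
  return log.filter((e) => e.type === "tweak").length / (sessionMs / 60_000);
}

// Milliseconds from speech ambiguity to user-confirmed resolution.
function errorRecoveryLatency(log: LogEvent[], taskId: string): number {
  return first(log, taskId, "resolution")!.t - first(log, taskId, "ambiguity")!.t;
}
```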
X. The Asymmetry Is Not Coming — It Is Waiting
Evolution did not build the human brain to please product managers.
It built a creature that could whisper a warning or spot a red berry in bush-shadow faster than a predator could pounce.
Every AI startup that greets that creature with a blank text box is fighting an uphill battle against a billion years of optimization.
Some companies have already noticed. Their products feel faster than they are. Their users feel smarter than the features should allow. The cognitive tax is falling somewhere else.
You can debate whether speak/see is the future. Or you can watch your competitors stop charging the tax and wonder, later, when you lost.