Ink-2
The fastest and most accurate speech-to-text model
Ranked #1 on accuracy, built for voice agents with semantic endpointing and industry-leading latency.
Built for Voice Agents
Four capabilities that make Ink the transcription layer production agents rely on.
Heard right the first time.
Accuracy
In practice
In a voice agent, the transcript is the foundation everything else builds on. A transcription error corrupts the LLM's input and sends the interaction in the wrong direction.
Ink-2's approach
Ink has the lowest Word Error Rate (WER) of any streaming STT model, natively handling structured data — phone numbers, dates, emails, currencies, and UUIDs.
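For readers unfamiliar with the metric, WER is conventionally the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below is a generic illustration of that definition, not Ink's evaluation harness; the function name, the lowercase-and-split normalization, and the example sentences are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Normalization here is deliberately naive (lowercase + whitespace split);
    real benchmarks usually apply a more careful text normalizer.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one wrong digit in a ten-word phone-number utterance.
reference = "call me at five five five one two three four"
hypothesis = "call me at five five nine one two three four"
wer = word_error_rate(reference, hypothesis)
```

A single substituted digit is only one error out of ten words, yet it makes the whole phone number unusable, which is why structured data gets called out separately from the headline WER figure.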
Knows when you start and finish.
Conversational flow
In practice
A conversation has two critical moments: when a caller starts talking and when they finish. Miss the start and the agent misses the turn entirely. Call the end too early and the agent jumps in mid-thought.
Ink-2's approach
Ink-2 has native turn detection — turn.start and turn.end signaled directly by the model. Semantic endpointing determines turn end by meaning, not silence — so pauses mid-thought don't trigger the agent prematurely.
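As a rough illustration of how an agent might consume these signals, the sketch below tracks turn state from a stream of events. Only the names turn.start and turn.end come from the description above; the dictionary message shape, the "type" and "text" fields, the "transcript" event, and the TurnTracker class are all hypothetical and are not taken from the Ink-2 API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TurnTracker:
    """Collects transcript text between turn.start and turn.end events.

    The event dictionaries handled here use an assumed shape, not the
    actual Ink-2 wire format.
    """
    in_turn: bool = False
    partial: List[str] = field(default_factory=list)
    completed: List[str] = field(default_factory=list)

    def handle(self, event: dict) -> None:
        kind = event.get("type")
        if kind == "turn.start":
            # Caller started speaking: an agent would stop its own audio here.
            self.in_turn = True
            self.partial = []
        elif kind == "transcript" and self.in_turn:
            self.partial.append(event.get("text", ""))
        elif kind == "turn.end" and self.in_turn:
            # The model judged the turn complete: hand the text to the LLM.
            self.completed.append(" ".join(self.partial).strip())
            self.in_turn = False
            self.partial = []


# Hypothetical event stream: the caller pauses mid-thought, but no turn.end
# arrives until the utterance is actually finished.
events = [
    {"type": "turn.start"},
    {"type": "transcript", "text": "my number is"},
    {"type": "transcript", "text": "555 0100"},
    {"type": "turn.end"},
]
tracker = TurnTracker()
for event in events:
    tracker.handle(event)
```

Because the end of the turn arrives as an explicit event, the agent in this sketch never needs a silence timer of its own; it simply waits for turn.end before responding.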
The caller stops talking. The agent starts thinking.
Speed — 88ms
In practice
When transcription is fast and consistent, the agent's response feels immediate. One slow transcript in ten means one call in ten where that immediacy breaks.
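One way to make "fast and consistent" measurable is to look at tail percentiles rather than the median alone. The sketch below uses a nearest-rank percentile over made-up latency samples; the numbers are hypothetical and are not Ink-2 measurements.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical time-to-final-transcript samples in milliseconds:
# nine quick responses and one slow outlier.
latencies_ms = [90, 88, 92, 85, 91, 89, 87, 93, 90, 450]

p50 = percentile(latencies_ms, 50)  # 90
p95 = percentile(latencies_ms, 95)  # 450
```

In this made-up sample the median looks healthy at 90 ms, while the 95th percentile exposes the one call in ten that would feel slow to the caller.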
Ink-2's approach
Ink is the fastest streaming ASR model, running on a custom inference engine purpose-built for real-time conversation. Time to final transcript is 0.1s.
Quality that doesn't cost more as you grow.
Cost efficiency
In practice
Voice is the most natural interface for communication. Getting cost and quality right at scale is what lets voice become the default interface for every agentic interaction.
Ink-2's approach
Ink's State Space Model architecture delivers 10–100x the throughput of transformers — lower compute cost at scale, with no quality tradeoffs.