A · The cascade and its structural limits
The default way to build a voice agent is a pipeline of independent models. It works, it is easy to reason about, and it is the thing we want to beat. Before optimizing anything, we diagram it and measure where the milliseconds go.
Why the cascade has an irreducible latency floor
1 · Serial stages compound
VAD must decide the turn ended, then STT runs, then the LLM produces its first token, then TTS produces its first byte. Each waits for the previous. The floor is the sum, not the max.
2 · Text is a lossy bottleneck
STT collapses rich audio into characters. Prosody, affect, emphasis, laughter, hesitation, exact timing — all gone before the LLM ever sees the turn. TTS then re-hallucinates prosody from scratch.
3 · Rigid turn-taking
A VAD/turn gate forces strict alternation. No barge-in (interrupting the agent), no backchannels ("mm-hm" while you talk). Real conversation is overlapping; the gate forbids it.
Removing the text bottleneck and the turn gate is exactly what a speech-native, full-duplex model does. Lab 5a's job is to make the floor a number you have measured, so the win in §C is concrete.
Build the classic pipeline and turn on metrics so you can read the additive latency floor you are trying to beat.
Pipecat's processor/service APIs move fast. Pin versions and verify against docs.pipecat.ai/pipecat/get-started/quickstart before wiring the pipeline — class names and imports below are illustrative.
1 · Install
# one venv; the extras pull VAD, transports, providers, smart-turn
pip install "pipecat-ai[silero,webrtc,openai,deepgram,cartesia,runner,local-smart-turn-v3]" \
    pipecat-ai-small-webrtc-prebuilt python-dotenv loguru
2 · Build the pipeline (transport.input → STT → LLM → TTS → transport.output)
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.task import PipelineTask, PipelineParams
from pipecat.audio.vad.silero import SileroVADAnalyzer

# services: DeepgramSTTService, OpenAILLMService, CartesiaTTSService, + a smart-turn analyzer

task = PipelineTask(
    Pipeline([
        transport.input(),   # Silero VAD + smart-turn-v3 on the transport
        stt,
        llm,
        tts,
        transport.output(),
    ]),
    params=PipelineParams(
        enable_metrics=True,        # ← the whole point of this lab
        enable_usage_metrics=True,
    ),
)
3 · Run, talk, read the metrics
# Pipecat logs TTFB / processing time per service when metrics are on.
# Sum: turn-detection wait + STT + LLM TTFT + TTS TTFB = your floor.
Expected output: per-stage TTFB lines in the log; add them up for a single end-to-end "mouth-closes → voice-starts" number (typically ~600–900 ms on commodity cloud services).
- The cascade runs and you can hold a (turn-based) conversation through the browser transport.
- You can read off the additive latency floor from the metrics — write the number down; it is your baseline for §C.
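To make the "sum, not max" point concrete, here is a minimal sketch of the bookkeeping for one turn. The four stage names and every millisecond value are hypothetical placeholders, not Pipecat output; substitute the numbers your own log prints.

```python
# Hypothetical per-stage timings (ms) copied by hand from one turn's log lines.
# These values are placeholders; replace them with your own measurements.
stage_ms = {
    "turn_detection_wait": 200.0,
    "stt": 150.0,
    "llm_ttft": 250.0,
    "tts_ttfb": 150.0,
}


def cascade_floor_ms(stages: dict[str, float]) -> float:
    """Serial stages each wait for the previous one, so the floor is the sum."""
    return sum(stages.values())


def slowest_stage_ms(stages: dict[str, float]) -> float:
    """What the floor would be if the stages could fully overlap (they cannot)."""
    return max(stages.values())


floor = cascade_floor_ms(stage_ms)
print(f"additive floor: {floor:.0f} ms")
print(f"slowest single stage: {slowest_stage_ms(stage_ms):.0f} ms")
```

The gap between the two printed numbers is the cost of serialization: it is the part of the floor that tuning any single stage cannot remove.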
B · Neural audio codecs (RVQ)
To run a language model directly over audio, you need audio as a short stream of discrete tokens — not a 24 kHz waveform. Residual Vector Quantization is how a neural codec turns continuous audio into a tiny stack of integer codes per frame.
How RVQ works
Encode the audio to a latent vector x. A single codebook can only get close. So we quantize the residual with a stack of codebooks, each one cleaning up what the previous left behind:
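The residual loop can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not Mimi's implementation: the 2-D codebooks and the input vector are made up, and real codecs learn their codebooks and work in far higher dimensions.

```python
def nearest(codebook, v):
    """Index of the codebook entry closest to v (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, v))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(x, codebooks):
    """Quantize level by level; each level quantizes the previous residual."""
    residual = list(x)
    codes = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        # Subtract what this level explained; the next level sees the rest.
        residual = [r - q for r, q in zip(residual, cb[idx])]
    return codes, residual


def rvq_decode(codes, codebooks):
    """Reconstruction is the sum of the chosen entries across all levels."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(codes, codebooks):
        out = [o + q for o, q in zip(out, cb[idx])]
    return out


# Toy codebooks (made up): a coarse level, then a finer correction level.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
x = [1.3, 0.9]
codes, residual = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
print(codes, x_hat, residual)
```

The integer list `codes` is the "column" for one frame: one small integer per codebook level, which is what the language model actually sees.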
Semantic tokens carry what was said (phonetic / linguistic content — distilled from a self-supervised speech model). Acoustic tokens carry how it sounded (speaker timbre, prosody, recording detail). Codecs like Mimi reserve the first codebook for a semantic token and the rest for acoustic residual — so the LM can reason about content and reproduce voice from the same stream.
Mimi (Kyutai) — 12.5 Hz frames
Mimi is the codec under Moshi. The key number is its 12.5 Hz frame rate: only 12.5 token-steps per second of audio. A low frame rate is what makes a Transformer-over-audio tractable in real time — the LM only has to emit ~12.5 steps/s instead of hundreds.
How this maps to my build
Mimi is the substrate Moshi already uses; I don't train a codec. I just need to know that each 80 ms frame is a column of RVQ codes, because that frame is also the unit I read for lip-sync RMS (§E) and the cadence behind the 160 ms latency derivation (§C).
C · Moshi — full-duplex speech-to-speech
Moshi (Kyutai) is a real-time, speech-to-speech foundation model. It is the core of our voice stack. Paper: arxiv.org/abs/2410.00037 · code: github.com/kyutai-labs/moshi.
① Multi-stream
Moshi models the user stream and its own speech stream simultaneously. Because it is always predicting its own next frame while ingesting yours, it can listen and speak at once — true full-duplex. Barge-in and backchannels fall out for free; there is no turn-taking gate to remove.
② RQ-Transformer
A large temporal transformer (Helium-7B) runs across time steps. A small depth transformer runs over the RVQ codebook levels within a step. Splitting time from depth is what lets a 7B model emit a multi-codebook frame every 80 ms.
③ Inner monologue
At each step Moshi predicts a text token before the audio tokens. This anchors what it is about to say (better coherence) and hands us a live text stream — the perfect place to watch for intent and fire tool calls.
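The shape of one generation step, as described in ② and ③, can be sketched with toy stand-ins. Nothing here is Moshi's code: `temporal_step` and `depth_step` are hypothetical arithmetic placeholders for the two transformers, and `NUM_CODEBOOKS = 8` is an assumption about the number of codec levels per frame. The only thing the sketch is meant to show is the loop structure: one pass over time, then a short inner loop that emits the text token first and the audio codes after it.

```python
from typing import List, Tuple

NUM_CODEBOOKS = 8  # assumption: codec levels per frame (first one semantic)

Frame = Tuple[int, List[int]]  # (text token, audio codes for one 80 ms step)


def temporal_step(history: List[Frame]) -> int:
    """Toy stand-in for the large temporal transformer: summarizes past steps."""
    return sum(text + sum(codes) for text, codes in history) % 97


def depth_step(context: int, emitted: List[int]) -> int:
    """Toy stand-in for the small depth transformer: next token within a step."""
    return (context * 31 + 7 * len(emitted) + sum(emitted)) % 2048


def generate_frame(history: List[Frame]) -> Frame:
    context = temporal_step(history)           # one pass across time
    emitted: List[int] = []
    text_token = depth_step(context, emitted)  # inner-monologue text comes first
    emitted.append(text_token)
    codes: List[int] = []
    for _ in range(NUM_CODEBOOKS):             # then the codebook levels, in order
        tok = depth_step(context, emitted)
        emitted.append(tok)
        codes.append(tok)
    return text_token, codes


history: List[Frame] = []
for _ in range(3):  # three consecutive steps
    history.append(generate_frame(history))
print(history)
```

The expensive call happens once per step; only the cheap inner call repeats per codebook level, which is the point of splitting time from depth.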
④ The latency floor
At 12.5 Hz each frame is 80 ms. With the model's one-frame lookahead/processing, the theoretical end-to-end latency is two frames — ≈160 ms. Derived below.
The 160 ms latency derivation
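A worked version of the arithmetic, as a minimal sketch. It follows this section's own two-frame account: one frame to collect the incoming audio, plus the one-frame delay the model adds. Labelling the two terms this way is my reading of the text above, so treat the breakdown as an assumption; the frame rate and sample rate are the figures from §B.

```python
FRAME_RATE_HZ = 12.5      # Mimi frame rate (§B)
SAMPLE_RATE_HZ = 24_000   # waveform sample rate mentioned in §B
MODEL_DELAY_FRAMES = 1    # the one-frame lookahead/processing delay (assumed label)

frame_ms = 1000.0 / FRAME_RATE_HZ                        # duration of one frame
samples_per_frame = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)  # waveform samples per frame

# One frame must be fully heard before it can be encoded, then the model
# adds its own delay before the matching output frame can be played.
theoretical_latency_ms = frame_ms * (1 + MODEL_DELAY_FRAMES)

print(f"frame: {frame_ms:.0f} ms ({samples_per_frame} samples)")
print(f"theoretical floor: {theoretical_latency_ms:.0f} ms")
```

This is a floor on paper; network hops, audio buffering, and compute time on the actual device all sit on top of it.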
Moshi runners and quantized weight names change. Verify flags against github.com/kyutai-labs/moshi before running.
Path A — MLX on Mac / phone-class hardware
pip install moshi_mlx
python -m moshi_mlx.local_web -q 4 # 4-bit quantized; opens a local web client
Path B — PyTorch on the GPU host (~24 GB VRAM)
pip install -U moshi
python -m moshi.server # serves the full-precision model
Build the intent watcher on the Inner-Monologue text stream
- Subscribe to Moshi's text token stream (the inner-monologue output), not the audio.
- Accumulate tokens into rolling phrases; run a lightweight intent check (keyword/regex first, a small classifier later).
- On a physical-task intent ("pick up the…", "bring me…"), emit a do_physical_task tool call to the orchestrator. The full tool schema lives on page 6 · The Creature.
- The voice stream keeps running uninterrupted: the watcher is a passive tap, never a blocking gate.
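buf = []  # rolling buffer of inner-monologue text tokens

def on_text_token(tok):
    buf.append(tok)
    if intent := detect_physical_task("".join(buf)):
        emit_tool_call("do_physical_task", intent)  # schema: page 6
        buf.clear()  # reset so the same phrase does not re-fire on the next token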
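A self-contained version of the same idea, as a sketch of the keyword/regex-first stage. Everything named here is hypothetical: the `IntentWatcher` class, the two trigger phrases in the pattern, and the `verb`/`object` payload fields (the real tool schema is on page 6). It waits for sentence-ending punctuation so the object phrase is complete before firing.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical trigger pattern: a verb phrase, then everything up to the
# end of the sentence. A small classifier can replace this later.
PHYSICAL_TASK = re.compile(r"\b(pick up|bring me)\s+(.+?)\s*[.!?]", re.IGNORECASE)


class IntentWatcher:
    """Passive tap on the text stream: it only reads tokens and emits calls."""

    def __init__(self, emit: Callable[[str, Dict[str, str]], None], max_chars: int = 200):
        self.emit = emit
        self.max_chars = max_chars  # keep only a rolling window of recent text
        self.buf = ""

    def on_text_token(self, tok: str) -> None:
        self.buf = (self.buf + tok)[-self.max_chars:]
        match = PHYSICAL_TASK.search(self.buf)
        if match:
            payload = {"verb": match.group(1).lower(), "object": match.group(2)}
            self.emit("do_physical_task", payload)
            self.buf = self.buf[match.end():]  # drop consumed text so it fires once


calls: List[Tuple[str, Dict[str, str]]] = []
watcher = IntentWatcher(lambda name, args: calls.append((name, args)))
tokens = ["Sure", ",", " I", " will", " pick", " up", " the", " red", " cup", ".",
          " One", " moment", "."]
for tok in tokens:
    watcher.on_text_token(tok)
print(calls)
```

Because `on_text_token` returns immediately and never touches the audio path, the voice stream is unaffected whether or not an intent fires.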
Optional (lowest priority): persona finetune via github.com/kyutai-labs/moshi-finetune — skip if time-poor.
- An interruptible spoken conversation that feels faster than ~500 ms runs on the phone (Path A) or host (Path B).
- The intent watcher fires do_physical_task from the text stream without interrupting speech.
D · The frontier interaction-model design
Moshi is our working model. It also points at where this is going. A next-gen interaction model (the "TML"-style frontier design) keeps the same core moves and scales them:
Full-duplex & multimodal
Not just speech-to-speech — vision and other modalities folded into the same always-on duplex stream, so the model sees and hears while it talks.
~200 ms micro-turns
Conversation modelled as a stream of tiny overlapping micro-turns rather than long alternating turns — closer to how humans actually interleave.
MoE backbone
A Mixture-of-Experts backbone (e.g. ~276B total / ~12B active) — frontier capacity, but only a small active slice per token to stay real-time.
Async background reasoning
A slow deliberative stream that thinks in the background and never blocks the fast conversational stream. Fast loop talks; slow loop plans.
This is the same fast / slow decoupling that runs through the whole course: a tight reactive loop (here, the 160 ms duplex voice loop) protected from a slower deliberative loop (planning, tool reasoning, the world model on page 4). The creature must always sound responsive even while it is thinking.
How this maps to my build
I can't run a 276B MoE on a phone, but I architect to the same pattern: Moshi as the fast duplex loop, the orchestrator (page 6) as the slow async reasoner, joined by the inner-monologue tap so deliberation never stalls the voice.
E · The expressive face
The voice loop makes the creature responsive; the face makes it alive. We render a stylized
face in the phone browser (track2_voice/face/face.html), driven over a websocket by a
tiny message: {emotion, mouth, speaking}.
Why stylized, not photorealistic — the uncanny valley
Live face demo
This is the idea, running right here — blinking eyes, a mouth driven by CSS keyframes, idle sway, and a gaze that drifts toward the viewer. In the real build, mouth and emotion come from the websocket instead of fixed animations.
{emotion, mouth, speaking} over a websocket.
Lip-sync from RMS
mouth = normalized RMS of Moshi's output-audio frame. Each 80 ms frame → one mouth-openness value → the mouth height follows the voice. Cheap, robust, no phoneme alignment needed for the baseline.
Affect from the state machine
emotion is set by the orchestrator's state machine (the affect map on page 6). The face is a pure renderer of state it is told; it owns no policy.
Two cheap "alive" wins
Gaze: phone-camera face-tracking biases the eyes toward the user. Backchannels: small nods / "mm-hm" expressions while listening (speaking=false), so it never looks frozen.
Idle life
Always-on micro-motion: blink on a randomized timer and a slow sway. A perfectly still face reads as dead; a little drift reads as alive.
- Create track2_voice/face/face.html: a self-contained canvas/SVG face (eyes, pupils, mouth) with idle blink + sway as CSS keyframes; start from the live demo above.
- Open a WebSocket to the orchestrator; parse messages {emotion, mouth, speaking} as JSON.
- On each message: set mouth-openness from mouth (RMS), switch the expression preset from emotion, and toggle backchannel idle behaviour when speaking === false.
- Compute mouth upstream: per Moshi output frame, RMS of the samples, normalized to 0–1 (clamp + smooth).
- Add gaze: request the phone camera, run lightweight face-tracking, map the detected face offset to a small pupil bias.
- Open face.html fullscreen on the phone; point Moshi's audio-out at the RMS computer feeding the socket.
const ws = new WebSocket(WS_URL);
ws.onmessage = (e) => {
  const {emotion, mouth, speaking} = JSON.parse(e.data);
  setMouthOpen(mouth);        // 0..1 RMS → mouth ry
  setExpression(emotion);     // preset from page-6 affect map
  setBackchannel(!speaking);  // nod/"mm-hm" while listening
};
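The upstream half (the "RMS computer" that feeds the socket) could look like the sketch below. It assumes audio frames arrive as floats in [-1, 1]; the `MouthTracker` class, the `gain` and `smoothing` values, and the `face_message` helper are all hypothetical and would need tuning against real Moshi output. It only builds the JSON string; sending it over the websocket is left to whatever server the orchestrator uses.

```python
import json
import math
from typing import Sequence


class MouthTracker:
    """Turns one audio frame into a smoothed 0..1 mouth-openness value."""

    def __init__(self, gain: float = 4.0, smoothing: float = 0.5):
        self.gain = gain            # hypothetical scale from raw RMS to 0..1
        self.smoothing = smoothing  # 0 = frozen, 1 = no smoothing
        self.value = 0.0

    def update(self, samples: Sequence[float]) -> float:
        if samples:
            rms = math.sqrt(sum(s * s for s in samples) / len(samples))
        else:
            rms = 0.0
        target = min(1.0, max(0.0, rms * self.gain))             # normalize + clamp
        self.value += self.smoothing * (target - self.value)    # exponential smoothing
        return self.value


def face_message(emotion: str, mouth: float, speaking: bool) -> str:
    """Serialize the {emotion, mouth, speaking} message the face expects."""
    return json.dumps({"emotion": emotion, "mouth": round(mouth, 3), "speaking": speaking})


tracker = MouthTracker()
silent_frame = [0.0] * 1920   # one 80 ms frame at 24 kHz
loud_frame = [0.5] * 1920
for frame in (silent_frame, loud_frame, loud_frame):
    mouth = tracker.update(frame)
    print(face_message("happy", mouth, speaking=mouth > 0.05))
```

Smoothing matters more than the exact gain: without it the mouth snaps open and shut once per frame, which reads as jitter rather than speech.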
- The phone face lip-syncs to audio RMS (mouth tracks Moshi's voice).
- The face shifts affect on emotion messages and blinks/sways when idle.
- Gaze biases toward the user and backchannels appear while listening.
F · Affective computing theory
How do we represent "emotion" at all? Two classic models, and a viseme layer for the mouth.
Ekman — discrete basic emotions
A small fixed set (happy, sad, anger, fear, surprise, disgust). Easy to map to face presets — what we use for the v1 state machine. Limitation: blocky, no in-betweens.
Valence–Arousal — continuous
Two axes: valence (unpleasant↔pleasant) × arousal (calm↔excited). Any emotion is a point; the discrete labels are just regions. Lets us interpolate.
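A minimal sketch of "labels are regions, points interpolate". The coordinates below are hypothetical placements chosen for illustration, not values from any published dataset; the helper names are mine.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]  # (valence, arousal), each in [-1, 1]

# Hypothetical positions for the six discrete labels; tune by eye on the face.
VA: Dict[str, Point] = {
    "happy": (0.8, 0.5),
    "sad": (-0.7, -0.4),
    "anger": (-0.6, 0.8),
    "fear": (-0.7, 0.6),
    "surprise": (0.2, 0.9),
    "disgust": (-0.6, 0.2),
}


def lerp_affect(start: str, end: str, t: float) -> Point:
    """Point a fraction t of the way from one label to another (t clamped to 0..1)."""
    t = min(1.0, max(0.0, t))
    (v0, a0), (v1, a1) = VA[start], VA[end]
    return (v0 + (v1 - v0) * t, a0 + (a1 - a0) * t)


def nearest_label(point: Point) -> str:
    """Map a continuous point back to the closest discrete label."""
    v, a = point
    return min(VA, key=lambda k: (VA[k][0] - v) ** 2 + (VA[k][1] - a) ** 2)


midway = lerp_affect("happy", "sad", 0.5)
print(midway, nearest_label(midway))
```

`nearest_label` is what keeps the v1 renderer usable during the transition: the orchestrator can move in continuous space while the face still receives a preset name.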
Visemes & coarticulation
For better lip-sync than raw RMS, map phonemes to visemes (the small set of visually distinct mouth shapes — many phonemes look identical). Coarticulation means a viseme is influenced by its neighbours, so we blend shapes over time rather than snapping between them. RMS gives a working baseline today; visemes are the upgrade once we have phoneme timing (the inner-monologue text stream is a natural source).
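A sketch of the viseme upgrade, assuming phoneme start times are already available. The phoneme labels, the four-viseme table, and the (openness, width) shape values are all made up for illustration, and the linear crossfade is a deliberately simple stand-in for coarticulation: the mouth starts moving toward the next shape shortly before that phoneme begins.

```python
from typing import Dict, List, Tuple

Shape = Tuple[float, float]  # (openness, width), both 0..1

# Hypothetical, deliberately tiny tables. Note the many-to-one mapping.
PHONEME_TO_VISEME: Dict[str, str] = {
    "P": "closed", "B": "closed", "M": "closed",
    "F": "lip_teeth", "V": "lip_teeth",
    "AA": "open", "AE": "open",
    "UW": "round", "OW": "round",
}
VISEME_SHAPE: Dict[str, Shape] = {
    "rest": (0.1, 0.5),
    "closed": (0.0, 0.5),
    "lip_teeth": (0.1, 0.6),
    "open": (0.9, 0.6),
    "round": (0.4, 0.2),
}


def shape_for(phoneme: str) -> Shape:
    return VISEME_SHAPE[PHONEME_TO_VISEME.get(phoneme, "rest")]


def shape_at(timeline: List[Tuple[float, str]], t: float, blend_s: float = 0.08) -> Shape:
    """Mouth shape at time t; timeline is (start_seconds, phoneme), sorted by start."""
    started = [i for i, (start, _) in enumerate(timeline) if start <= t]
    if not started:
        return VISEME_SHAPE["rest"]
    idx = started[-1]
    current = shape_for(timeline[idx][1])
    if idx + 1 == len(timeline):
        return current
    next_start, next_phoneme = timeline[idx + 1]
    lead = next_start - t
    if lead >= blend_s:
        return current
    # Inside the blend window: crossfade toward the upcoming viseme.
    w = 1.0 - lead / blend_s
    upcoming = shape_for(next_phoneme)
    return tuple(c + (n - c) * w for c, n in zip(current, upcoming))


timeline = [(0.0, "M"), (0.10, "AA"), (0.30, "UW")]
for t in (0.0, 0.06, 0.2):
    print(t, shape_at(timeline, t))
```

The renderer interface does not change: the tuple returned here can drive the same mouth parameters that the RMS float drives today.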
v1 = Ekman discrete presets + RMS mouth. v2 = continuous valence–arousal interpolation for affect (smooth transitions) + viseme blending for the mouth. Same renderer, richer signals.
How this maps to my build
The face renderer reads whatever the orchestrator sends. Today that's a discrete emotion string + an RMS mouth float. Moving to valence–arousal and visemes is a signal change, not a rewrite, which is exactly the kind of cheap upgrade the 14-day sprint leaves room for.
Track 2 — done when
- The phone holds an interruptible spoken conversation that feels faster than ~500 ms (Moshi, Lab 5b).
- The face lip-syncs to audio RMS and shifts affect over the websocket (Lab 5c).
- You can articulate why removing the text bottleneck (Moshi / full-duplex) beats the cascade — and back it with the latency number you measured in Lab 5a vs the ≈160 ms floor.