Module 5 · Interaction Models

Neural codecs, full-duplex voice, and the expressive face

This is the voice + face half of the creature. We tear down the classic voice-agent cascade, measure its latency floor, then replace it with a speech-native full-duplex model (Moshi) running on the phone — and give it a stylized face that lip-syncs and shows affect.

Track 2 · Voice

A · The cascade and its structural limits

The default way to build a voice agent is a pipeline of independent models. It works, it is easy to reason about, and it is the thing we want to beat. Before optimizing anything, we diagram it and measure where the milliseconds go.

[Figure: two pipelines compared.
Top, the cascade as a serial latency stack: transport.input → VAD + turn → STT (audio→text) → LLM (text→text) → TTS (text→audio) → transport.output, annotated ~30 ms, turn-wait, +120 ms, +300 ms TTFT, +150 ms TTFB. Latencies are additive and serial, and the text in the middle deletes prosody, affect, and precise timing.
Bottom, full-duplex (Moshi): one model, two streams, no gate. User audio in → speech-to-speech LM over audio tokens, listening and speaking simultaneously → agent audio out. ≈160 ms floor (derived in §C); barge-in and backchannels are free because nothing waits for a turn.]
Top: the serial cascade we measure in Lab 5a. Bottom: the duplex model we build in Lab 5b.

Why the cascade has an irreducible latency floor

1 · Serial stages compound

VAD must decide the turn ended, then STT runs, then the LLM produces its first token, then TTS produces its first byte. Each waits for the previous. The floor is the sum, not the max.
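To make "sum, not max" concrete, here is a minimal sketch using the per-stage figures from the cascade diagram, read left to right. The 200 ms turn-wait is a hypothetical placeholder (the diagram gives no number for it, and it depends on your VAD and smart-turn settings), and the function names are illustrative.

```python
# Per-stage latencies in ms. All values except turn_wait come from the
# cascade diagram; turn_wait is a hypothetical placeholder.
STAGES_MS = {
    "transport": 30,
    "turn_wait": 200,   # assumption: depends on VAD / smart-turn settings
    "stt": 120,
    "llm_ttft": 300,
    "tts_ttfb": 150,
}


def cascade_floor_ms(stages: dict[str, float]) -> float:
    """Serial stages: each waits for the previous one, so latencies add."""
    return sum(stages.values())


def overlapped_bound_ms(stages: dict[str, float]) -> float:
    """What the floor would be if every stage could overlap fully."""
    return max(stages.values())


floor = cascade_floor_ms(STAGES_MS)
bound = overlapped_bound_ms(STAGES_MS)
print(f"serial floor: {floor} ms, fully overlapped bound: {bound} ms")
```

With these numbers the serial floor lands inside the ~600–900 ms range quoted for Lab 5a, while the slowest single stage alone is far below it. That gap is what the cascade cannot close.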

2 · Text is a lossy bottleneck

STT collapses rich audio into characters. Prosody, affect, emphasis, laughter, hesitation, exact timing — all gone before the LLM ever sees the turn. TTS then re-hallucinates prosody from scratch.

3 · Rigid turn-taking

A VAD/turn gate forces strict alternation. No barge-in (interrupting the agent), no backchannels ("mm-hm" while you talk). Real conversation is overlapping; the gate forbids it.

The target

Removing the text bottleneck and the turn gate is exactly what a speech-native, full-duplex model does. Lab 5a's job is to make the floor a number you have measured, so the win in §C is concrete.

Lab 5a Stand up a Pipecat cascade — with metrics

Build the classic pipeline and turn on metrics so you can read the additive latency floor you are trying to beat.

Version-sensitive

Pipecat's processor/service APIs move fast. Pin versions and verify against docs.pipecat.ai/pipecat/get-started/quickstart before wiring the pipeline — class names and imports below are illustrative.

1 · Install

# one venv; the extras pull VAD, transports, providers, smart-turn
pip install "pipecat-ai[silero,webrtc,openai,deepgram,cartesia,runner,local-smart-turn-v3]" \
            pipecat-ai-small-webrtc-prebuilt python-dotenv loguru

2 · Build the pipeline (transport.input → STT → LLM → TTS → transport.output)

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.task import PipelineTask, PipelineParams
from pipecat.audio.vad.silero import SileroVADAnalyzer
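# services: DeepgramSTTService, OpenAILLMService, CartesiaTTSService, + a smart-turn analyzer
# `transport`, `stt`, `llm`, `tts` are constructed before this point (not shown here);
# the Silero VAD analyzer is typically handed to the transport when you create it.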

task = PipelineTask(
    Pipeline([
        transport.input(),     # Silero VAD + smart-turn-v3 on the transport
        stt, llm, tts,
        transport.output(),
    ]),
    params=PipelineParams(
        enable_metrics=True,        # ← the whole point of this lab
        enable_usage_metrics=True,
    ),
)

3 · Run, talk, read the metrics

# Pipecat logs TTFB / processing time per service when metrics are on.
# Sum: turn-detection wait + STT + LLM TTFT + TTS TTFB = your floor.

Expected output: per-stage TTFB lines in the log; add them up for a single end-to-end "mouth-closes → voice-starts" number (typically ~600–900 ms on commodity cloud services).

B · Neural audio codecs (RVQ)

To run a language model directly over audio, you need audio as a short stream of discrete tokens — not a 24 kHz waveform. Residual Vector Quantization is how a neural codec turns continuous audio into a tiny stack of integer codes per frame.

How RVQ works

Encode the audio to a latent vector x. A single codebook can only get close. So we quantize the residual with a stack of codebooks, each one cleaning up what the previous left behind:

\[ x \approx \sum_{i=1}^{N} q_i \quad \text{with} \quad q_i = \mathrm{Quantize}_i(r_{i-1}) \]
\[ r_0 = x, \qquad r_i = r_{i-1} - q_i \quad \text{(each level encodes the leftover error)} \]

Bits per frame = N codebooks × log2(codebook size). More levels = higher fidelity.
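The residual recursion can be sketched in a few lines of NumPy. This is a toy under stated assumptions, not Mimi's implementation: the codebooks are random rather than learned, the sizes (D, K, N) are arbitrary, and each codebook is given a zero codeword so that the error can never grow from one level to the next.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Quantize x level by level; return code indices and the final residual."""
    residual = x.copy()                      # r_0 = x
    codes = []
    for cb in codebooks:                     # cb has shape (K, D)
        idx = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        codes.append(idx)                    # q_i = nearest codeword to r_{i-1}
        residual = residual - cb[idx]        # r_i = r_{i-1} - q_i
    return codes, residual


def rvq_decode(codes, codebooks):
    """Reconstruct x_hat as the sum of the selected codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))


rng = np.random.default_rng(0)
D, K, N = 8, 256, 4                          # toy sizes, not Mimi's
codebooks = []
for level in range(N):
    cb = rng.normal(scale=0.5 ** level, size=(K, D))  # finer at deeper levels
    cb[0] = 0.0                              # zero codeword: error never grows
    codebooks.append(cb)

x = rng.normal(size=D)
codes, residual = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
bits_per_frame = N * np.log2(K)              # 4 levels x 8 bits each
print(codes, float(np.linalg.norm(x - x_hat)), bits_per_frame)
```

The frame is now just N small integers. The decoder only needs the same codebooks to rebuild an approximation of x, and what it cannot rebuild is exactly the final residual.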

Semantic vs acoustic tokens

Semantic tokens carry what was said (phonetic / linguistic content — distilled from a self-supervised speech model). Acoustic tokens carry how it sounded (speaker timbre, prosody, recording detail). Codecs like Mimi reserve the first codebook for a semantic token and the rest for acoustic residual — so the LM can reason about content and reproduce voice from the same stream.

Mimi (Kyutai) — 12.5 Hz frames

Mimi is the codec under Moshi. The key number is its 12.5 Hz frame rate: only 12.5 token-steps per second of audio. A low frame rate is what makes a Transformer-over-audio tractable in real time: the LM only has to emit 12.5 steps/s, rather than the 50–75 steps/s of earlier neural codecs such as EnCodec.
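The arithmetic behind the frame rate is worth doing once. The sketch below takes the 12.5 Hz frame rate and 24 kHz sample rate from the text; the codebook count (8) and codebook size (2048) are assumptions based on the configuration Kyutai reports for Mimi inside Moshi, so check them against the release you actually run.

```python
import math

FRAME_RATE_HZ = 12.5       # from the text
SAMPLE_RATE_HZ = 24_000    # from the text (24 kHz waveform)
NUM_CODEBOOKS = 8          # assumption: Mimi config reported for Moshi
CODEBOOK_SIZE = 2048       # assumption: entries per codebook

frame_ms = 1000 / FRAME_RATE_HZ                        # duration of one frame
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ     # raw samples replaced per frame
bits_per_frame = NUM_CODEBOOKS * math.log2(CODEBOOK_SIZE)
bitrate_bps = bits_per_frame * FRAME_RATE_HZ
audio_tokens_per_second = NUM_CODEBOOKS * FRAME_RATE_HZ

print(f"{frame_ms} ms/frame, {samples_per_frame} samples/frame")
print(f"{bits_per_frame} bits/frame, {bitrate_bps} bit/s, "
      f"{audio_tokens_per_second} audio tokens/s")
```

Under these assumptions each 80 ms column stands in for 1920 waveform samples, which is the compression that lets a language model keep up with speech in real time.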

[Figure: Mimi's RVQ codebook stack producing a token stream at 12.5 Hz. The audio encoder outputs x; codebook 1 is semantic, codebook 2 onward are acoustic residuals, with r_i = r_{i-1} − q_i down the stack. One frame = 80 ms = one column of N codes, and 12.5 columns per second go to the LM. A symmetric Mimi decoder turns codes back into a waveform at output.]
Each frame is a short column of codes; the stream runs at 12.5 Hz. That low rate is the whole game.

How this maps to my build: Mimi is the substrate Moshi already uses, so I don't train a codec. I just need to know that each 80 ms frame is a column of RVQ codes, because that frame is also the unit I read for lip-sync RMS (§E) and the cadence behind the 160 ms latency derivation (§C).

C · Moshi — full-duplex speech-to-speech

Moshi (Kyutai) is a real-time, speech-to-speech foundation model. It is the core of our voice stack. Paper: arxiv.org/abs/2410.00037 · code: github.com/kyutai-labs/moshi.

[Figure: Moshi as an RQ-Transformer over multiple streams. The user audio stream (in) and Moshi's audio stream (out) are modelled jointly, giving true full-duplex. The temporal transformer (Helium-7B) runs across time steps t, t+1, t+2, t+3; one column is one 80 ms frame. Zooming into one step, the small depth transformer runs over the RVQ levels in order: (1) text token (inner monologue, predicted first), (2) audio code level 1, (3) audio code level 2, (4) audio code level 3, and so on. The text stream feeds the intent watcher → tool calls (page 6); the audio codes go through the Mimi decoder → spoken audio. There is no VAD turn gate anywhere: the model is always predicting its own next frame while ingesting yours.]
The four ideas that beat the cascade: multi-stream, RQ-Transformer (temporal + depth), inner monologue, and a sub-200 ms latency floor.

① Multi-stream

Moshi models the user stream and its own speech stream simultaneously. Because it is always predicting its own next frame while ingesting yours, it can listen and speak at once — true full-duplex. Barge-in and backchannels fall out for free; there is no turn-taking gate to remove.

② RQ-Transformer

A large temporal transformer (Helium-7B) runs across time steps. A small depth transformer runs over the RVQ codebook levels within a step. Splitting time from depth is what lets a 7B model emit a multi-codebook frame every 80 ms.

③ Inner monologue

At each step Moshi predicts a text token before the audio tokens. This anchors what it is about to say (better coherence) and hands us a live text stream — the perfect place to watch for intent and fire tool calls.
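The control flow of ② and ③ can be shown with a toy loop. This is a schematic only: `temporal_context` and `depth_predict` are hypothetical stand-ins for the two transformers (they do arithmetic, not inference), and the sketch leaves out the user stream and any inter-stream delay. What it does show is the ordering: one temporal pass per frame, then a short depth pass that emits the text token first and the audio codes after it.

```python
NUM_CODEBOOKS = 4      # toy value, not Moshi's real depth
VOCAB = 2048           # toy vocabulary size shared by all streams

calls = {"temporal": 0, "depth": 0}


def temporal_context(history):
    """Stand-in for the large temporal transformer: runs once per frame."""
    calls["temporal"] += 1
    return len(history)                      # toy context: the step index


def depth_predict(context, level, previous):
    """Stand-in for the small depth transformer: runs once per token in a frame."""
    calls["depth"] += 1
    return (31 * context + 7 * level + sum(previous)) % VOCAB


def generate_frame(history):
    context = temporal_context(history)
    tokens = [depth_predict(context, 0, [])]          # level 0: text token first
    for level in range(1, NUM_CODEBOOKS + 1):         # then audio codes, in order
        tokens.append(depth_predict(context, level, tokens))
    return {"text": tokens[0], "audio": tokens[1:]}


history = []
for _ in range(3):                                    # three 80 ms frames
    history.append(generate_frame(history))

print(history)
print(calls)
```

The expensive model is called once per frame and the cheap one once per token, which is the split that keeps a multi-codebook frame inside an 80 ms budget.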

④ The latency floor

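At 12.5 Hz each frame is 80 ms. Add one more frame for the model's lookahead/processing (Kyutai describes this second 80 ms as the acoustic delay) and the theoretical end-to-end latency is two frames, ≈160 ms. Kyutai reports a practical latency of about 200 ms on an L4 GPU. Derived below.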

The 160 ms latency derivation

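\[ \tau_{\text{frame}} = \frac{1}{12.5\ \text{Hz}} = 80\ \text{ms} \]
\[ \text{latency}_{\text{e2e}} \approx \tau_{\text{frame}} + \tau_{\text{lookahead/proc}} = 80\ \text{ms} + 80\ \text{ms} = 160\ \text{ms} \]
One frame to observe your audio, one frame for the model's processing/lookahead before it speaks.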

~600–900 ms: cascade floor (Lab 5a)
≈160 ms: Moshi theoretical floor
<500 ms: target felt latency on phone
12.5 Hz: Mimi frame rate

Lab 5b Run Moshi + build the intent watcher

Version-sensitive

Moshi runners and quantized weight names change. Verify flags against github.com/kyutai-labs/moshi before running.

Path A — MLX on Mac / phone-class hardware

pip install moshi_mlx
python -m moshi_mlx.local_web -q 4   # 4-bit quantized; opens a local web client

Path B — PyTorch on the GPU host (~24 GB VRAM)

pip install -U moshi
python -m moshi.server                # serves the full-precision model

Build the intent watcher on the Inner-Monologue text stream

  1. Subscribe to Moshi's text token stream (the inner-monologue output), not the audio.
  2. Accumulate tokens into rolling phrases; run a lightweight intent check (keyword/regex first, a small classifier later).
  3. On a physical-task intent ("pick up the…", "bring me…"), emit a do_physical_task tool call to the orchestrator. The full tool schema lives on page 6 · The Creature.
  4. The voice stream keeps running uninterrupted — the watcher is a passive tap, never a blocking gate.
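
buf: list[str] = []   # rolling inner-monologue text

def on_text_token(tok: str) -> None:
    # detect_physical_task and emit_tool_call are placeholders you implement
    buf.append(tok)
    if intent := detect_physical_task("".join(buf)):
        emit_tool_call("do_physical_task", intent)   # schema: page 6
        buf.clear()   # do not re-fire on the same phrase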
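
A slightly fuller, self-contained version of the watcher might look like the sketch below. Everything here is an assumption for illustration: the `IntentWatcher` class, the regex, the payload shape, and the idea that tokens arrive as decoded text fragments are not part of Moshi's API, and the real tool schema is the one on page 6. The keyword/regex check follows step 2; it waits for sentence punctuation so it does not fire on a half-spoken object.

```python
import re

# Hypothetical trigger phrases; swap in a small classifier later.
PHYSICAL_TASK = re.compile(r"\b(pick up|bring me)\s+(.+?)[.!?]", re.IGNORECASE)


class IntentWatcher:
    """Passive tap on the text stream: it only reads tokens and emits calls."""

    def __init__(self, emit, max_chars=200):
        self.emit = emit              # callback, e.g. a send to the orchestrator
        self.max_chars = max_chars    # cap on the rolling buffer
        self.buf = ""

    def on_text_token(self, tok: str) -> None:
        self.buf = (self.buf + tok)[-self.max_chars:]
        match = PHYSICAL_TASK.search(self.buf)
        if match:
            payload = {"verb": match.group(1).lower(), "object": match.group(2)}
            self.emit("do_physical_task", payload)
            self.buf = ""             # do not re-fire on the same phrase


calls = []
watcher = IntentWatcher(emit=lambda name, payload: calls.append((name, payload)))
for tok in ["Sure", ",", " I", " will", " pick", " up", " the", " red", " cup", "."]:
    watcher.on_text_token(tok)
for tok in [" Bring", " me", " water", "!"]:
    watcher.on_text_token(tok)
print(calls)
```

Because `on_text_token` returns immediately and never touches the audio path, the voice loop keeps running whether or not an intent fires.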

Optional (lowest priority): persona finetune via github.com/kyutai-labs/moshi-finetune — skip if time-poor.

D · The frontier interaction-model design

Moshi is our working model. It also points at where this is going. A next-gen interaction model (the "TML"-style frontier design) keeps the same core moves and scales them:

Full-duplex & multimodal

Not just speech-to-speech — vision and other modalities folded into the same always-on duplex stream, so the model sees and hears while it talks.

~200 ms micro-turns

Conversation modelled as a stream of tiny overlapping micro-turns rather than long alternating turns — closer to how humans actually interleave.

MoE backbone

A Mixture-of-Experts backbone (e.g. ~276B total / ~12B active) — frontier capacity, but only a small active slice per token to stay real-time.

Async background reasoning

A slow deliberative stream that thinks in the background and never blocks the fast conversational stream. Fast loop talks; slow loop plans.

Course-wide principle

This is the same fast / slow decoupling that runs through the whole course: a tight reactive loop (here, the 160 ms duplex voice loop) protected from a slower deliberative loop (planning, tool reasoning, the world model on page 4). The creature must always sound responsive even while it is thinking.
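
The decoupling can be sketched with asyncio. This is a toy under stated assumptions: `fast_loop` and `slow_reasoner` are hypothetical stand-ins, and the tick and delay durations are scaled far down from 80 ms frames so the demo finishes quickly. The point is the shape: the fast loop polls a queue without blocking and never awaits the slow task.

```python
import asyncio


async def slow_reasoner(results: asyncio.Queue, delay_s: float) -> None:
    """Stand-in for planning / tool reasoning; takes many frames to finish."""
    await asyncio.sleep(delay_s)
    await results.put("plan ready")


async def fast_loop(results: asyncio.Queue, ticks: int, tick_s: float) -> dict:
    """Stand-in for the duplex voice loop: one iteration per frame."""
    log = {"ticks": 0, "result_tick": None}
    for tick in range(ticks):
        try:
            results.get_nowait()          # non-blocking poll for slow-loop output
            log["result_tick"] = tick
        except asyncio.QueueEmpty:
            pass                          # nothing yet: keep talking
        await asyncio.sleep(tick_s)       # stands in for one frame of audio work
        log["ticks"] += 1
    return log


async def main() -> dict:
    results: asyncio.Queue = asyncio.Queue()
    slow = asyncio.create_task(slow_reasoner(results, delay_s=0.02))
    log = await fast_loop(results, ticks=100, tick_s=0.001)
    await slow
    return log


log = asyncio.run(main())
print(log)
```

The fast loop completes every tick, and the slow result simply shows up at some later tick. If the fast loop awaited the reasoner directly, every frame in between would be silence.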

How this maps to my build: I can't run a 276B MoE on a phone, but I architect to the same pattern: Moshi as the fast duplex loop, the orchestrator (page 6) as the slow async reasoner, joined by the inner-monologue tap so deliberation never stalls the voice.

E · The expressive face

The voice loop makes the creature responsive; the face makes it alive. We render a stylized face in the phone browser (track2_voice/face/face.html), driven over a websocket by a tiny message: {emotion, mouth, speaking}.

Why stylized, not photorealistic — the uncanny valley

[Figure: the uncanny-valley curve. Horizontal axis: human-likeness; vertical axis: affinity. Marked points: cartoon, our stylized face, the uncanny valley at near-but-not-quite human, and real human.]
Affinity rises with human-likeness, then dips sharply just below "real." A clearly stylized face sits safely on the left peak — we aim there on purpose.

Live face demo

This is the idea, running right here — blinking eyes, a mouth driven by CSS keyframes, idle sway, and a gaze that drifts toward the viewer. In the real build, mouth and emotion come from the websocket instead of fixed animations.

Demo only — buttons nudge the demo's expression. The production face takes {emotion, mouth, speaking} over a websocket.

Lip-sync from RMS

mouth = normalized RMS of Moshi's output-audio frame. Each 80 ms frame → one mouth-openness value → the mouth height follows the voice. Cheap, robust, no phoneme alignment needed for the baseline.
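
A minimal sketch of that computation, assuming float samples in [-1, 1] at 24 kHz (so 1920 samples per 80 ms frame). The `MouthTracker` class, the reference level of 0.25, and the smoothing factor are hypothetical choices to tune by eye, not values from Moshi.

```python
import numpy as np


def frame_rms(samples: np.ndarray) -> float:
    """RMS of one audio frame; samples are floats in [-1, 1]."""
    if samples.size == 0:
        return 0.0
    return float(np.sqrt(np.mean(np.square(samples))))


class MouthTracker:
    """Maps per-frame RMS to a smoothed mouth-openness value in 0..1."""

    def __init__(self, ref_rms: float = 0.25, smoothing: float = 0.5):
        self.ref_rms = ref_rms        # RMS that maps to a fully open mouth (assumed)
        self.smoothing = smoothing    # 0 = no smoothing; closer to 1 = slower mouth
        self.value = 0.0

    def update(self, samples: np.ndarray) -> float:
        target = min(max(frame_rms(samples) / self.ref_rms, 0.0), 1.0)  # normalize + clamp
        self.value = self.smoothing * self.value + (1 - self.smoothing) * target
        return self.value


SAMPLE_RATE = 24_000
FRAME = SAMPLE_RATE * 80 // 1000              # 1920 samples per 80 ms frame
t = np.arange(FRAME) / SAMPLE_RATE
silence = np.zeros(FRAME)
loud = 0.5 * np.sin(2 * np.pi * 200 * t)      # loud enough to hit the clamp

tracker = MouthTracker()
values = [tracker.update(f) for f in (silence, loud, loud, silence)]
print([round(v, 3) for v in values])          # → [0.0, 0.5, 0.75, 0.375]
```

The exponential smoothing is what keeps the mouth from flapping on every frame: it opens over a couple of frames and closes gradually when the audio stops.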

Affect from the state machine

emotion is set by the orchestrator's state machine (the affect map on page 6). The face is a pure renderer of state it is told; it owns no policy.

Two cheap "alive" wins

Gaze: phone-camera face-tracking biases the eyes toward the user. Backchannels: small nods / "mm-hm" expressions while listening (speaking=false), so it never looks frozen.

Idle life

Always-on micro-motion: blink on a randomized timer and a slow sway. A perfectly still face reads as dead; a little drift reads as alive.

Lab 5c Build face.html driven by a websocket

  1. Create track2_voice/face/face.html: a self-contained canvas/SVG face (eyes, pupils, mouth) with idle blink + sway as CSS keyframes — start from the live demo above.
  2. Open a WebSocket to the orchestrator; parse messages {emotion, mouth, speaking} as JSON.
  3. On each message: set mouth-openness from mouth (RMS), switch the expression preset from emotion, and toggle backchannel idle behaviour when speaking===false.
  4. Compute mouth upstream: per Moshi output frame, RMS of the samples, normalized to 0–1 (clamp + smooth).
  5. Add gaze: request the phone camera, run lightweight face-tracking, map the detected face offset to a small pupil bias.
  6. Open face.html fullscreen on the phone; point Moshi's audio-out at the RMS computer feeding the socket.

// WS_URL, setMouthOpen, setExpression and setBackchannel are defined elsewhere in face.html
const ws = new WebSocket(WS_URL);
ws.onmessage = (e) => {
  const {emotion, mouth, speaking} = JSON.parse(e.data);
  setMouthOpen(mouth);          // 0..1 RMS → mouth ry
  setExpression(emotion);       // preset from page-6 affect map
  setBackchannel(!speaking);    // nod / "mm-hm" while listening
};

F · Affective computing theory

How do we represent "emotion" at all? Two classic models, and a viseme layer for the mouth.

Ekman — discrete basic emotions

A small fixed set (happy, sad, anger, fear, surprise, disgust). Easy to map to face presets — what we use for the v1 state machine. Limitation: blocky, no in-betweens.

Valence–Arousal — continuous

Two axes: valence (unpleasant↔pleasant) × arousal (calm↔excited). Any emotion is a point; the discrete labels are just regions. Lets us interpolate.

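[Figure: the valence–arousal plane. Horizontal axis: − valence to + valence; vertical axis: − arousal to + arousal. Labelled regions: excited, happy, angry, afraid, sad, bored, calm, content, with an arrow marked "interpolate" between regions.]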
Discrete labels are just regions on the continuous plane — so we can blend "happy → calm" by sliding along it.
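
As a sketch of what "sliding along the plane" means in code: give each label a point, interpolate linearly between two points, and map any point back to its nearest label when a discrete preset is needed. The coordinates below are hypothetical placements chosen for illustration, not values from the affect map on page 6.

```python
import math

# Hypothetical (valence, arousal) coordinates in [-1, 1]; tune to taste.
EMOTIONS = {
    "happy": (0.8, 0.5),
    "excited": (0.6, 0.9),
    "calm": (0.6, -0.6),
    "content": (0.8, -0.3),
    "sad": (-0.7, -0.4),
    "bored": (-0.4, -0.8),
    "angry": (-0.7, 0.8),
    "afraid": (-0.8, 0.5),
}


def blend(a: str, b: str, t: float) -> tuple[float, float]:
    """Linear interpolation from emotion a (t=0) to emotion b (t=1)."""
    (va, aa), (vb, ab) = EMOTIONS[a], EMOTIONS[b]
    t = min(max(t, 0.0), 1.0)
    return (va + (vb - va) * t, aa + (ab - aa) * t)


def nearest_label(point: tuple[float, float]) -> str:
    """Map a continuous point back to the closest discrete label."""
    return min(EMOTIONS, key=lambda name: math.dist(EMOTIONS[name], point))


mid = blend("happy", "calm", 0.5)
print(mid, nearest_label(mid))
```

With these placements the midpoint of "happy → calm" passes through "content", which is the kind of in-between the discrete presets alone cannot express.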

Visemes & coarticulation

For better lip-sync than raw RMS, map phonemes to visemes (the small set of visually distinct mouth shapes — many phonemes look identical). Coarticulation means a viseme is influenced by its neighbours, so we blend shapes over time rather than snapping between them. RMS gives a working baseline today; visemes are the upgrade once we have phoneme timing (the inner-monologue text stream is a natural source).
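
A minimal sketch of viseme blending, assuming phoneme start times are already available. The phoneme-to-viseme table (ARPAbet-like symbols), the viseme names, and the 40 ms blend window are hypothetical; real tables are larger and differ by rig. Each phoneme ramps in around its start and out around the next phoneme's start, so neighbouring shapes overlap instead of snapping.

```python
# Hypothetical phoneme → viseme table; many phonemes share one mouth shape.
PHONEME_TO_VISEME = {
    "P": "closed", "B": "closed", "M": "closed",
    "F": "lip_teeth", "V": "lip_teeth",
    "AA": "open", "AE": "open",
    "UW": "round", "OW": "round",
    "IY": "wide", "EH": "wide",
}


def _clamp01(v: float) -> float:
    return min(max(v, 0.0), 1.0)


def viseme_weights(timeline, t, blend_s=0.04):
    """timeline: list of (start_s, phoneme) sorted by start time.

    Returns {viseme: weight} at time t, cross-fading linearly over
    blend_s seconds centred on each phoneme boundary.
    """
    weights = {}
    for i, (start, phoneme) in enumerate(timeline):
        end = timeline[i + 1][0] if i + 1 < len(timeline) else float("inf")
        ramp_in = _clamp01((t - (start - blend_s / 2)) / blend_s)
        ramp_out = _clamp01(((end + blend_s / 2) - t) / blend_s)
        w = min(ramp_in, ramp_out)
        if w > 0:
            viseme = PHONEME_TO_VISEME[phoneme]
            weights[viseme] = weights.get(viseme, 0.0) + w
    return weights


timeline = [(0.0, "M"), (0.10, "AA"), (0.20, "P")]   # roughly "map"
for t in (0.05, 0.10, 0.15):
    print(t, viseme_weights(timeline, t))
```

Mid-phoneme the mouth holds a single shape; at a boundary it is an even mix of the two neighbours. The renderer can feed these weights straight into shape blending, which is the same "signal change, not a rewrite" upgrade described for affect.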

Upgrade path

v1 = Ekman discrete presets + RMS mouth. v2 = continuous valence–arousal interpolation for affect (smooth transitions) + viseme blending for the mouth. Same renderer, richer signals.

How this maps to my build: The face renderer reads whatever the orchestrator sends. Today that's a discrete emotion string + an RMS mouth float. Moving to valence–arousal and visemes is a signal change, not a rewrite, which is exactly the kind of cheap upgrade the 14-day sprint leaves room for.

Track 2 — done when

Track 2 milestone