A build-to-learn course

Learn robot learning by building a creature.

You are a senior real-time-voice and privacy engineer. Over 14 days you will learn how Vision-Language-Action models, world models, and interaction models actually work, not by reading but by bringing one small robot creature to life on the LeKiwi platform: an omnidirectional (holonomic) base plus an SO-101 arm that drives, grasps, and holds a natural spoken conversation behind an expressive face.

The hardware spine is deliberately small and honest:

LeKiwi base + SO-101 arm
Holonomic 3-wheel drive (x, y, θ) and a 6-DoF follower arm — one shared action vector.
Raspberry Pi 5
On-robot host daemon: motor I/O + camera streaming only. Nothing heavy.
2× USB cameras
A wrist view and a base/scene view for the policy.
Phone = head & face
Mic + speaker for voice; screen for the animated, emotive face.
14-day sprint
3 tracks · body · voice · creature
8 topic pages
1 shipped demo
Honest scope

What 14 days buys you: operational mastery of the full record → train → deploy → orchestrate loop, and a working demo of a creature that drives to an object, picks it up, and talks to you about it.

What it does NOT buy: training foundation models from scratch, or whole-apartment autonomy. We work in a single room, with semi-known object positions and tethered compute. That constraint is the whole point — it makes everything finishable, and every shortcut explicit.

The creature, end to end

Everything in this course exists to fill in one of the boxes below. Watch the signals move: voice in teal, body in clay, the orchestrator's creature-state in amber.

[Diagram: ANIMA-Kiwi dataflow]
Phone · mic + cam: audio + video in (the head)
Interaction Model: Moshi · full-duplex speech ⇄ speech, ~200 ms
Orchestrator: state machine · the bus · intent → task → affect
VLA Runner: SmolVLA · async · obs → action chunks
LeKiwi robot: base (x, y, θ) + arm joints · Pi 5 host daemon
Phone · face: emotion + mouth render
Phone · speaker: spoken reply out
Edge labels: audio+video · intent json · task str · action vector · obs (cams + state) · phase · emotion, mouth · audio · say(phase)
Full ANIMA-Kiwi dataflow. The fast voice loop (teal) and slow body loop (clay) never block each other; the orchestrator (amber) is the only thing that talks to both.
Legend: Voice / interaction (Track 2) · Body / VLA (Track 1) · Orchestrator / creature (Track 3)
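The orchestrator's job in the diagram above (intent → task → affect) can be sketched in a few lines. This is a minimal illustration, not the course's actual message schema: the JSON field names (`intent`, `object`), the `fetch` intent, the `FaceCommand` type, and the phase-to-affect table are all hypothetical. Only the phase names are taken from this page.

```python
import json
from dataclasses import dataclass

# Hypothetical phase -> (emotion, utterance) table. The real affect mapping
# is designed in 6 · The Creature; these entries are placeholders.
AFFECT = {
    "approaching": ("curious", "On my way."),
    "grasping": ("focused", "Reaching for it."),
    "got it": ("happy", "Got it!"),
}


@dataclass
class FaceCommand:
    emotion: str  # what the phone face should show
    say: str      # what the interaction model should speak


def intent_to_task(intent_json: str) -> str:
    """Turn an intent message into the task string handed to the VLA runner."""
    intent = json.loads(intent_json)
    if intent.get("intent") != "fetch":  # assumed intent name
        raise ValueError(f"unsupported intent: {intent.get('intent')!r}")
    return f"drive to the {intent['object']} and pick it up"


def phase_to_face(phase: str) -> FaceCommand:
    """Map a phase event from the slow loop to an emotion and an utterance."""
    emotion, say = AFFECT.get(phase, ("neutral", ""))
    return FaceCommand(emotion, say)
```

Calling `intent_to_task('{"intent": "fetch", "object": "mug"}')` yields the task string for the VLA runner, and `phase_to_face("got it")` yields the face and speech command for that phase. Keeping both mappings as pure functions makes the orchestrator easy to test without any hardware attached.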

Two load-bearing invariants

Two design decisions shape every page and every lab. If you remember nothing else, remember these.

Invariant A
Tethered compute is the default, not a failure

The Pi 5 runs only the LeKiwi host daemon — motor I/O and camera streaming. A GPU host runs the VLA and the interaction model. The phone is the face and audio I/O.

"Phone runs everything on-device" is the north-star architecture — but it is not a Day-14 requirement. Tether first, ship the creature, then push compute toward the edge later. Pretending the Pi can run SmolVLA + Moshi is the fastest way to ship nothing.

Tracks: Body · Voice
Invariant B
Navigation lives in the action space, not a nav stack

LeKiwi's action vector already contains the holonomic base velocities (x, y, θ) alongside the arm joints. So a teleop demo can record driving up to the object AND grasping it as one continuous motion.

A single SmolVLA then learns approach + manipulation end-to-end. That is why Track 3's navigation needs no motion planner, no SLAM, no nav stack for the demo — the policy that grasps is the same policy that drove there.

$a_t = [\, v_x,\ v_y,\ \omega_\theta,\ q_{1..6},\ \mathrm{grip} \,] \in \mathbb{R}^{9+}$
Tracks: Body · Creature
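To make the shared action vector concrete, here is a small sketch that packs base velocities and arm targets into one flat vector and splits them back out. The ordering follows the formula above, but the exact layout is an assumption: the `Action` type, the `pack` and `unpack` helpers, and the choice to count the gripper separately from six arm joints are all illustrative, so check the action features your installed LeKiwi stack actually reports.

```python
from typing import NamedTuple, Sequence

N_ARM_JOINTS = 6  # q1..q6 as written in the formula; an assumption about layout


class Action(NamedTuple):
    vx: float     # base velocity along x
    vy: float     # base velocity along y
    omega: float  # base rotation rate about θ
    arm: tuple    # arm joint targets q1..q6
    grip: float   # gripper command


def pack(action: Action) -> list[float]:
    """Flatten base + arm + gripper into one vector, base first."""
    if len(action.arm) != N_ARM_JOINTS:
        raise ValueError(f"expected {N_ARM_JOINTS} arm joints, got {len(action.arm)}")
    return [action.vx, action.vy, action.omega, *action.arm, action.grip]


def unpack(vec: Sequence[float]) -> Action:
    """Inverse of pack: recover the named fields from a flat vector."""
    if len(vec) != 3 + N_ARM_JOINTS + 1:
        raise ValueError(f"unexpected action length {len(vec)}")
    return Action(vec[0], vec[1], vec[2], tuple(vec[3:3 + N_ARM_JOINTS]), vec[-1])
```

Because driving and grasping live in the same vector, a teleop recording of "drive up, then grasp" is just a sequence of these vectors in which the first three entries dominate early and the arm entries dominate later.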

The recurring principle: fast / slow decoupling

This pattern appears on nearly every page. A fast reactive loop keeps the creature alive and conversational; a slow deliberative loop does the heavy thinking. They run async — the fast loop must never block on the slow one.

[Diagram: fast and slow loops]
FAST loop · ~200 ms: full-duplex dialogue + immediate affect. Steps: listen / speak → express affect → render face + mouth. Loops every frame, never waits.
SLOW loop · multi-second: VLA planning + action rollout. Steps: observe scene → plan chunk (SmolVLA) → execute.
Link between the loops: phase events only. The slow loop reports → the fast loop expresses.
The slow loop emits phase changes ("approaching", "grasping", "got it") and the fast loop turns them into speech and facial affect. This is the same pattern as SmolVLA's async inference and "background reasoning".
Why this matters

If the dialogue blocked on the policy, the creature would freeze mid-sentence every time it thought about moving — and feel dead. Decoupling is what makes it feel alive. You will implement this exact split in 6 · The Creature.
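The decoupling can be sketched with two `asyncio` tasks that share a queue. This is an illustration under stated assumptions, not the orchestrator you will build: `slow_loop` and `fast_loop` are hypothetical stand-ins, the sleeps replace real policy rollouts and face rendering, and `None` is used as an end-of-task sentinel. The point is that the fast loop polls with `get_nowait()` and keeps ticking whether or not the slow loop has reported anything.

```python
import asyncio


async def slow_loop(events: asyncio.Queue) -> None:
    """Stand-in for the VLA runner: work for a while, then report a phase."""
    for phase in ("approaching", "grasping", "got it"):
        await asyncio.sleep(0.05)  # placeholder for a multi-second rollout
        await events.put(phase)
    await events.put(None)  # sentinel: the task is finished


async def fast_loop(events: asyncio.Queue, tick_s: float = 0.01):
    """Stand-in for the voice/face loop: tick on its own clock, never wait."""
    phases, idle_ticks = [], 0
    while True:
        try:
            phase = events.get_nowait()  # poll; do not block on the slow loop
        except asyncio.QueueEmpty:
            idle_ticks += 1  # nothing new: keep listening and rendering
        else:
            if phase is None:
                return phases, idle_ticks
            phases.append(phase)  # here the creature would speak and emote
        await asyncio.sleep(tick_s)  # placeholder for one frame of face/audio work


async def main():
    events = asyncio.Queue()
    slow = asyncio.create_task(slow_loop(events))
    result = await fast_loop(events)
    await slow
    return result
```

Running `asyncio.run(main())` returns the phases in the order the slow loop reported them, plus a count of fast-loop ticks that passed with no new phase. A non-zero idle count is the property that matters: the fast loop kept running while the slow loop was still busy.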

The learning path

Eight pages. Each one derives the theory, then hands you a runnable lab with a "done when" gate. Three of them assemble directly into the three build tracks.

01 · Foundations

What learning a policy even means

Learn imitation learning, covariate shift, why multimodality breaks naive regression.
Build a behavior-cloning baseline + the intuition for everything after.
Start here →
02 · Action Models

How modern policies output motion

Learn ACT / CVAE, diffusion policy, and flow matching for action generation.
Build an action-chunking head and compare it to diffusion sampling.
Open →
03 · VLAs & Body · Track 1

SmolVLA, dissected and deployed

Learn how a VLA fuses vision + language + action; SmolVLA internals.
Build the full LeKiwi record → train → deploy pipeline.
Open →
04 · World Models

Imagining before acting

Learn RSSM, latent imagination, Dreamer and UniPi.
Build intuition for planning in latent space (reading + a small probe).
Open →
05 · Interaction · Track 2

A voice that listens while it talks

Learn neural audio codecs, Moshi's full-duplex architecture, the expressive face.
Build a barge-in-capable spoken loop + a lip-synced animated face.
Open →
06 · The Creature · Track 3

One mind from many models

Learn the orchestrator, voice→intent, navigation tiers, affect mapping.
Build the state machine + bus that fuse Track 1 and Track 2 into a creature.
Open →
07 · Plan & Demo

The 14-day sprint, scheduled

Learn how the pieces sequence, where the gates are, what "shipped" means.
Build your day-by-day plan + the demo runbook you'll perform on Day 14.
Open →
08 · Problem Set

Earn the intuition

Learn by deriving — flow matching, CVAE bounds, RSSM, async timing.
Build nothing physical; sharpen everything mental.
Open →

Three tracks, one creature

The capability pages cluster into three build tracks. Each track owns one color across the whole course, in diagrams and badges alike.

Track 1 · Body
The robot moves and grasps.
Grounded in pages 1 and 2, then built end-to-end in 3 · VLAs & Body. World-model theory in 4 feeds the body's "imagination".
Track 2 · Voice
The creature listens, speaks, and emotes.
Owned by 5 · Interaction — Moshi full-duplex dialogue and the expressive phone face.
Track 3 · Creature
One mind that ties body and voice together.
Owned by 6 · The Creature — the orchestrator, intent parsing, navigation-via-action-space, and affect. Sequenced in 7 · Plan & Demo.

The 14-day timeline

Week 1 builds each capability in isolation to a hard gate. Week 2 fuses them and ships. You are here: Day 0.

Week 1 — build each capability
Day 0: You are here. Read the course, order parts.
Day 1: Install + calibrate + teleop + cascade check.
Day 2: Record ≥50 demos · stand up Moshi.
Day 3: Fine-tune SmolVLA · design the face.
Day 4: Deploy + eval VLA · face lip-sync.
Day 5: World-models reading · barge-in.
Day 6: Pick demo policy · voice → intent.
Day 7 · GATE: VLA ≥ 60% success · expressive duplex working.
Week 2 — fuse & ship
Day 8: Orchestrator + message bus.
Day 9: Navigation (drive-to-object in action space).
Day 10: Close the loop end-to-end.
Day 11: Hardening · failure recovery.
Day 12: HRI polish · affect tuning.
Day 13: Rehearse + record the demo.
Day 14 · SHIP: Perform demo · write up.

Full per-day runbook with commands and exit criteria lives in 7 · Plan & Demo.
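The Day 7 gate is a simple threshold on evaluation outcomes, and it helps to compute it the same way every time. A minimal sketch follows. The 60% threshold comes from the timeline above, while the `success_rate` and `passes_gate` helpers and the idea of logging one boolean per eval trial are assumptions for illustration.

```python
def success_rate(outcomes):
    """Fraction of eval trials that succeeded (one boolean per trial)."""
    if not outcomes:
        raise ValueError("no eval trials recorded")
    return sum(bool(o) for o in outcomes) / len(outcomes)


def passes_gate(outcomes, threshold=0.60):
    """True when the measured success rate meets the Day 7 gate."""
    return success_rate(outcomes) >= threshold
```

With ten trials, a single outcome moves the rate by ten percentage points, so a borderline result is worth re-running with more trials before you call the gate green.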

The study spine

Course depth · ETH Zürich

The theory backbone of ANIMA-Kiwi is ETH's Robot Learning: From Fundamentals to Foundation Models (Oier Mees, Spring 2026). Every topic page weaves in the relevant lectures and distills the key papers, then bends them toward what you actually need to build.

You don't watch the lectures and then build. You build, and the lectures explain why the thing you just ran behaves the way it does.

How to use this course

1 · Read the theory

Each page leads with a diagram or a worked equation. Derive, don't memorize. Every theory section ends with a "maps to my build" line tying it back to the creature.

2 · Run the lab

Every lab is a numbered sequence of runnable steps with expected output. Nothing is hand-wavy: if it's on the page, you can execute it on your hardware.

3 · Check the gate

Each lab ends with a "Done when" checklist. Don't advance until it's green. Gates are what keep a 14-day sprint from quietly slipping to 40 days.

Commands are version-sensitive

The tooling moves fast. Before running any lerobot, moshi, or pipecat command, verify the current flags with --help on your installed version — argument names and subcommands change between releases. When in doubt, trust --help over this page.
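One way to make the "trust --help" habit mechanical is to check a flag against the installed tool's help text before a lab script uses it. The sketch below is generic and assumes nothing about any particular tool: `help_text` and `supports_flag` are hypothetical helpers, and the command and flag names in the usage note are placeholders rather than real lerobot, moshi, or pipecat arguments.

```python
import re
import subprocess


def help_text(cmd):
    """Return the --help output of an installed command (stdout plus stderr)."""
    result = subprocess.run(
        [*cmd, "--help"], capture_output=True, text=True, check=False
    )
    return result.stdout + result.stderr


def supports_flag(text, flag):
    """True if `flag` appears in the help text as a whole option name.

    The lookarounds stop "--batch" from matching inside "--batch-size".
    """
    pattern = r"(?<![\w-])" + re.escape(flag) + r"(?![\w-])"
    return re.search(pattern, text) is not None
```

A lab script could call `supports_flag(help_text(["your-command"]), "--your-flag")` and stop with a clear message when the flag is missing, instead of failing halfway through a recording or training run.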

Day 0 lab · Stand at the starting line
  1. Confirm your hardware list: LeKiwi base + SO-101 arm, Raspberry Pi 5, 2× USB cameras, a phone, and a GPU host on the same network.
  2. Read 1 · Foundations — understand covariate shift before you record a single demo.
  3. Skim 7 · Plan & Demo so you know what Day 7 and Day 14 demand of you.
  4. Pick your single demo task now (e.g. "drive to the mug and pick it up while we chat"). Constrain it. Smaller is shippable.

Done when:
