ANIMA-Kiwi Course

The creature, end to end

Everything in this course exists to fill in one of the boxes below. Watch the signals move: voice in teal, body in clay, the orchestrator's creature-state in amber.

Full ANIMA-Kiwi dataflow. The fast voice loop (teal) and slow body loop (clay) never block each other; the orchestrator (amber) is the only thing that talks to both.

Legend: Voice / interaction (Track 2) in teal · Body / VLA (Track 1) in clay · Orchestrator / creature (Track 3) in amber

Two load-bearing invariants

Two design decisions shape every page and every lab. If you remember nothing else, remember these.

Invariant A

Tethered compute is the default, not a failure

The Pi 5 runs only the LeKiwi host daemon — motor I/O and camera streaming. A GPU host runs the VLA and the interaction model. The phone is the face and audio I/O.

"Phone runs everything on-device" is the north-star architecture — but it is not a Day-14 requirement. Tether first, ship the creature, then push compute toward the edge later. Pretending the Pi can run SmolVLA + Moshi is the fastest way to ship nothing.

Tracks: Body · Voice
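The placement in Invariant A can be written down as a small table and checked mechanically. The following is a minimal sketch, not course code: the component names, the `Device` enum, and the `heavy` flag are illustrative labels, and only the placement itself comes from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class Device(Enum):
    PI5 = "raspberry-pi-5"
    GPU_HOST = "gpu-host"
    PHONE = "phone"


@dataclass(frozen=True)
class Component:
    name: str        # illustrative label, not a real process name
    device: Device   # where the component runs
    heavy: bool      # True if it needs GPU-class compute


# Placement as stated in Invariant A.
PLACEMENT = [
    Component("lekiwi_host_daemon", Device.PI5, heavy=False),     # motor I/O + camera streaming
    Component("vla_policy", Device.GPU_HOST, heavy=True),         # the VLA
    Component("interaction_model", Device.GPU_HOST, heavy=True),  # the interaction model
    Component("face_and_audio_io", Device.PHONE, heavy=False),    # face + audio I/O
]


def violations(placement):
    """Return names of heavy components placed anywhere other than the GPU host."""
    return [c.name for c in placement if c.heavy and c.device is not Device.GPU_HOST]
```

An empty list from `violations(PLACEMENT)` means the tethered layout holds; moving `vla_policy` to `Device.PI5` would make it show up in the result.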

Invariant B

Navigation lives in the action space, not a nav stack

LeKiwi's action vector already contains the holonomic base velocities (x, y, θ) alongside the arm joints. So a teleop demo can record driving up to the object AND grasping it as one continuous motion.

A single SmolVLA then learns approach + manipulation end-to-end. That is why Track 3's navigation needs no motion planner, no SLAM, no nav stack for the demo — the policy that grasps is the same policy that drove there.

$a_t = [\,v_x,\ v_y,\ \omega_\theta,\ q_{1..6},\ \mathrm{grip}\,] \in \mathbb{R}^{9+}$

Tracks: Body · Creature
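To make the action layout concrete, here is a small sketch that packs and unpacks one action in the order given by the equation above. The function names and the flat-list representation are assumptions for illustration; the real ordering and dimension depend on your installed LeKiwi and lerobot versions, and the text only promises nine or more entries. Read literally, the equation gives 3 base + 6 joint + 1 gripper values.

```python
from typing import List, Sequence, Tuple

BASE_DIM = 3  # v_x, v_y, omega_theta (holonomic base velocities)
ARM_DIM = 6   # q_1..q_6 as written in the equation (assumed layout)


def pack_action(base_vel: Sequence[float],
                arm_joints: Sequence[float],
                grip: float) -> List[float]:
    """Concatenate base, arm, and gripper commands into one flat action."""
    if len(base_vel) != BASE_DIM or len(arm_joints) != ARM_DIM:
        raise ValueError("unexpected base or arm dimension")
    return [*map(float, base_vel), *map(float, arm_joints), float(grip)]


def unpack_action(action: Sequence[float]) -> Tuple[List[float], List[float], float]:
    """Split a flat action back into (base velocities, arm joints, gripper)."""
    base = list(action[:BASE_DIM])
    arm = list(action[BASE_DIM:BASE_DIM + ARM_DIM])
    grip = action[BASE_DIM + ARM_DIM]
    return base, arm, grip
```

Because driving and grasping share one vector, a recorded demo is just a sequence of these actions: early steps have non-zero base entries, later steps move the arm and gripper.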

The recurring principle: fast / slow decoupling

This pattern appears on nearly every page. A fast reactive loop keeps the creature alive and conversational; a slow deliberative loop does the heavy thinking. They run async — the fast loop must never block on the slow one.

The slow loop emits phase changes ("approaching", "grasping", "got it") and the fast loop turns them into speech and facial affect. It is the same pattern as SmolVLA async inference and "background reasoning".

Why this matters

If the dialogue blocked on the policy, the creature would freeze mid-sentence every time it thought about moving — and feel dead. Decoupling is what makes it feel alive. You will implement this exact split in 6 · The Creature.
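The split can be sketched with two asyncio tasks and a queue. This is a toy illustration under stated assumptions, not the orchestrator from 6 · The Creature: the function names are hypothetical, `asyncio.sleep` stands in for policy inference, and "speech" is just a list of phases the fast loop picked up.

```python
import asyncio

PHASES = ["approaching", "grasping", "got it"]


async def slow_body_loop(phases: asyncio.Queue, step_s: float = 0.05) -> None:
    """Stand-in for the policy: does slow work, then announces each phase."""
    for phase in PHASES:
        await asyncio.sleep(step_s)  # placeholder for inference + motion
        phases.put_nowait(phase)
    phases.put_nowait(None)          # sentinel: the body loop has finished


async def fast_voice_loop(phases: asyncio.Queue, tick_s: float = 0.01):
    """Ticks at its own rate and only polls the queue without waiting on it."""
    ticks, spoken = 0, []
    while True:
        ticks += 1
        try:
            phase = phases.get_nowait()  # non-blocking poll
        except asyncio.QueueEmpty:
            pass                         # nothing new; keep the creature "alive"
        else:
            if phase is None:
                return ticks, spoken
            spoken.append(phase)         # placeholder for speech + facial affect
        await asyncio.sleep(tick_s)


async def main():
    phases: asyncio.Queue = asyncio.Queue()
    body = asyncio.create_task(slow_body_loop(phases))
    ticks, spoken = await fast_voice_loop(phases)
    await body
    return ticks, spoken
```

Running `asyncio.run(main())` returns the tick count and the phases in order. The tick count is several times larger than the number of phases, which is the point: the fast loop kept running while the slow loop was busy.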

The learning path

Eight pages. Each one derives the theory, then hands you a runnable lab with a "done when" gate. Three of them assemble directly into the three build tracks.

01 · Foundations

Three tracks, one creature

The capability pages cluster into three build tracks. Each track owns one color across the whole course, in diagrams and badges alike.

Track 1 · Body

The robot moves and grasps.

Covered by pages 1 and 2, and built end-to-end in 3 · VLAs & Body. World-model theory in 4 feeds the body's "imagination".

Track 2 · Voice

The creature listens, speaks, and emotes.

Owned by 5 · Interaction — Moshi full-duplex dialogue and the expressive phone face.

Track 3 · Creature

One mind that ties body and voice together.

Owned by 6 · The Creature — the orchestrator, intent parsing, navigation-via-action-space, and affect. Sequenced in 7 · Plan & Demo.

The 14-day timeline

Week 1 builds each capability in isolation to a hard gate. Week 2 fuses them and ships. You are here: Day 0.

Week 1 — build each capability

Day 0: You are here. Read the course, order parts.

Day 1: Install + calibrate + teleop + cascade check.

Day 2: Record ≥50 demos · stand up Moshi.

Day 3: Fine-tune SmolVLA · design the face.

Day 4: Deploy + eval VLA · face lip-sync.

Day 5: World-models reading · barge-in.

Day 6: Pick demo policy · voice → intent.

Day 7 · GATE: VLA ≥ 60% success · expressive duplex working.
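The Day 7 VLA gate is a simple threshold on evaluation outcomes. A minimal sketch, assuming each evaluation episode is logged as a boolean success flag (the function names and the logging format are hypothetical; only the 60% threshold comes from the timeline):

```python
GATE_THRESHOLD = 0.60  # Day 7 gate from the timeline: VLA >= 60% success


def success_rate(outcomes):
    """Fraction of evaluation episodes that succeeded."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no evaluation episodes recorded")
    return sum(bool(o) for o in outcomes) / len(outcomes)


def passes_gate(outcomes, threshold=GATE_THRESHOLD):
    """True if the measured success rate meets or exceeds the gate."""
    return success_rate(outcomes) >= threshold
```

With only a handful of episodes the estimate is noisy, so a single lucky or unlucky run can flip the result; more episodes give a more trustworthy reading of the gate.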

Week 2 — fuse & ship

Day 8: Orchestrator + message bus.

Day 9: Navigation (drive-to-object in action space).

Day 10: Close the loop end-to-end.

Day 11: Hardening · failure recovery.

Day 12: HRI polish · affect tuning.

Day 13: Rehearse + record the demo.

Day 14 · SHIP: Perform demo · write up.


Full per-day runbook with commands and exit criteria lives in 7 · Plan & Demo.

The study spine

Course depth · ETH Zürich

The theory backbone of ANIMA-Kiwi is ETH's Robot Learning: From Fundamentals to Foundation Models (Oier Mees, Spring 2026). Every topic page weaves in the relevant lectures and distills the key papers, then bends them toward what you actually need to build.

You don't watch the lectures and then build. You build, and the lectures explain why the thing you just ran behaves the way it does.

How to use this course

1 · Read the theory

Each page leads with a diagram or a worked equation. Derive, don't memorize. Every theory section ends with a "maps to my build" line tying it back to the creature.

2 · Run the lab

Every lab is a sequence of numbered, runnable steps with expected output. Nothing is hand-wavy: if it's on the page, you can execute it on your hardware.

3 · Check the gate

Each lab ends with a "Done when" checklist. Don't advance until it's green. Gates are what keep a 14-day sprint from quietly slipping to 40 days.

Commands are version-sensitive

The tooling moves fast. Before running any lerobot, moshi, or pipecat command, verify the current flags with --help on your installed version — argument names and subcommands change between releases. When in doubt, trust --help over this page.
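Before checking flags with --help, it helps to know which versions you actually have installed. The sketch below uses only the standard library. The distribution names in `PACKAGES` are assumptions about how the three tools are published; adjust them to whatever `pip list` shows in your environment.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed distribution names; verify against `pip list` on your machine.
PACKAGES = ["lerobot", "moshi", "pipecat-ai"]


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def report(packages=PACKAGES):
    """Map each package name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}
```

Calling `report()` at the top of a lab and noting the result makes it much easier to tell later whether a failing command is your mistake or a changed flag.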

Day 0 lab · Stand at the starting line

1. Confirm your hardware list: LeKiwi base + SO-101 arm, Raspberry Pi 5, 2× USB cameras, a phone, and a GPU host on the same network.
2. Read 1 · Foundations: understand covariate shift before you record a single demo.
3. Skim 7 · Plan & Demo so you know what Day 7 and Day 14 demand of you.
4. Pick your single demo task now (e.g. "drive to the mug and pick it up while we chat"). Constrain it. Smaller is shippable.

Done when:

Every hardware item is accounted for and on one network.
You can state your one demo task in a single sentence with fixed-ish object positions.
You understand the fast/slow split and both invariants well enough to explain them.
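The "on one network" item in the checklist above can be spot-checked from the GPU host with a TCP connection attempt. This is a rough sketch, assuming each device exposes some open TCP port (SSH on port 22 is a common choice for the Pi); the addresses in `DEVICES` are hypothetical placeholders, and a phone typically has no open port, so check it another way.

```python
import socket


def reachable(host, port, timeout=1.0):
    """True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical addresses and ports; replace with your own.
DEVICES = {
    "pi5": ("192.168.1.50", 22),
    "gpu_host": ("192.168.1.60", 22),
}


def check_all(devices, timeout=1.0):
    """Map each device name to whether its (host, port) accepted a connection."""
    return {name: reachable(host, port, timeout)
            for name, (host, port) in devices.items()}
```

A `False` here only says the port did not answer; it cannot distinguish a powered-off device from a firewall or a wrong address.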
