Plan & Demo — ANIMA-Kiwi

A · The 14-Day Sprint

You are at Day 0. Hardware is on the bench, repos are cloned. The strip below is your map; the cards under it are your daily marching orders. Each card is colour-coded by its dominant track and links to the page that teaches the how-to.

[Sprint strip. Week 1 · build each capability: install·teleop → record·Moshi → finetune·face → deploy·lipsync → barge-in → policy·intent → GATE. Week 2 · fuse & ship: orchestrator → navigation → D10 close loop → D11 harden → D12 HRI polish → D13 rehearse → D14 SHIP. Marker: you are here · Day 0.]

The sprint strip — Day 0 (you are here) → Day 14 (ship). Axis colour shows the dominant track per stretch; ◆ D7 is the hard gate, ● D14 is the ship line.

Week 1 — build each capability

Day 1 · T1 T2

Install, calibrate, teleop

LeRobot installed; LeKiwi calibrated; clean teleop loop
Stand up the Pipecat cascade (STT→LLM→TTS) baseline

→ VLAs & Body

Day 2 · T1 T2

Record demos · run Moshi

Record ≥50 demos — include driving up to the object, not just the grasp
Get Moshi running locally (full-duplex audio)

→ VLAs & Body

Day 3 · T1 T2

Fine-tune SmolVLA · design face

Fine-tune SmolVLA on your dataset (GPU)
Design the expressive face: emotion set + visemes

→ VLAs & Body

Day 4 · T1 T2

Deploy + eval VLA · face v1

Deploy policy_server; run eval rollouts; baseline success %
Face v1 on the phone with lip-sync to TTS

→ VLAs & Body

Day 5 · Theory T1 T2

World models · push VLA · barge-in

Read the world-models material (Lecture 8 spine)
Push VLA success up; clean up failure modes
Wire Moshi → face with barge-in (interruptible)

→ World Models

Day 6 · T1 T2 T3

Pick demo policy · wrap voice→intent

Choose the policy you will demo; freeze its checkpoint
Wrap voice → structured intent (the bridge to T3)

→ Interaction
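The voice → structured intent bridge from Day 6 can be pictured as a small typed record that the orchestrator consumes. The sketch below is a toy keyword matcher under stated assumptions: the `Intent` fields, the `parse_intent` helper, and the vocabularies are hypothetical and are not taken from the course code. A real watcher would more likely ask the LLM for structured output.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical vocabularies, for illustration only.
ACTIONS = {"grab": "pick", "pick": "pick", "get": "pick", "fetch": "pick"}
COLORS = {"red", "blue", "green", "yellow"}
OBJECTS = {"cup", "block"}


@dataclass(frozen=True)
class Intent:
    """Structured command handed from the voice track to the orchestrator."""
    action: str
    target: str
    color: Optional[str] = None

    def to_prompt(self) -> str:
        """Render the language prompt that would be passed to the policy."""
        noun = f"{self.color} {self.target}" if self.color else self.target
        return f"{self.action} up the {noun}"


def parse_intent(utterance: str) -> Optional[Intent]:
    """Return an Intent if the utterance names an action and an object, else None."""
    words = [w.strip(".,!?\"'").lower() for w in utterance.split()]
    action = next((ACTIONS[w] for w in words if w in ACTIONS), None)
    target = next((w for w in words if w in OBJECTS), None)
    color = next((w for w in words if w in COLORS), None)
    if action is None or target is None:
        return None  # not a command; leave it to the conversation
    return Intent(action=action, target=target, color=color)
```

Returning `None` for non-commands keeps ordinary chit-chat from triggering motion, which is the main design point of having an explicit bridge.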

Gate

Day 7 · T1 T2

GATE — week-1 capability bar

VLA ≥ 60% success on language-prompted drive-and-pick
Expressive duplex conversation working end-to-end
Buffer day for hardware gremlins — do not skip it

→ Interaction

The gate is the whole point. If you have not cleared the Day 7 bar by the end of Day 7, do not start Week 2. Fusing two unreliable subsystems gives you one very unreliable creature. Spend the buffer on whichever track is below bar.

Week 2 — fuse & ship

Day 8 · T3

Orchestrator + message bus

Stand up the orchestrator state machine
Message bus wiring all subsystems together

→ The Creature

Day 9 · T3

Navigation tier A/B

Pick the navigation tier: A (VLA drives) or B (explicit nav)
Get "drive to vicinity of object" reliable

→ The Creature

Day 10 · T3

Close the loop end-to-end

Spoken command → ack → drive → grasp → report
First full run, however shaky

→ The Creature

Day 11 · T3

Reliability hardening

Kill flaky transitions, add timeouts + retries
Measure reliability over repeated trials

→ The Creature
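The "timeouts + retries" item from Day 11 might look like the wrapper below. This is a minimal sketch, assuming each orchestrator step is an async callable; `run_with_retries`, `StepFailed`, and the parameter names are hypothetical, and treating `RuntimeError` as the transient failure type is an assumption made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


class StepFailed(Exception):
    """Raised when a step has used up all of its attempts."""


async def run_with_retries(
    step: Callable[[], Awaitable[T]],
    *,
    timeout_s: float,
    max_attempts: int,
    backoff_s: float = 0.0,
) -> T:
    """Run `step` with a per-attempt timeout, retrying on timeout or RuntimeError."""
    last_error: Optional[BaseException] = None
    for attempt in range(1, max_attempts + 1):
        try:
            # wait_for cancels the step if it exceeds the timeout.
            return await asyncio.wait_for(step(), timeout=timeout_s)
        except (asyncio.TimeoutError, RuntimeError) as exc:
            last_error = exc
            if attempt < max_attempts:
                # Linear backoff between attempts.
                await asyncio.sleep(backoff_s * attempt)
    raise StepFailed(f"gave up after {max_attempts} attempts") from last_error
```

Raising a distinct exception once attempts run out gives the orchestrator one place to decide between graceful failure and a spoken apology, rather than stalling in a transition.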

Day 12 · T2 T3

HRI polish

Gaze, backchannels ("mm-hm"), affect transitions
Make the creature feel alive, not robotic

→ Interaction

Day 13 · T3

Rehearse + record

Run the demo script start-to-finish, many times
Record clean takes; capture metrics overlay

→ The Creature

Ship

Day 14 · T3

SHIP — release + retro

Publish the demo video + RUNBOOK.md
Write-up: what worked, what didn't, metrics
Retro: the honest postmortem

→ The Creature

B · The Gates & "Done-When" Ladder

Three tracks, three completion bars. A track is not "done" because the code runs — it is done when it hits its bar on repeated, honest trials. No cherry-picking the one good take.

Track 1 · Body

Done when…

Language-prompted drive-and-pick ≥ 60% over 10 trials
Prompt varies ("grab the red cup", "pick up the block")
Includes the drive-up phase, not just a pre-positioned grasp
Failures logged with cause, not hidden
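The Track 1 bar can be checked mechanically. The sketch below is illustrative only: `Trial`, `passes_bar`, and `wilson_interval` are hypothetical names, not part of LeRobot. It enforces the "failures logged with cause" rule and also reports a Wilson score interval, a standard way to show how wide the uncertainty is with only 10 trials.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Trial:
    prompt: str
    success: bool
    failure_cause: Optional[str] = None  # must be set when success is False


def success_rate(trials: List[Trial]) -> float:
    return sum(t.success for t in trials) / len(trials)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def passes_bar(trials: List[Trial], bar: float = 0.60, min_trials: int = 10) -> bool:
    """True if enough trials were run, every failure has a cause, and the bar is met."""
    unlabeled = [t for t in trials if not t.success and not t.failure_cause]
    if unlabeled:
        raise ValueError(f"{len(unlabeled)} failed trial(s) have no logged cause")
    return len(trials) >= min_trials and success_rate(trials) >= bar
```

With 6 successes in 10 trials the interval runs from roughly 0.31 to 0.83, which is why the bar is stated over repeated trials and why more trials are worth running when time allows.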

Track 2 · Voice

Done when…

Phone holds a conversation that feels responsive (roughly sub-500 ms response latency)
Interruptible — you can talk over it (barge-in)
Lip-syncing face that moves with the speech
Affect: face emotion matches conversational state

Track 3 · Creature

Done when…

End-to-end: spoken command → expressive ack → drive → grasp → expressive report
60–70% reliable over repeated runs
Recovers or fails gracefully on the other 30–40%

Honest scope. This is 14-day operational mastery, not a research result. 60–70% end-to-end reliability with a creature that talks and grasps is a genuinely hard, genuinely impressive bar. Resist the urge to claim more than the metrics show.

C · The ETH Study Spine

The theory backbone is Robot Learning: From Fundamentals to Foundation Models (Oier Mees, ETH Zürich, Spring 2026). Twelve lectures, each mapped to where it lands in your build and which course page derives it.

cvg.ethz.ch/lectures/Robot-Learning/ · HW repo github.com/mees-robot-learning-course/ethz-course-2026

Wk	Lecture	Core papers	Maps to (track)	Course page
1	Intro to Robot Learning	—	framing	Home / Foundations
2	Robot Control & MDPs	Mania (random search RL); Irpan "Deep RL Doesn't Work Yet"; Pathak (curiosity)	T1 / RL	Foundations
3	Imitation Learning	de Haan "Causal Confusion"; Pari (representation for visual imitation); Zeng "Transporter Networks"	T1 core	Foundations
4	Reinforcement Learning I	Salimans (evolution strategies); Zeng (pushing + grasping); Luo "HIL-SERL"	T1 / past-the-ceiling	Foundations / Action Models
5	Reinforcement Learning II	Levine (end-to-end visuomotor); Ma "Eureka"; Rosete-Beas (latent plans)	T1	Action Models
6	Generative Models	Janner / Du "Diffuser"; Florence "Implicit BC"; Wagenmaker (steering diffusion policy w/ latent RL)	T1 diffusion policy	Action Models
7	Sequence Modeling & Transformers	Chen "Decision Transformer"; Zhao "ALOHA / ACT"; Radosavovic (humanoid as next-token)	T1 ACT / chunking	Action Models
8	World Models	Du "UniPi"; Hafner "Training Agents Inside Scalable World Models"; Ye "DreamZero"	T1 world models	World Models
9	Generalist Robot Policies	Lynch (play / LMP); Reed "Gato"; Physical Intelligence "π*0.6"	T1 VLAs / SmolVLA	VLAs & Body
10	Embodied Reasoning & Test-time Scaling	Fu (in-context imitation); Wang "Voyager"; Chen (efficient embodied reasoning)	T1 / T3	The Creature
11	Frontier & Open Problems	LeCun "Autonomous Machine Intelligence"; Sutton "Bitter Lesson"; Brooks "Intelligence without Representation"	all	The Creature
12	Guest lectures (Fox, Abbeel)	—	all	—

Guest spotlights worth following: Gupta · Danfei Xu · Aviral Kumar · Wagenmaker · Cheng Chi (diffusion policy / UMI) · Ted Xiao · Scott Reed (world models) · Quan Vuong (Physical Intelligence) · Archit Sharma · Lucas Beyer. Their talks and papers are the live edge of everything in the table above.

D · The Demo Runbook `demo/`

Startup sequence — bring it up in this order

Pi · LeKiwi host — power the robot, start the host process on the Raspberry Pi. Confirm motors respond. T1
GPU · policy_server — launch the VLA policy server on the GPU box. Confirm it loads your frozen Day-6 checkpoint. T1
Moshi + intent watcher — start full-duplex audio and the voice→intent watcher. Confirm a test utterance produces an intent. T2
Face on phone — open the face client; confirm it connects and lip-syncs to TTS. T2
Orchestrator — start it last; it is the conductor and assumes the others are already up. Watch its state log go to IDLE. T3
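The ordering above can be enforced by a small preflight script. This is a sketch under stated assumptions: the component labels, `bring_up`, and `port_open` are hypothetical, and the readiness checks (a TCP probe here) stand in for whatever "confirm" means for each process on your setup.

```python
import socket
from typing import Callable, Dict, List

# Mirrors the startup sequence; the names are labels, not real process names.
STARTUP_ORDER = [
    "lekiwi_host",
    "policy_server",
    "moshi_intent",
    "face_client",
    "orchestrator",
]


def port_open(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def bring_up(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run readiness checks in startup order and stop at the first failure."""
    ready: List[str] = []
    for name in STARTUP_ORDER:
        if not checks[name]():
            raise RuntimeError(f"{name} is not ready; confirmed so far: {ready}")
        ready.append(name)
    return ready
```

Stopping at the first failure keeps the orchestrator from starting against a half-built stack, which is the situation the ordering is meant to prevent.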

Capture setup. Screen = the face (phone mirrored / screen-recorded). Room cam = the robot driving + grasping. On-screen metrics = current state, intent, success flag, latency. Frame both so the face reaction and the physical action are visible together; that synchrony is what sells the creature.

The rehearsed demo script — storyboard

Run A: a clean command-to-completion. Run B: an interrupt mid-task to prove full-duplex barge-in. Each panel ties a creature state to a face emotion and what it says.

Legend: face emotion / speech · orchestrator state · interrupt (barge-in)

Panel 1 · Listen: human speaks; "grab the red cup"; state: LISTENING · attentive
Panel 2 · Think + ack: parses intent; "on it!"; state: THINKING · curious
Panel 3 · Drive: VLA drives to cup; (silent, focused); state: DRIVING · focused
Panel 4 · Grasp: closes gripper; (concentrating); state: GRASPING · determined
Panel 5 · Report: lifts + returns; "here you go!"; state: REPORTING · happy

Run A — the clean path: command → think/ack → drive → grasp → happy report. Face emotion tracks the orchestrator state at every step.
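Run A can be written down as a linear state machine with a face emotion attached to each state. The sketch below uses the state names and emotions from the storyboard; the `State` enum, the `NEXT` table, and the "neutral" emotion for IDLE are assumptions made for illustration, not the course's orchestrator.

```python
from enum import Enum
from typing import Dict, List, Tuple


class State(Enum):
    IDLE = "IDLE"
    LISTENING = "LISTENING"
    THINKING = "THINKING"
    DRIVING = "DRIVING"
    GRASPING = "GRASPING"
    REPORTING = "REPORTING"


# Emotions from the storyboard; "neutral" for IDLE is an assumption.
EMOTION: Dict[State, str] = {
    State.IDLE: "neutral",
    State.LISTENING: "attentive",
    State.THINKING: "curious",
    State.DRIVING: "focused",
    State.GRASPING: "determined",
    State.REPORTING: "happy",
}

# Clean-path transitions only; error and interrupt edges are left out.
NEXT: Dict[State, State] = {
    State.IDLE: State.LISTENING,
    State.LISTENING: State.THINKING,
    State.THINKING: State.DRIVING,
    State.DRIVING: State.GRASPING,
    State.GRASPING: State.REPORTING,
    State.REPORTING: State.IDLE,
}


def run_a() -> List[Tuple[str, str]]:
    """Walk the clean path once and return (state, emotion) pairs in order."""
    trace: List[Tuple[str, str]] = []
    state = State.IDLE
    while True:
        state = NEXT[state]
        if state is State.IDLE:
            break
        trace.append((state.value, EMOTION[state]))
    return trace
```

Deriving the face emotion from the state, rather than setting it separately, is one way to keep the two from drifting apart on camera.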

Run B · 1 · Working: mid-drive to cup; (driving); state: DRIVING
Run B · 2 · INTERRUPT: human talks over it; "wait — the blue one!"; barge-in detected · surprised
Run B · 3 · Re-plan: stops, updates intent; "oh — the blue one, got it!"; state: THINKING · re-plans
Run B · 4 · Resume: drives to blue cup; "here's the blue one!"; state: DRIVING → REPORTING · happy

Run B — the money shot: interrupt mid-task. Full-duplex barge-in lets the human override; the creature reacts (surprised → re-plan → resume) instead of finishing the wrong command.
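The interrupt path in Run B amounts to three steps: stop motion, swap the intent, and re-enter THINKING. The sketch below is a hypothetical model of that handling; the `Creature` class and its fields are invented for illustration, and allowing interrupts during GRASPING as well as DRIVING is an assumption (the storyboard only shows an interrupt mid-drive).

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: both motion states can be interrupted.
INTERRUPTIBLE = {"DRIVING", "GRASPING"}


@dataclass
class Creature:
    state: str = "IDLE"
    target: str = ""
    motion_active: bool = False
    face_log: List[str] = field(default_factory=list)

    def start_task(self, target: str) -> None:
        """Begin driving toward the requested target."""
        self.target = target
        self.state = "DRIVING"
        self.motion_active = True
        self.face_log.append("focused")

    def barge_in(self, new_target: str) -> bool:
        """Handle speech arriving mid-task; return True if the task was interrupted."""
        if self.state not in INTERRUPTIBLE:
            return False
        self.motion_active = False  # stop moving before anything else
        self.face_log.append("surprised")
        self.target = new_target  # replace the old intent
        self.state = "THINKING"
        return True

    def resume(self) -> None:
        """Leave THINKING and drive toward the updated target."""
        if self.state != "THINKING":
            raise RuntimeError(f"cannot resume from {self.state}")
        self.state = "DRIVING"
        self.motion_active = True
        self.face_log.append("focused")
```

Stopping motion first means the robot is never still driving toward the wrong object while the new intent is being parsed.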

E · Risk & Hardware-Gremlins Register

The failures you will actually hit, and what to do about them. Keep this open during rehearsal.

Likely failure	Symptom	Mitigation
Calibration drift	Arm misses where it "thinks" the object is; success decays over a session	Re-run calibration daily and before the demo; log calibration timestamp in RUNBOOK.md
Jerky / unsafe motion	Robot lurches between actions	Usually wrong `n_action_steps` / chunk size — tune action-step execution to smooth the chunk boundary
Moshi VRAM	OOM on load or mid-conversation crash	Check VRAM headroom; don't co-locate with the VLA on the same GPU; use the right model size
WebSocket races	Face freezes, intent arrives out of order, state machine stalls	Strict startup order (see runbook); add reconnection + sequence numbers on the bus
Grasp false positives	Reports "got it" with an empty gripper	Confirm grasp with current/force or a vision check before transitioning to REPORT
Network latency on the tether	Laggy teleop / control; conversation feels slow	Wired link Pi↔GPU where possible; measure round-trip; keep policy_server close to the host
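The "sequence numbers on the bus" mitigation for WebSocket races can be as small as a per-sender counter check on the receiving side. The sketch below is an assumption-laden illustration: `SequenceGuard` is a hypothetical helper, and it drops late or duplicate messages rather than reordering them, which suits state updates where only the newest value matters.

```python
from typing import Dict


class SequenceGuard:
    """Accept a message only if its sequence number is newer than the last one seen."""

    def __init__(self) -> None:
        self._last_seen: Dict[str, int] = {}

    def accept(self, sender: str, seq: int) -> bool:
        """Return True for fresh messages; False for duplicates or stale arrivals."""
        last = self._last_seen.get(sender, -1)
        if seq <= last:
            return False
        self._last_seen[sender] = seq
        return True

    def reset(self, sender: str) -> None:
        """Forget a sender's counter, e.g. after it reconnects and restarts at 0."""
        self._last_seen.pop(sender, None)
```

Calling `reset` on reconnection matters: without it, a restarted face client that begins counting from zero again would have every message rejected as stale.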

F · Runbook & Operating Notes

Keep a living `RUNBOOK.md`

Exact reproduce steps — anyone can bring the demo up from cold
Logged metrics: success %, reliability, latency, dates
Calibration timestamps + which checkpoint is "the demo one"
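One low-friction way to keep those numbers honest is to append a record per run to a log file and compute the summary from the file, not from memory. This is a minimal sketch, assuming a JSON Lines file; `log_run`, `summarize`, and the field names are hypothetical.

```python
import json
import statistics
import time
from pathlib import Path
from typing import Any, Dict


def log_run(
    path: str,
    *,
    success: bool,
    latency_ms: float,
    checkpoint: str,
    calibrated_at: str,
    notes: str = "",
) -> Dict[str, Any]:
    """Append one run record to a JSON Lines file and return it."""
    record = {
        "logged_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "success": success,
        "latency_ms": latency_ms,
        "checkpoint": checkpoint,        # which checkpoint is "the demo one"
        "calibrated_at": calibrated_at,  # timestamp of the last calibration
        "notes": notes,
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def summarize(path: str) -> Dict[str, Any]:
    """Compute run count, success rate, and median latency over every logged run."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    rows = [json.loads(line) for line in lines if line.strip()]
    if not rows:
        return {"runs": 0, "success_rate": None, "median_latency_ms": None}
    return {
        "runs": len(rows),
        "success_rate": sum(r["success"] for r in rows) / len(rows),
        "median_latency_ms": statistics.median(r["latency_ms"] for r in rows),
    }
```

Because the file is append-only, failed takes stay in the record, which is what makes the reported success rate trustworthy.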

Operating discipline

Run long jobs in tmux so a dropped SSH doesn't kill them
Log metrics every run — never trust a single take
Verify lerobot / moshi / pipecat --help before running — these tools are version-sensitive

Ask before you burn time or bandwidth. Multi-hour GPU runs and big downloads are expensive and slow. Confirm before kicking one off: a wrong flag discovered three hours in is a lost day on a 14-day clock.

The through-line. Week 1 makes each capability hit its bar (Gate D7). Week 2 fuses them into a creature that talks, drives, and grasps: 60–70% reliable, rehearsed, recorded, and honestly written up by Ship D14.
