Page 7 · Plan & Demo

The 14-Day Sprint

A day-by-day roadmap to bring ANIMA-Kiwi to life: build each capability in week one, then fuse them into one creature in week two. It comes with the gates that define "done," the ETH study spine that grounds the theory, and a rehearsed demo runbook so the live show actually works.

Legend: Track 1 (Body / VLA) · Track 2 (Voice / Face) · Track 3 (Creature / orchestration) · Theory · Gate · Ship

A · The 14-Day Sprint

You are at Day 0. Hardware is on the bench, repos are cloned. The strip below is your map; the cards under it are your daily marching orders. Each card is colour-coded by its dominant track and links to the page that teaches the how-to.

Sprint strip: Day 0 (you are here) → Day 14 (ship).
Week 1 · build each capability: D1 install·teleop | D2 record·Moshi | D3 finetune·face | D4 deploy·lipsync | D5 barge-in | D6 policy·intent | D7 GATE
Week 2 · fuse & ship: D8 orchestrator | D9 navigation | D10 close loop | D11 harden | D12 HRI polish | D13 rehearse | D14 SHIP
Axis colour shows the dominant track per stretch; ◆ D7 is the hard gate, ● D14 is the ship line.

Week 1 — build each capability

Day 1 · T1 T2

Install, calibrate, teleop

  • LeRobot installed; LeKiwi calibrated; clean teleop loop
  • Stand up the Pipecat cascade (STT→LLM→TTS) baseline
→ VLAs & Body
Day 2 · T1 T2

Record demos · run Moshi

  • Record ≥50 demos — include driving up to the object, not just the grasp
  • Get Moshi running locally (full-duplex audio)
→ VLAs & Body
Day 3 · T1 T2

Fine-tune SmolVLA · design face

  • Fine-tune SmolVLA on your dataset (GPU)
  • Design the expressive face: emotion set + visemes
→ VLAs & Body
Day 4 · T1 T2

Deploy + eval VLA · face v1

  • Deploy policy_server; run eval rollouts; baseline success %
  • Face v1 on the phone with lip-sync to TTS
→ VLAs & Body
Day 5 · Theory T1 T2

World models · push VLA · barge-in

  • Read the world-models material (Lecture 8 spine)
  • Push VLA success up; clean up failure modes
  • Wire Moshi → face with barge-in (interruptible)
→ World Models
Day 6 · T1 T2 T3

Pick demo policy · wrap voice→intent

  • Choose the policy you will demo; freeze its checkpoint
  • Wrap voice → structured intent (the bridge to T3)
→ Interaction
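
The Day 6 voice → intent wrapper can be pictured with a minimal sketch. Everything named here (the `Intent` fields, the action vocabulary, the keyword rules) is an illustrative assumption, not the course's schema; a real wrapper would more likely ask the LLM for structured output. The point is only that the bridge to T3 is a small, typed object rather than free text.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical vocabulary for illustration; the real set comes from your demo policy.
ACTIONS = {"grab": "pick", "pick up": "pick", "get": "pick", "stop": "stop"}
COLOURS = ("red", "blue", "green", "yellow")
OBJECTS = ("cup", "block")


@dataclass(frozen=True)
class Intent:
    action: str             # normalized action, e.g. "pick" or "stop"
    colour: Optional[str]   # None when the utterance names no colour
    obj: Optional[str]      # None when the utterance names no object

    def to_prompt(self) -> str:
        """Render the intent as a language prompt for the policy."""
        target = " ".join(p for p in (self.colour, self.obj) if p)
        return f"{self.action} the {target}" if target else self.action


def _first_match(words, text):
    """Return the first word that appears in text as a whole word, else None."""
    return next((w for w in words if re.search(rf"\b{re.escape(w)}\b", text)), None)


def parse_intent(utterance: str) -> Optional[Intent]:
    """Map a transcript to an Intent, or None if no known action is found."""
    text = utterance.lower()
    key = _first_match(ACTIONS, text)
    if key is None:
        return None
    return Intent(ACTIONS[key], _first_match(COLOURS, text), _first_match(OBJECTS, text))
```

Returning `None` for anything unrecognized keeps small talk from triggering motion; the orchestrator only ever sees utterances that parsed into a known action.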
Gate
Day 7 · T1 T2

GATE — week-1 capability bar

  • VLA: ≥ 60% success on language-prompted drive-and-pick
  • Expressive duplex conversation working end-to-end
  • Buffer day for hardware gremlins — do not skip it
→ Interaction
The gate is the whole point: if you have not cleared the gate by the end of Day 7, do not start Week 2. Fusing two unreliable subsystems gives you one very unreliable creature. Spend the buffer on whichever track is below bar.

Week 2 — fuse & ship

Day 8 · T3

Orchestrator + message bus

  • Stand up the orchestrator state machine
  • Message bus wiring all subsystems together
→ The Creature
Day 9 · T3

Navigation tier A/B

  • Pick the navigation tier: A (VLA drives) or B (explicit nav)
  • Get "drive to vicinity of object" reliable
→ The Creature
Day 10 · T3

Close the loop end-to-end

  • Spoken command → ack → drive → grasp → report
  • First full run, however shaky
→ The Creature
Day 11 · T3

Reliability hardening

  • Kill flaky transitions, add timeouts + retries
  • Measure reliability over repeated trials
→ The Creature
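
As a minimal sketch of "timeouts + retries" for a flaky transition: the helper below re-issues a step when it does not complete in time. The names `run_step`, `start` and `done` are hypothetical, and the injectable `clock` / `sleep` exist only so the logic can be tested without waiting.

```python
import time
from typing import Callable


def run_step(
    start: Callable[[], None],
    done: Callable[[], bool],
    attempts: int = 3,
    timeout_s: float = 10.0,
    poll_s: float = 0.05,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Trigger a step and wait for it to finish, retrying on timeout.

    Returns the attempt number that succeeded, or 0 if every attempt timed out.
    """
    for attempt in range(1, attempts + 1):
        start()                          # (re)issue the command for this transition
        deadline = clock() + timeout_s
        while clock() < deadline:
            if done():                   # e.g. "robot reports arrival"
                return attempt
            sleep(poll_s)
    return 0
```

Logging the returned attempt number per run gives you the reliability measurement from the same card: a step that usually needs attempt 2 or 3 is a flaky transition worth fixing rather than retrying.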
Day 12 · T2 T3

HRI polish

  • Gaze, backchannels ("mm-hm"), affect transitions
  • Make the creature feel alive, not robotic
→ Interaction
Day 13 · T3

Rehearse + record

  • Run the demo script start-to-finish, many times
  • Record clean takes; capture metrics overlay
→ The Creature
Ship
Day 14 · T3

SHIP — release + retro

  • Publish the demo video + RUNBOOK.md
  • Write-up: what worked, what didn't, metrics
  • Retro: the honest postmortem
→ The Creature

B · The Gates & "Done-When" Ladder

Three tracks, three completion bars. A track is not "done" because the code runs — it is done when it hits its bar on repeated, honest trials. No cherry-picking the one good take.

Track 1 · Body

Done when…

  • Language-prompted drive-and-pick ≥ 60% over 10 trials
  • Prompt varies ("grab the red cup", "pick up the block")
  • Includes the drive-up phase, not just a pre-positioned grasp
  • Failures logged with cause, not hidden
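
The Track 1 bar above can be checked mechanically. This is a sketch under stated assumptions: the `Trial` record and `summarize` helper are hypothetical names, and the 60% / 10-trial thresholds are simply the ones listed in the bullets.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Trial:
    prompt: str                          # the language prompt used for this trial
    success: bool
    failure_cause: Optional[str] = None  # must be set whenever success is False


def summarize(trials: List[Trial], bar: float = 0.60, min_trials: int = 10) -> Dict[str, Any]:
    """Score a batch of trials against the Track 1 bar; refuse unlogged failures."""
    for t in trials:
        if not t.success and not t.failure_cause:
            raise ValueError(f"failed trial without a logged cause: {t.prompt!r}")
    n = len(trials)
    rate = sum(t.success for t in trials) / n if n else 0.0
    return {
        "trials": n,
        "success_rate": rate,
        "distinct_prompts": len({t.prompt for t in trials}),
        "failure_causes": Counter(t.failure_cause for t in trials if not t.success),
        "passes_bar": n >= min_trials and rate >= bar,
    }
```

Raising on a failure with no cause enforces the last bullet: a miss cannot be recorded without saying why.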
Track 2 · Voice

Done when…

  • Phone holds a conversation whose response latency feels under ~500 ms
  • Interruptible — you can talk over it (barge-in)
  • Lip-syncing face that moves with the speech
  • Affect: face emotion matches conversational state
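
One hedged way to put a number behind the latency bullet is to log, for each turn, when the user stopped speaking and when the creature started, then look at the median and the tail. The timestamp source is an assumption (your own logging, in milliseconds); nothing here is a Moshi or Pipecat API.

```python
import math
from typing import List, Sequence, Tuple


def response_gaps_ms(turns: Sequence[Tuple[int, int]]) -> List[int]:
    """turns: (user_speech_end_ms, bot_speech_start_ms) pairs from your own logs."""
    return [bot_start - user_end for user_end, bot_start in turns]


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence, with pct in (0, 100]."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Reporting both `percentile(gaps, 50)` and `percentile(gaps, 95)` matters because a conversation can feel slow on the strength of a few long pauses even when the median is fine.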
Track 3 · Creature

Done when…

  • End-to-end: spoken command → expressive ack → drive → grasp → expressive report
  • 60–70% reliable over repeated runs
  • Recovers or fails gracefully on the other 30–40%
Honest scope: This is 14-day operational mastery, not a research result. 60–70% end-to-end reliability with a creature that talks and grasps is a genuinely hard, genuinely impressive bar. Resist the urge to claim more than the metrics show.
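
To see how much a small batch of runs can actually support, a confidence interval helps. The sketch below uses the Wilson score interval, a standard choice for small-sample proportions; it is chosen here for illustration, and the sprint itself does not prescribe a method.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z = 1.96 gives about 95% coverage)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

For example, 7 successes in 10 runs gives an interval of roughly 40% to 89%, so "70% reliable" from ten runs is a point estimate with wide error bars. More runs narrow the interval; a better take does not.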

C · The ETH Study Spine

The theory backbone is Robot Learning: From Fundamentals to Foundation Models (Oier Mees, ETH Zürich, Spring 2026). Twelve lectures, each mapped to where it lands in your build and which course page derives it.

cvg.ethz.ch/lectures/Robot-Learning/ · HW repo github.com/mees-robot-learning-course/ethz-course-2026

| Wk | Lecture | Core papers | Maps to (track) | Course page |
|---|---|---|---|---|
| 1 | Intro to Robot Learning | | framing | Home / Foundations |
| 2 | Robot Control & MDPs | Mania (random search RL); Irpan "Deep RL Doesn't Work Yet"; Pathak (curiosity) | T1 / RL | Foundations |
| 3 | Imitation Learning | de Haan "Causal Confusion"; Pari (representation for visual imitation); Zeng "Transporter Networks" | T1 core | Foundations |
| 4 | Reinforcement Learning I | Salimans (evolution strategies); Zeng (pushing + grasping); Luo "HIL-SERL" | T1 / past-the-ceiling | Foundations / Action Models |
| 5 | Reinforcement Learning II | Levine (end-to-end visuomotor); Ma "Eureka"; Rosete-Beas (latent plans) | T1 | Action Models |
| 6 | Generative Models | Janner / Du "Diffuser"; Florence "Implicit BC"; Wagenmaker (steering diffusion policy w/ latent RL) | T1 diffusion policy | Action Models |
| 7 | Sequence Modeling & Transformers | Chen "Decision Transformer"; Zhao "ALOHA / ACT"; Radosavovic (humanoid as next-token) | T1 ACT / chunking | Action Models |
| 8 | World Models | Du "UniPi"; Hafner "Training Agents Inside Scalable World Models"; Ye "DreamZero" | T1 world models | World Models |
| 9 | Generalist Robot Policies | Lynch (play / LMP); Reed "Gato"; Physical Intelligence "π*0.6" | T1 VLAs / SmolVLA | VLAs & Body |
| 10 | Embodied Reasoning & Test-time Scaling | Fu (in-context imitation); Wang "Voyager"; Chen (efficient embodied reasoning) | T1 / T3 | The Creature |
| 11 | Frontier & Open Problems | LeCun "Autonomous Machine Intelligence"; Sutton "Bitter Lesson"; Brooks "Intelligence without Representation" | all | The Creature |
| 12 | Guest lectures (Fox, Abbeel) | | all | |
Guest spotlights worth following: Gupta · Danfei Xu · Aviral Kumar · Wagenmaker · Cheng Chi (diffusion policy / UMI) · Ted Xiao · Scott Reed (world models) · Quan Vuong (Physical Intelligence) · Archit Sharma · Lucas Beyer. Their talks and papers are the live edge of everything in the table above.

D · The Demo Runbook demo/

Startup sequence — bring it up in this order

  1. Pi · LeKiwi host — power the robot, start the host process on the Raspberry Pi. Confirm motors respond. T1
  2. GPU · policy_server — launch the VLA policy server on the GPU box. Confirm it loads your frozen Day-6 checkpoint. T1
  3. Moshi + intent watcher — start full-duplex audio and the voice→intent watcher. Confirm a test utterance produces an intent. T2
  4. Face on phone — open the face client; confirm it connects and lip-syncs to TTS. T2
  5. Orchestrator — start it last; it is the conductor and assumes the others are already up. Watch its state log go to IDLE. T3
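
The startup order above can be enforced rather than remembered. A minimal sketch, assuming each stage exposes some yes/no health check: the `bring_up` helper and the check callables are hypothetical placeholders, and real checks (pinging the Pi host, asking the policy server which checkpoint it loaded, and so on) depend on your setup.

```python
from typing import Callable, List, Tuple

Check = Tuple[str, Callable[[], bool]]  # (stage name, health check returning True when up)


def bring_up(checks: List[Check]) -> List[str]:
    """Run health checks in startup order; stop at the first stage that is not up."""
    confirmed: List[str] = []
    for name, is_up in checks:
        if not is_up():
            raise RuntimeError(
                f"{name} is not up; fix it before starting later stages "
                f"(confirmed so far: {confirmed})"
            )
        confirmed.append(name)
    return confirmed
```

Passing the five stages in list order (Pi host, policy_server, Moshi + intent watcher, face, orchestrator) means the orchestrator check is never reached unless everything it depends on has already been confirmed.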
Capture setup: Screen = the face (phone mirrored / screen-recorded). Room cam = the robot driving + grasping. On-screen metrics = current state, intent, success flag, latency. Frame both so the face reaction and the physical action are visible together; that synchrony is what sells the creature.

The rehearsed demo script — storyboard

Run A: a clean command-to-completion. Run B: an interrupt mid-task to prove full-duplex barge-in. Each panel ties a creature state to a face emotion and what it says.

Legend: face emotion / speech · orchestrator state · interrupt (barge-in)

  • Panel 1 · Listen: human speaks, "grab the red cup" · state: LISTENING · attentive
  • Panel 2 · Think + ack: parses intent, says "on it!" · state: THINKING · curious
  • Panel 3 · Drive: VLA drives to cup, silent (focused) · state: DRIVING · focused
  • Panel 4 · Grasp: closes gripper, silent (concentrating) · state: GRASPING · determined
  • Panel 5 · Report: lifts + returns, says "here you go!" · state: REPORTING · happy

Run A, the clean path: command → think/ack → drive → grasp → happy report. Face emotion tracks the orchestrator state at every step.

  • Run B · 1 · Working: mid-drive to cup, silent (driving) · state: DRIVING
  • Run B · 2 · INTERRUPT: human talks over it, "wait, the blue one!" · barge-in detected · surprised
  • Run B · 3 · Re-plan: stops, updates intent, says "oh, the blue one, got it!" · state: THINKING · re-plans
  • Run B · 4 · Resume: drives to blue cup, says "here's the blue one!" · state: DRIVING→REPORT · happy

Run B, the money shot: interrupt mid-task. Full-duplex barge-in lets the human override; the creature reacts (surprised → re-plan → resume) instead of finishing the wrong command.
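
Both storyboards can be read as one small state machine. The sketch below is an illustration, not the course's orchestrator: the state names and face emotions come from the panels, while the event names, the "neutral" face for IDLE, and the choice of which states are interruptible are assumptions.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    DRIVING = auto()
    GRASPING = auto()
    REPORTING = auto()


# Event names are hypothetical; they stand in for messages on your bus.
TRANSITIONS = {
    (State.IDLE, "speech_started"): State.LISTENING,
    (State.LISTENING, "utterance_ended"): State.THINKING,
    (State.THINKING, "intent_ready"): State.DRIVING,
    (State.DRIVING, "arrived"): State.GRASPING,
    (State.GRASPING, "grasp_confirmed"): State.REPORTING,
    (State.REPORTING, "report_done"): State.IDLE,
}

# Assumption: barge-in only matters while the body is acting on an intent.
INTERRUPTIBLE = {State.DRIVING, State.GRASPING}

# Face emotion per state, as in the panels ("neutral" for IDLE is an assumption).
FACE = {
    State.IDLE: "neutral",
    State.LISTENING: "attentive",
    State.THINKING: "curious",
    State.DRIVING: "focused",
    State.GRASPING: "determined",
    State.REPORTING: "happy",
}


class Orchestrator:
    def __init__(self) -> None:
        self.state = State.IDLE
        self.log = [State.IDLE]          # state history, useful for the metrics overlay

    def handle(self, event: str) -> State:
        """Apply one event; unknown or out-of-place events leave the state unchanged."""
        if event == "barge_in" and self.state in INTERRUPTIBLE:
            nxt = State.THINKING         # stop acting and re-plan with the updated intent
        else:
            nxt = TRANSITIONS.get((self.state, event), self.state)
        if nxt is not self.state:
            self.state = nxt
            self.log.append(nxt)
        return self.state
```

Driving the face from `FACE[orchestrator.state]` is what keeps emotion and action in step. Ignoring events that do not fit the current state is a deliberate choice here: a late or duplicated message then cannot push the creature into a state it should not be in.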

E · Risk & Hardware-Gremlins Register

The failures you will actually hit, and what to do about them. Keep this open during rehearsal.

| Likely failure | Symptom | Mitigation |
|---|---|---|
| Calibration drift | Arm misses where it "thinks" the object is; success decays over a session | Re-run calibration daily and before the demo; log calibration timestamp in RUNBOOK.md |
| Jerky / unsafe motion | Robot lurches between actions | Usually wrong n_action_steps / chunk size; tune action-step execution to smooth the chunk boundary |
| Moshi VRAM | OOM on load or mid-conversation crash | Check VRAM headroom; don't co-locate with the VLA on the same GPU; use the right model size |
| WebSocket races | Face freezes, intent arrives out of order, state machine stalls | Strict startup order (see runbook); add reconnection + sequence numbers on the bus |
| Grasp false positives | Reports "got it" with an empty gripper | Confirm grasp with current/force or a vision check before transitioning to REPORT |
| Network latency on the tether | Laggy teleop / control; conversation feels slow | Wired link Pi↔GPU where possible; measure round-trip; keep policy_server close to the host |
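
For the WebSocket-races row, "sequence numbers on the bus" can be as small as the sketch below. The `SequencedInbox` class is a hypothetical illustration; it assumes each sender stamps its messages with an increasing integer and that dropping a late message is acceptable.

```python
from typing import Dict


class SequencedInbox:
    """Accept a message only if its sequence number is newer than the last one seen
    from the same sender, so stale or duplicated messages cannot reorder state."""

    def __init__(self) -> None:
        self._last_seen: Dict[str, int] = {}

    def accept(self, sender: str, seq: int) -> bool:
        if seq <= self._last_seen.get(sender, -1):
            return False                 # stale or duplicate: drop it
        self._last_seen[sender] = seq
        return True

    def reset(self, sender: str) -> None:
        """Call on reconnect if the sender restarts its counter from zero."""
        self._last_seen.pop(sender, None)
```

The `reset` hook matters for the reconnection half of the mitigation: without it, a subsystem that reconnects and restarts its counter would have every message silently dropped.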

F · Runbook & Operating Notes

Keep a living RUNBOOK.md

  • Exact reproduce steps — anyone can bring the demo up from cold
  • Logged metrics: success %, reliability, latency, dates
  • Calibration timestamps + which checkpoint is "the demo one"
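
A small helper keeps the logged metrics in RUNBOOK.md uniform from run to run. This is a sketch only: the `runbook_row` function and its column set are assumptions drawn from the bullets above, not a prescribed format.

```python
from datetime import date


def runbook_row(
    run_date: date,
    checkpoint: str,
    successes: int,
    trials: int,
    median_latency_ms: float,
    calibrated_on: date,
) -> str:
    """Format one run as a pipe-delimited row: date, checkpoint, success, latency, calibration."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    rate = 100 * successes / trials
    return (
        f"| {run_date.isoformat()} | {checkpoint} | {successes}/{trials} ({rate:.0f}%) "
        f"| {median_latency_ms:.0f} ms | {calibrated_on.isoformat()} |"
    )
```

Writing the raw count (`7/10`) next to the percentage keeps the sample size visible, which makes a single lucky take harder to mistake for a result.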

Operating discipline

  • Run long jobs in tmux so a dropped SSH doesn't kill them
  • Log metrics every run — never trust a single take
  • Check the lerobot / moshi / pipecat commands with --help before running; these tools are version-sensitive
Ask before you burn time or bandwidth: multi-hour GPU runs and big downloads are expensive and slow. Confirm before kicking one off. A wrong flag discovered three hours in costs a full day on a 14-day clock.
The through-line: Week 1 makes each capability hit its bar (Gate D7). Week 2 fuses them into a creature that talks, drives, and grasps at 60–70% reliability, and that is rehearsed, recorded, and honestly written up by Ship D14.