A · The 14-Day Sprint
You are at Day 0. Hardware is on the bench, repos are cloned. The strip below is your map; the cards under it are your daily marching orders. Each card is colour-coded by its dominant track and links to the page that teaches the how-to.
Week 1 — build each capability
Install, calibrate, teleop
- LeRobot installed; LeKiwi calibrated; clean teleop loop
- Stand up the Pipecat cascade (STT→LLM→TTS) baseline
Record demos · run Moshi
- Record ≥50 demos — include driving up to the object, not just the grasp
- Get Moshi running locally (full-duplex audio)
Fine-tune SmolVLA · design face
- Fine-tune SmolVLA on your dataset (GPU)
- Design the expressive face: emotion set + visemes
Deploy + eval VLA · face v1
- Deploy policy_server; run eval rollouts; baseline success %
- Face v1 on the phone with lip-sync to TTS
World models · push VLA · barge-in
- Read the world-models material (Lecture 8 spine)
- Push VLA success up; clean up failure modes
- Wire Moshi → face with barge-in (interruptible)
Pick demo policy · wrap voice→intent
- Choose the policy you will demo; freeze its checkpoint
- Wrap voice → structured intent (the bridge to T3)
GATE — week-1 capability bar
- VLA ≥ 60% language-prompted drive-and-pick
- Expressive duplex conversation working end-to-end
- Buffer day for hardware gremlins — do not skip it
Week 2 — fuse & ship
Orchestrator + message bus
- Stand up the orchestrator state machine
- Message bus wiring all subsystems together
Navigation tier A/B
- Pick the navigation tier: A (VLA drives) or B (explicit nav)
- Get "drive to vicinity of object" reliable
Close the loop end-to-end
- Spoken command → ack → drive → grasp → report
- First full run, however shaky
Reliability hardening
- Kill flaky transitions, add timeouts + retries
- Measure reliability over repeated trials
HRI polish
- Gaze, backchannels ("mm-hm"), affect transitions
- Make the creature feel alive, not robotic
Rehearse + record
- Run the demo script start-to-finish, many times
- Record clean takes; capture metrics overlay
SHIP — release + retro
- Publish the demo video + RUNBOOK.md
- Write-up: what worked, what didn't, metrics
- Retro: the honest postmortem
B · The Gates & "Done-When" Ladder
Three tracks, three completion bars. A track is not "done" because the code runs — it is done when it hits its bar on repeated, honest trials. No cherry-picking the one good take.
Done when…
- Language-prompted drive-and-pick ≥ 60% over 10 trials
- Prompt varies ("grab the red cup", "pick up the block")
- Includes the drive-up phase, not just a pre-positioned grasp
- Failures logged with cause, not hidden
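Ten trials is a small sample, so it helps to report the uncertainty next to the headline number. The sketch below is one way to do that, not a prescribed method: it computes the success rate and an approximate 95% Wilson score interval. The trial list is made-up illustration data and the function names are hypothetical.

```python
import math

def success_rate(outcomes):
    """Fraction of successful trials (True = success)."""
    if not outcomes:
        raise ValueError("need at least one trial")
    return sum(outcomes) / len(outcomes)

def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half

# Illustration only: 6 successes in 10 trials, every trial counted.
trials = [True, True, False, True, False, True, True, False, True, False]
rate = success_rate(trials)
low, high = wilson_interval(sum(trials), len(trials))
print(f"success {rate:.0%}, 95% interval {low:.0%} to {high:.0%}")
```

With only 10 trials the interval is wide, which is a good argument for running more trials before you call the bar met.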
Done when…
- Phone holds a conversation that feels near-instant (roughly sub-500 ms response)
- Interruptible — you can talk over it (barge-in)
- Lip-syncing face that moves with the speech
- Affect: face emotion matches conversational state
Done when…
- End-to-end: spoken command → expressive ack → drive → grasp → expressive report
- 60–70% reliable over repeated runs
- Recovers or fails gracefully on the other 30–40%
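The end-to-end flow above can be sketched as a small table-driven state machine. This is a minimal illustration under assumptions, not the course's orchestrator: IDLE and REPORT are named elsewhere on this page, while the other state names, the FAILED state, and all event names are hypothetical.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    ACK = auto()
    DRIVE = auto()
    GRASP = auto()
    REPORT = auto()
    FAILED = auto()  # hypothetical state for the graceful-failure path

# (current state, event) -> next state. Event names are assumptions.
TRANSITIONS = {
    (State.IDLE, "intent"): State.ACK,
    (State.ACK, "ack_done"): State.DRIVE,
    (State.DRIVE, "arrived"): State.GRASP,
    (State.GRASP, "grasp_ok"): State.REPORT,
    (State.REPORT, "report_done"): State.IDLE,
    (State.FAILED, "report_done"): State.IDLE,
}

class Orchestrator:
    def __init__(self):
        self.state = State.IDLE
        self.log = [State.IDLE]  # state history, useful for debugging stalls

    def handle(self, event):
        """Apply one event; unexpected events leave the state unchanged."""
        busy = self.state not in (State.IDLE, State.FAILED)
        if event in ("timeout", "error") and busy:
            nxt = State.FAILED
        else:
            nxt = TRANSITIONS.get((self.state, event), self.state)
        if nxt is not self.state:
            self.state = nxt
            self.log.append(nxt)
        return self.state
```

Keeping every transition in one table makes it easy to see which events each state accepts, and the FAILED path gives the run a defined way to end with a spoken failure report instead of a stall.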
C · The ETH Study Spine
The theory backbone is Robot Learning: From Fundamentals to Foundation Models (Oier Mees, ETH Zürich, Spring 2026). Twelve lectures, each mapped to where it lands in your build and which course page derives it.
cvg.ethz.ch/lectures/Robot-Learning/ · HW repo github.com/mees-robot-learning-course/ethz-course-2026
| Wk | Lecture | Core papers | Maps to (track) | Course page |
|---|---|---|---|---|
| 1 | Intro to Robot Learning | — | framing | Home / Foundations |
| 2 | Robot Control & MDPs | Mania (random search RL); Irpan "Deep RL Doesn't Work Yet"; Pathak (curiosity) | T1 / RL | Foundations |
| 3 | Imitation Learning | de Haan "Causal Confusion"; Pari (representation for visual imitation); Zeng "Transporter Networks" | T1 core | Foundations |
| 4 | Reinforcement Learning I | Salimans (evolution strategies); Zeng (pushing + grasping); Luo "HIL-SERL" | T1 / past-the-ceiling | Foundations / Action Models |
| 5 | Reinforcement Learning II | Levine (end-to-end visuomotor); Ma "Eureka"; Rosete-Beas (latent plans) | T1 | Action Models |
| 6 | Generative Models | Janner / Du "Diffuser"; Florence "Implicit BC"; Wagenmaker (steering diffusion policy w/ latent RL) | T1 diffusion policy | Action Models |
| 7 | Sequence Modeling & Transformers | Chen "Decision Transformer"; Zhao "ALOHA / ACT"; Radosavovic (humanoid as next-token) | T1 ACT / chunking | Action Models |
| 8 | World Models | Du "UniPi"; Hafner "Training Agents Inside Scalable World Models"; Ye "DreamZero" | T1 world models | World Models |
| 9 | Generalist Robot Policies | Lynch (play / LMP); Reed "Gato"; Physical Intelligence "π*0.6" | T1 VLAs / SmolVLA | VLAs & Body |
| 10 | Embodied Reasoning & Test-time Scaling | Fu (in-context imitation); Wang "Voyager"; Chen (efficient embodied reasoning) | T1 / T3 | The Creature |
| 11 | Frontier & Open Problems | LeCun "Autonomous Machine Intelligence"; Sutton "Bitter Lesson"; Brooks "Intelligence without Representation" | all | The Creature |
| 12 | Guest lectures (Fox, Abbeel) | — | all | — |
D · The Demo Runbook demo/
Startup sequence — bring it up in this order
- Pi · LeKiwi host — power the robot, start the host process on the Raspberry Pi. Confirm motors respond. T1
- GPU · policy_server — launch the VLA policy server on the GPU box. Confirm it loads your frozen Day-6 checkpoint. T1
- Moshi + intent watcher — start full-duplex audio and the voice→intent watcher. Confirm a test utterance produces an intent. T2
- Face on phone — open the face client; confirm it connects and lip-syncs to TTS. T2
- Orchestrator — start it last; it is the conductor and assumes the others are already up. Watch its state log go to IDLE. T3
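Before starting the orchestrator, you can check that the other four services are reachable, in the same order. The sketch below assumes each service listens on a TCP port; the service names, hosts, and ports are placeholders, so substitute whatever your setup actually uses.

```python
import socket
import time

# Placeholder endpoints: names, hosts, and ports are assumptions, not real defaults.
SERVICES = [
    ("lekiwi_host", "127.0.0.1", 5555),
    ("policy_server", "127.0.0.1", 8080),
    ("moshi_intent", "127.0.0.1", 8998),
    ("face_client", "127.0.0.1", 8765),
]

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def wait_in_order(services, probe=port_open, retries=5, delay=1.0):
    """Check services strictly in list order.

    Returns the name of the first service that never came up,
    or None if every service answered.
    """
    for name, host, port in services:
        for _ in range(retries):
            if probe(host, port):
                break
            time.sleep(delay)
        else:
            return name
    return None
```

Call `wait_in_order(SERVICES)` and start the orchestrator only if it returns `None`. A reachable port shows the process is listening, not that it is healthy, so still run the per-step confirmations listed above.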
The rehearsed demo script — storyboard
Run A: a clean command-to-completion. Run B: an interrupt mid-task to prove full-duplex barge-in. Each panel ties a creature state to a face emotion and what it says.
E · Risk & Hardware-Gremlins Register
The failures you will actually hit, and what to do about them. Keep this open during rehearsal.
| Likely failure | Symptom | Mitigation |
|---|---|---|
| Calibration drift | Arm misses where it "thinks" the object is; success decays over a session | Re-run calibration daily and before the demo; log calibration timestamp in RUNBOOK.md |
| Jerky / unsafe motion | Robot lurches between actions | Usually wrong n_action_steps / chunk size — tune action-step execution to smooth the chunk boundary |
| Moshi VRAM | OOM on load or mid-conversation crash | Check VRAM headroom; don't co-locate with the VLA on the same GPU; use the right model size |
| WebSocket races | Face freezes, intent arrives out of order, state machine stalls | Strict startup order (see runbook); add reconnection + sequence numbers on the bus |
| Grasp false positives | Reports "got it" with an empty gripper | Confirm grasp with current/force or a vision check before transitioning to REPORT |
| Network latency on the tether | Laggy teleop / control; conversation feels slow | Wired link Pi↔GPU where possible; measure round-trip; keep policy_server close to the host |
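
The "sequence numbers on the bus" mitigation can be as small as a per-sender counter check. The sketch below is one possible shape, assuming each sender stamps its messages with an increasing integer; the class and method names are hypothetical.

```python
class SequenceGate:
    """Drop bus messages that arrive late or twice, tracked per sender."""

    def __init__(self):
        self.last_seen = {}  # sender name -> highest sequence number accepted

    def accept(self, sender, seq):
        """Return True if the message is newer than anything seen from sender."""
        last = self.last_seen.get(sender, -1)
        if seq <= last:
            return False  # stale or duplicate: ignore it
        self.last_seen[sender] = seq
        return True

    def reset(self, sender):
        """Forget a sender, e.g. after a reconnect that restarts its counter."""
        self.last_seen.pop(sender, None)

gate = SequenceGate()
print(gate.accept("intent", 1))  # first message from this sender
print(gate.accept("intent", 3))  # newer, accepted even though 2 is missing
print(gate.accept("intent", 2))  # arrived late, rejected
```

This drops late messages instead of reordering them, which suits state updates where only the newest value matters. If every message must be processed, buffer and reorder instead.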
F · Runbook & Operating Notes
Keep a living RUNBOOK.md
- Exact reproduce steps — anyone can bring the demo up from cold
- Logged metrics: success %, reliability, latency, dates
- Calibration timestamps + which checkpoint is "the demo one"
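One low-effort way to keep those numbers honest is to append every run to a JSON Lines file and compute the summary from the file, never from memory. The sketch below is an assumption about how you might structure it; the field names and file layout are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def log_run(path, success, latency_ms, checkpoint, calibrated_at, failure_cause=None):
    """Append one trial to a JSON Lines file and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "success": bool(success),
        "latency_ms": float(latency_ms),
        "checkpoint": checkpoint,          # which checkpoint is "the demo one"
        "calibrated_at": calibrated_at,    # when calibration was last run
        "failure_cause": failure_cause,    # None on success
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

def summarize(path):
    """Return (trial count, success rate, mean latency in ms) over all logged runs."""
    text = Path(path).read_text(encoding="utf-8")
    runs = [json.loads(line) for line in text.splitlines() if line.strip()]
    if not runs:
        raise ValueError("no runs logged yet")
    rate = sum(r["success"] for r in runs) / len(runs)
    mean_latency = sum(r["latency_ms"] for r in runs) / len(runs)
    return len(runs), rate, mean_latency
```

Because the file is append-only, a bad take stays in the record, and the summary you paste into RUNBOOK.md always reflects every trial.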
Operating discipline
- Run long jobs in tmux so a dropped SSH doesn't kill them
- Log metrics every run — never trust a single take
- Verify lerobot / moshi / pipecat --help before running — these tools are version-sensitive