Foundations — ANIMA-Kiwi

A · Framing — why imitation learning is our backbone

We want a small mobile creature that drives to an object, grasps it, and talks about it. Three families of methods could produce the driving/grasping policy: (i) hand-engineered control + planning, (ii) reinforcement learning from a reward signal, or (iii) imitation learning — learn the policy directly from human demonstrations. For a 14-day build with one robot and no simulator-to-real pipeline, RL is too sample-hungry and reward design is its own research project. Hand-engineering generalizes poorly to clutter and novel objects. So we clone teleoperation demos: a human drives the Kiwi with a leader arm / gamepad, we record (observation, action) pairs, and we fit a policy that reproduces them. This is exactly the regime LeRobot is built for, and it is the cheapest path to a working creature.

Theory · Markov Decision Process recap

A control problem is an MDP ⟨S, A, P, r, ρ0, γ⟩. State s∈S (joint angles, base pose, camera frames), action a∈A (motor / velocity commands), transition dynamics P(s′|s,a), reward r(s,a), start distribution ρ0, discount γ. A policy πθ(a|s) maps states to a distribution over actions.

The policy plus the dynamics induce a trajectory distribution. Rolling out a policy for horizon T:

$$p_\pi(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$$

The key object for everything below is the state-visitation distribution a policy induces — the marginal over states you actually see when you run it:

$$d_\pi(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t} \Pr[\, s_t = s \mid \pi \,]$$

The central tension of imitation learning lives in one inequality: dπθ ≠ dπ*. The learner visits different states than the expert. Hold onto that — Sections B and C are entirely about its consequences.
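To make the visitation distribution concrete, here is a minimal sketch on a made-up three-state chain. The chain, the two policies, and the helper name `visitation` are illustrative assumptions, not part of the build. It uses the closed form d = (1−γ)(I − γ Pπᵀ)⁻¹ ρ0, which comes from summing the geometric series in the definition above.

```python
import numpy as np

def visitation(P, pi, rho0, gamma):
    """Discounted state-visitation d_pi(s) for a small tabular MDP.

    P:    (S, A, S) transition probabilities P(s'|s,a)
    pi:   (S, A)    policy probabilities pi(a|s)
    rho0: (S,)      start-state distribution
    """
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    n = len(rho0)
    # Geometric series in closed form: d = (1-gamma) * (I - gamma * P_pi^T)^{-1} rho0
    return (1 - gamma) * np.linalg.solve(np.eye(n) - gamma * P_pi.T, rho0)

# Hypothetical 3-state chain: action 0 moves left, action 1 moves right (walls at both ends).
n_states = 3
P = np.zeros((n_states, 2, n_states))
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0
    P[s, 1, min(s + 1, n_states - 1)] = 1.0

rho0 = np.array([1.0, 0.0, 0.0])
expert = np.tile([0.0, 1.0], (n_states, 1))    # always moves right
learner = np.tile([0.2, 0.8], (n_states, 1))   # moves left 20% of the time

d_expert = visitation(P, expert, rho0, gamma=0.9)
d_learner = visitation(P, learner, rho0, gamma=0.9)
print("expert :", d_expert.round(3))
print("learner:", d_learner.round(3))
```

The expert's visitation works out to [0.1, 0.09, 0.81]; the learner, with the same start state and dynamics, puts more mass on the early states. Same environment, different policy, different d.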

Maps to ETH lectures

Lecture 1 (Intro to Robot Learning) motivates learning over hand-engineering. Lecture 2 (Robot Control & MDPs) formalizes states, actions, dynamics, and policies — the vocabulary above. We build directly on both.

Maps to my build Our ≥50 teleop episodes are the trajectories τ; the Kiwi's joint + base + camera stream is s; the leader-arm / gamepad command is a. There is no reward function in our build — we never write r(s,a). That is precisely why we imitate instead of optimizing reward.

B · Behavioral Cloning — supervised learning of a policy

Behavioral cloning (BC) throws away the sequential structure and treats control as plain supervised learning. Collect a dataset of expert pairs D = { (si, ai) } drawn from the expert's own visitation dπ*, then fit a policy by maximum likelihood:

$$\theta^\star = \arg\min_\theta \; \mathbb{E}_{(s,a)\sim D}\big[ -\log \pi_\theta(a \mid s) \big]$$

For a deterministic policy a = fθ(s) this is just regression — minimize squared error (the Gaussian-likelihood special case with fixed variance):

$$\theta^\star = \arg\min_\theta \; \mathbb{E}_{(s,a)\sim D}\, \lVert f_\theta(s) - a \rVert^{2}$$
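The step from likelihood to squared error, spelled out for a Gaussian policy with fixed isotropic variance. The symbols σ (the fixed standard deviation) and k (the action dimension) are notation introduced here; the π inside the logarithm is the constant, not the policy.

```latex
\pi_\theta(a \mid s) = \mathcal{N}\big(a;\ f_\theta(s),\ \sigma^2 I\big)
\;\Longrightarrow\;
-\log \pi_\theta(a \mid s)
  = \frac{1}{2\sigma^2}\,\lVert f_\theta(s) - a \rVert^{2}
  + \frac{k}{2}\,\log\!\big(2\pi\sigma^2\big)
```

The second term does not depend on θ and the factor 1/(2σ²) only rescales the loss, so minimizing the negative log-likelihood and minimizing the squared error give the same θ.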

Why BC is the default — and why LeRobot ships it first:

It is cheap. No reward, no environment interaction during training, no simulator. Just demos in, gradient steps out.
It is stable. Supervised learning has well-understood optimization; you can diagnose over- and underfitting with a normal train/val split.
It scales with data, not with engineering. More good demos ⇒ a better policy, with no new code.

The catch is hidden in that expectation: it is taken over D∼dπ*, the expert's states. At test time the policy runs in closed loop and generates dπθ, its own states. BC optimizes the wrong distribution. Section C is the price of that mismatch.

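Maps to my build In LeRobot, our recorded episodes become D; an ACT or diffusion policy is πθ; lerobot-train fits it by supervised learning on D, which is the BC objective above with each policy class swapping in its own likelihood or surrogate loss. The Body policy we ship in week 2 is, at its core, this argmin.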

C · Covariate Shift — the central failure (and DAgger, the fix)

Suppose BC is good: it matches the expert on the expert's states with small per-step error. Define the per-step error as the probability the learner picks a different action than the expert under the expert's distribution:

$$\mathbb{E}_{s \sim d_{\pi^*}}\big[\, \mathbf{1}\{ \pi_\theta(s) \neq \pi^*(s) \} \,\big] \le \varepsilon$$

Here is the heartbreak. The first time the learner errs, it lands in a state the expert never visited — off dπ*. There, training said nothing, so the error rate is no longer bounded; the policy can do anything, drift further, and visit states even more foreign. Errors don't add up — they compound. The classic result (Ross & Bagnell):

Theory · The compounding-error bound

$$J(\pi_\theta) - J(\pi^*) \le O(\varepsilon\, T^{2})$$

where J is expected trajectory cost and T the horizon. If mistakes did not change which states the policy sees, as in ordinary i.i.d. supervised learning, the excess cost would be O(εT): errors only add. BC pays an extra factor of T because each early mistake pushes the policy into unseen states where it keeps erring for the rest of the episode.

Why the extra T? — the argument in one breath

Walk it through. At each of T steps there is probability ≤ ε of a "first mistake." After a mistake the learner is off-distribution and we conservatively assume it accrues cost on every remaining step, up to T more steps. Summing the expected number of mistakes (≈ εT) times the cost-tail of each (≈ T) gives ≈ εT². Quadratic in the horizon. Long episodes are punished hardest.
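The counting argument above can be checked numerically. This is a sketch of the worst-case accounting only, under the stated pessimistic assumption that a first mistake costs one unit on every remaining step; the function names are made up for illustration.

```python
def bc_worst_case_cost(eps, T):
    """Expected cost when the first mistake at step t costs 1 on every remaining step."""
    # P(first mistake happens at step t) = (1 - eps)^t * eps; cost tail = T - t steps
    return sum((1 - eps) ** t * eps * (T - t) for t in range(T))

def additive_cost(eps, T):
    """Expected cost when mistakes never change future states: one unit per mistake."""
    return eps * T

eps = 1e-3
for T in (50, 100, 200):
    print(f"T={T:4d}  additive={additive_cost(eps, T):.3f}  "
          f"compounding={bc_worst_case_cost(eps, T):.3f}")
```

Doubling T doubles the additive cost but nearly quadruples the compounding one (while εT stays small), which is the T versus T² gap in the bound.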

Intuition · driving up to the object

Imagine BC driving the Kiwi straight at a cup. A tiny steering error nudges the base 2 cm off the demonstrated line. Now the camera sees the cup from an angle no demo ever showed. The policy guesses, oversteers, lands somewhere even more novel — and by the time it should reach the cup it is facing a wall. The expert never made that first error, so the expert never saw any of those states, so the dataset never taught the recovery. The failure is not a bad gradient — it is a distribution the model was never trained on.

DAgger — train on the states the learner actually visits

DAgger (Dataset Aggregation) closes the loop. Instead of only labeling the expert's states, you run the current learner, collect the states it visits, ask the expert "what would you do here?", add those new labels to the dataset, and retrain. Iterate. Over rounds the training distribution converges to dπθ — the policy's own visitation — turning the quadratic bound back into a linear one.

1. Roll out the learner. Run πθ in the environment; record the states st it actually visits (including its mistakes).
2. Query the expert. For each visited state, get the expert action at = π*(st), the correct recovery.
3. Aggregate. Append these (s,a) pairs to the dataset: D ← D ∪ { (st, π*(st)) }.
4. Retrain. Refit πθ on the aggregated D; go back to step 1.

Diagram 1: the DAgger loop, a cycle through the four steps above (roll out the learner, query the expert, aggregate, retrain). Legend: learner steps (Body), expert query (theory), dataset growth.

Maps to ETH Lecture 3 — Imitation Learning · papers

Three results from Lecture 3 sharpen how we collect and represent demos. Each is summarized below with a build badge.

Lecture 3

Causal Confusion in Imitation Learning

de Haan, Jayaraman & Levine, 2019

Counter-intuitively, giving a BC policy more observation information can make it worse. The model latches onto whatever feature best predicts the expert action in the dataset, even a spurious one. The famous case is the "brake light" problem: a dashboard indicator that lights up because the expert is braking becomes the model's cue for braking, so the cloned driver brakes only once the light is already on and otherwise never starts. The related inertia problem has the same shape: a policy that can see it is stopped learns to stay stopped. In both, the cause-effect direction is confused. Fixes: intervene to break the spurious correlation (targeted expert queries, like DAgger), or learn a disentangled causal representation and then test which causal graph matches reality.

Maps to my build Don't feed the Kiwi policy "helpful" extra channels (e.g. last action, episode timer) blindly; they invite shortcut learning. Keep the observation lean and demonstrate recoveries.
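A toy illustration of the shortcut, not the paper's experiment: a hidden cause that rarely changes, a noisy observation of it, and the previous expert action as an extra input channel. All quantities here (the 95% stickiness, the noise level, the linear least-squares "policy") are assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20_000

# Hidden cause c_t (say, "obstacle ahead") is sticky: it keeps its value 95% of the time.
c = np.zeros(N, dtype=int)
for t in range(1, N):
    c[t] = c[t - 1] if rng.random() < 0.95 else 1 - c[t - 1]

expert_action = c.astype(float)                 # expert brakes exactly when the cause is present
obs = c + rng.normal(0.0, 1.0, size=N)          # noisy view of the true cause
prev_action = np.roll(expert_action, 1)         # the "helpful" extra channel
prev_action[0] = 0.0                            # no previous action at t = 0

# Linear BC by least squares on [observation, previous action, bias].
X = np.column_stack([obs, prev_action, np.ones(N)])
w_obs, w_prev, bias = np.linalg.lstsq(X, expert_action, rcond=None)[0]
print(f"weight on observation: {w_obs:.3f}   weight on previous action: {w_prev:.3f}")
```

The fit leans almost entirely on the previous action, because in the dataset it predicts the expert better than the noisy view of the real cause. In closed loop that channel carries the policy's own last output, so the learned rule mostly repeats itself.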

Lecture 3

The Surprising Effectiveness of Representation Learning for Visual Imitation

Pari, Shafiullah, Arunachalam & Pinto, 2021

Splits visual imitation into what to represent vs how to act. Use a self-supervised, frozen visual encoder (e.g. BYOL features) to embed observations, then act by non-parametric nearest-neighbours in that embedding space — locate the closest demonstrated states and copy their actions, no policy network trained at all. With good representations this matches or beats end-to-end BC from a handful of demos, and it is strikingly data-efficient. The lesson: a large share of visual imitation difficulty is a representation problem, not a control problem.

Maps to my build If our learned policy struggles on few demos, a frozen pretrained encoder + nearest-neighbour lookup is a fast, debuggable baseline for the Body task.
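A minimal sketch of that baseline, assuming embeddings are already available. The `encode` function is a hypothetical placeholder for a frozen pretrained encoder (here simply the identity on low-dimensional vectors), and the exponential distance weighting is one reasonable choice, not necessarily the paper's exact scheme.

```python
import numpy as np

def encode(obs):
    """Placeholder for a frozen, pretrained visual encoder (hypothetical).
    Observations here are already small vectors, so this is the identity."""
    return np.asarray(obs, dtype=float)

def knn_action(query_obs, demo_embs, demo_actions, k=3):
    """Distance-weighted average of the actions of the k nearest demo embeddings."""
    dists = np.linalg.norm(demo_embs - encode(query_obs), axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest])        # closer neighbours count more
    weights /= weights.sum()
    return weights @ demo_actions[nearest]

# Tiny made-up demo set: 2-D "embeddings" and 2-D actions.
demo_obs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
demo_actions = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, -1.0], [-1.0, 0.0]])
demo_embs = np.stack([encode(o) for o in demo_obs])

a_near = knn_action([0.1, 0.0], demo_embs, demo_actions, k=1)   # copies the closest demo
a_blend = knn_action([0.5, 0.5], demo_embs, demo_actions, k=4)  # blends all four demos
print(a_near, a_blend)
```

There is nothing to train: swapping the encoder or adding demos changes behavior immediately, which is what makes this easy to debug.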

Lecture 3

Transporter Networks: Rearranging the Visual World

Zeng et al., 2020

For pick-and-place / rearrangement, instead of regressing raw end-effector coordinates, reframe action as spatial template-matching. The network predicts a pick location as a dense heat-map over the image, then learns a place by cross-correlating (transporting) the features around the pick point against the scene to score every candidate place pose. This spatial-equivariant formulation is dramatically sample-efficient — a handful of demos generalizes to new object arrangements — because the inductive bias matches the structure of the task. A clean example of action parameterization doing the heavy lifting.

Maps to my build The Kiwi's grasp is a pick task. If end-effector-coordinate BC is brittle, a spatial-action head is the principled upgrade for Body grasping.
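The pick-then-correlate idea in miniature. This sketch uses hand-made single-channel arrays in place of learned feature maps and skips rotations entirely; `pick_heat` and `key_feat` are stand-ins for network outputs, and all names are hypothetical.

```python
import numpy as np

def pick_pixel(heatmap):
    """Pick location = argmax of a dense per-pixel score map."""
    return np.unravel_index(np.argmax(heatmap), heatmap.shape)

def place_scores(key_feat, crop):
    """Cross-correlate the crop around the pick against every location of key_feat."""
    H, W = key_feat.shape
    h, w = crop.shape
    scores = np.empty((H - h + 1, W - w + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            scores[i, j] = np.sum(key_feat[i:i + h, j:j + w] * crop)
    return scores

pattern = np.arange(1.0, 10.0).reshape(3, 3)       # made-up "object features"

obs = np.zeros((12, 16)); obs[3:6, 4:7] = pattern   # object sits around pixel (4, 5)
pick_heat = np.zeros((12, 16)); pick_heat[4, 5] = 1.0          # stand-in for the pick head
key_feat = np.zeros((12, 16)); key_feat[6:9, 9:12] = pattern   # slot that matches the object

pick = pick_pixel(pick_heat)
crop = obs[pick[0] - 1:pick[0] + 2, pick[1] - 1:pick[1] + 2]   # 3x3 window around the pick
scores = place_scores(key_feat, crop)
place = np.unravel_index(np.argmax(scores), scores.shape)      # top-left corner of best match
print("pick pixel:", tuple(map(int, pick)), "| place corner:", tuple(map(int, place)))
```

Both actions are read off as image locations rather than regressed as coordinates, so shifting the object or the slot shifts the answer with it.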

Maps to my build We can't query a human at 30 Hz mid-rollout, so we approximate DAgger by recording demos that include the failure region: explicitly teleop "drive up to the object from a bad angle and recover." Recording driving-up-to-the-object is our hedge against covariate shift — it injects the learner's likely off-distribution states into D up front.

D · The Multimodality Problem — why a plain regressor mode-averages

Covariate shift is about where the policy is. Multimodality is about what it predicts even on perfectly in-distribution states. Human demonstrations are multimodal: from the same state there are often several equally-good actions. Approaching an obstacle, the expert sometimes swerves left, sometimes right — both reach the goal. A unimodal model (an MSE regressor, or a single Gaussian) cannot represent "left or right." It is forced to commit to one number, and it picks the worst possible one: the average.

Theory · the L2-optimal predictor is the conditional mean

Minimize expected squared error of a deterministic predictor f(s) against the true action distribution p(a|s):

$$L(f) = \mathbb{E}_{a \sim p(a \mid s)}\, \lVert f(s) - a \rVert^{2}$$

Take the gradient w.r.t. the output and set it to zero (drop the state for clarity, write f for f(s)):

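$$\frac{\partial L}{\partial f} = \mathbb{E}_{a}\big[\, 2(f - a) \,\big] = 0 \;\Longrightarrow\; f = \mathbb{E}_{a \sim p(a \mid s)}[\,a\,] = \mathbb{E}[\,a \mid s\,]$$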

The unique L2 minimizer is the conditional mean. If p(a|s) is bimodal at left and right, the mean sits between them — pointing straight at the obstacle, an action neither demonstrator ever took. (The same collapse afflicts a single-Gaussian likelihood: its MLE mean is also 𝔼[a|s].)
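A quick numerical check of that result, using made-up bimodal actions at a single state (steer left at −1 or right at +1, with a little noise). It scans candidate outputs and reports which one minimizes the mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bimodal expert actions at one state: left (-1) or right (+1), small noise on each.
modes = rng.choice([-1.0, 1.0], size=10_000)
actions = modes + rng.normal(0.0, 0.05, size=modes.shape)

candidates = np.linspace(-1.5, 1.5, 301)                        # possible outputs f
mse = ((candidates[:, None] - actions[None, :]) ** 2).mean(axis=1)
best = candidates[np.argmin(mse)]

near_best = float((np.abs(actions - best) < 0.5).mean())        # demos close to that output
print("L2-optimal output:", round(float(best), 2))
print("sample mean      :", round(float(actions.mean()), 2))
print("share of demos within 0.5 of the L2-optimal output:", near_best)
```

The minimizer lands on the sample mean, near zero, and no demonstrated action is anywhere close to it.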

Diagram 2: a bimodal action distribution. Both modes (left, right) are valid expert behavior; the conditional mean the MSE policy is forced to output sits between them, at the one action that fails. Legend: mode 1 (left detour), mode 2 (right detour), MSE prediction = mean, obstacle.

Forward link → Page 2 · Action Models

The whole reason the next page exists is this figure. To not mode-average we need policies that model the full, multimodal action distribution: CVAEs / ACT (latent-conditioned action chunks), diffusion policies, and flow matching. Each can place probability mass on both left and right and then sample one — never the dead-center average. Continue to 2 · Action Models.

Maps to my build Multimodality is exactly why we do not ship a plain MSE policy for the Kiwi. Human teleop of the same approach is naturally multimodal, so our default Body policy is an expressive one (ACT / diffusion) that can represent several valid ways to reach the object.

E · The hidden assumption — i.i.d. supervised learning vs the closed loop

Both failures above trace to one broken assumption. Supervised learning assumes train and test data are drawn i.i.d. from the same fixed distribution. In control that is false: the policy generates its own test data by acting, so the test distribution is a function of the policy you trained. Train ≠ test, by construction.

Diagram 3: the broken assumption. Left: one distribution serves train and test (the i.i.d. ideal). Right: the policy acts, the environment moves, and the policy's own visitation drifts away from what it trained on. Labels: i.i.d. (same distribution), policy, environment, distribution drift.

Aspect	i.i.d. supervised learning	Closed-loop control (our reality)
Test distribution	Fixed, equals train	Generated by the policy; depends on θ
Errors over time	Independent per example	Compound: O(εT²)
Target structure	Often unimodal label	Multimodal expert actions
Fix	More i.i.d. data	DAgger-style on-policy labels + expressive policy

Maps to my build This single insight justifies our whole data-collection protocol: we deliberately demonstrate the states the Kiwi will drift into, and we choose a policy class that doesn't collapse multimodal targets. The lab below lets you watch both failures happen on a 2D toy — no robot required.

Lab — Feel covariate shift and mode-averaging yourself

Lab 1 · A 2D point-to-goal toy: watch MSE-BC fail, then watch DAgger and a 2-mode head fix it

Pure PyTorch + NumPy + Matplotlib. No robot, no simulator, runs on CPU in seconds. The point is to see the theory above with your own eyes before we touch hardware.

Setup

# Python 3.10+ — versions are forgiving here, this is a toy.
pip install torch numpy matplotlib

1. Generate a multimodal expert dataset. A point starts on the left and must reach a goal on the right, with an obstacle in the middle. Per episode, the scripted expert takes either the upper or the lower detour (the code alternates between them, giving an even split). That choice is what injects multimodality: from the same start state, the expert action is sometimes "go up," sometimes "go down."

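import numpy as np
import torch
import torch.nn as nn

np.random.seed(0); torch.manual_seed(0)   # fixed seeds so reruns of the toy are comparable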

GOAL = np.array([9., 0.]); OBST = np.array([4.5, 0.]); R = 1.4

def expert_action(s, route):
    # route=+1 detour up, route=-1 detour down. Returns a velocity step.
    way = GOAL if s[0] > OBST[0] else OBST + np.array([0., route*2.2])
    d = way - s; n = np.linalg.norm(d) + 1e-8
    return (d / n) * 0.6            # fixed-speed step toward the waypoint

def rollout_expert(route, steps=40):
    s = np.array([0., np.random.uniform(-.5, .5)]); S, A = [], []
    for _ in range(steps):
        a = expert_action(s, route); S.append(s.copy()); A.append(a); s = s + a
    return np.array(S), np.array(A)

S, A = [], []
for ep in range(120):                  # alternate routes: 60 go up, 60 go down
    s, a = rollout_expert(route=1 if ep % 2 else -1)
    S.append(s); A.append(a)
Sd = torch.tensor(np.concatenate(S), dtype=torch.float32)
Ad = torch.tensor(np.concatenate(A), dtype=torch.float32)
print("dataset:", Sd.shape, Ad.shape)   # 120 episodes x 40 steps = 4800 (s,a) pairs, bimodal before the obstacle

2. Train an MSE behavioral-cloning MLP and roll it out. Standard regression. Then run the policy in closed loop from a fresh start and watch two things: near the obstacle it predicts the average of "up" and "down" (roughly straight ahead, into the obstacle: mode-averaging), and any small deviation puts it in states the demos never covered (covariate shift).

def mlp(out=2):
    return nn.Sequential(nn.Linear(2,128), nn.ReLU(),
                         nn.Linear(128,128), nn.ReLU(), nn.Linear(128,out))

bc = mlp(); opt = torch.optim.Adam(bc.parameters(), 1e-3)
for it in range(3000):                  # pure MSE = the L2 objective from Section D
    pred = bc(Sd); loss = ((pred - Ad)**2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

def rollout_policy(net, steps=60):
    s = np.array([0., 0.]); traj = [s.copy()]
    for _ in range(steps):
        a = net(torch.tensor(s, dtype=torch.float32)).detach().numpy()
        s = s + a; traj.append(s.copy())
    return np.array(traj)

traj = rollout_policy(bc)
closest = np.linalg.norm(traj - OBST, axis=1).min()   # smallest distance from the path to the obstacle centre
hit = closest < R
print("closest approach to obstacle:", round(float(closest), 2),
      "| COLLISION" if hit else "| clear")
# Expected: the MSE policy drives ~straight, collides, and/or drifts off into
# states no demo covered. That is mode-averaging + covariate shift, live.
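The lab installs Matplotlib but the steps only print numbers, so here is a small plotting helper as a sketch. The function name `plot_rollout` and the output file name are made up; the stand-in trajectory only exists so the snippet runs on its own, and in the lab you would pass the `traj`, `GOAL`, `OBST`, and `R` defined above.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # headless backend: write a file instead of opening a window
import matplotlib.pyplot as plt

def plot_rollout(traj, goal, obst, radius, path="rollout.png"):
    """Draw one 2D trajectory with the obstacle disc and the goal marker, save to `path`."""
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.add_patch(plt.Circle(obst, radius, color="tab:red", alpha=0.3, label="obstacle"))
    ax.plot(traj[:, 0], traj[:, 1], "-o", ms=3, label="policy rollout")
    ax.plot(goal[0], goal[1], "g*", ms=14, label="goal")
    ax.set_aspect("equal")
    ax.set_xlabel("x"); ax.set_ylabel("y")
    ax.legend(loc="upper left")
    fig.savefig(path, dpi=120, bbox_inches="tight")
    plt.close(fig)
    return path

# Stand-in straight-line trajectory (it runs through the obstacle on purpose).
demo_traj = np.column_stack([np.linspace(0.0, 9.0, 30), np.zeros(30)])
saved = plot_rollout(demo_traj, goal=np.array([9.0, 0.0]),
                     obst=np.array([4.5, 0.0]), radius=1.4)
print("wrote", saved)
```

Calling it once per step (MSE-BC, after DAgger, 2-mode head) with different `path` values gives three pictures to compare side by side.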

3. One DAgger iteration. We have a scripted expert, so we can do the real thing: roll out the current policy, relabel the states it actually visited with the expert's action, aggregate, retrain. Error on the policy's own visitation drops.

def dagger_relabel(net, n=40):
    Sn, An = [], []
    for _ in range(n):
        s = np.array([0., np.random.uniform(-.5,.5)])
        route = np.random.choice([-1, 1])     # expert still has a preferred mode
        for _ in range(60):
            a_star = expert_action(s, route)       # label = what the EXPERT would do here
            Sn.append(s.copy()); An.append(a_star)
            a = net(torch.tensor(s,dtype=torch.float32)).detach().numpy()  # move by LEARNER
            s = s + a
    return torch.tensor(np.array(Sn),dtype=torch.float32), torch.tensor(np.array(An),dtype=torch.float32)

Sx, Ax = dagger_relabel(bc)
with torch.no_grad():                                    # error on learner-visited states BEFORE retraining
    before = float(((bc(Sx) - Ax)**2).mean())
Sd2 = torch.cat([Sd, Sx]); Ad2 = torch.cat([Ad, Ax])     # AGGREGATE
for it in range(3000):                       # RETRAIN on the union
    loss = ((bc(Sd2)-Ad2)**2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
with torch.no_grad():                                    # same states, AFTER retraining
    after = float(((bc(Sx) - Ax)**2).mean())
print("MSE on learner-visited states, before -> after 1 DAgger round:",
      round(before, 4), "->", round(after, 4))
# Expected: the error on learner-visited states drops; the model has now seen its own drift states.
# Note it still mode-averages near the obstacle: DAgger fixes covariate shift,
# NOT multimodality. Those are two different diseases. Hence step 4.

4. (Optional) Swap the MSE head for a 2-mode mixture head. Predict two candidate actions plus a logit choosing between them, and train with the "min-over-modes" loss so each mode specializes (this is the toy ancestor of the CVAE/diffusion idea on Page 2). At rollout, sample a mode, so the policy commits to a real detour instead of the dead-center average.

K = 2
head = mlp(out=K*2 + K)     # K action means (2D each) + K mixing logits
opt2 = torch.optim.Adam(head.parameters(), 1e-3)
for it in range(4000):
    o = head(Sd); mus = o[:, :K*2].reshape(-1, K, 2); logit = o[:, K*2:]
    se = ((mus - Ad[:, None, :])**2).sum(-1)       # (N,K) error to each mode
    win = se.argmin(1)                                # assign each sample to its closest mode
    reg = se[torch.arange(len(se)), win].mean()       # "min-over-modes" regression loss
    cls = nn.functional.cross_entropy(logit, win)        # learn the mixing weights
    loss = reg + cls
    opt2.zero_grad(); loss.backward(); opt2.step()

def sample_action(s):
    with torch.no_grad():                                # inference only, no gradients needed
        o = head(torch.tensor(s, dtype=torch.float32))
    mus = o[:K*2].reshape(K, 2); k = torch.distributions.Categorical(logits=o[K*2:]).sample()
    return mus[k].numpy()                               # COMMIT to one mode, never the mean

def rollout_sampled(steps=60):                          # closed-loop rollout driven by sample_action
    s = np.array([0., 0.]); traj = [s.copy()]
    for _ in range(steps):
        s = s + sample_action(s); traj.append(s.copy())
    return np.array(traj)
# Expected: rollout_sampled() follows a clean upper OR lower detour and
# reaches the goal. Re-run it a few times and you should see BOTH routes recovered.
# The mode is re-sampled at every step, so the first few steps may wobble
# before the rollout settles on one side.

Expected observations

Step	What you should see	Which theory it shows
2 · MSE-BC	Policy drives roughly straight, grazes/hits the obstacle, then drifts into untrained regions	Mode-averaging (§D) + covariate shift (§C)
3 · DAgger	On-policy MSE drops; the policy stops wandering into novel states — but still steers center near the obstacle	DAgger fixes covariate shift, not multimodality (§C)
4 · 2-mode head	Rollouts pick a clean upper or lower route and reach the goal; re-sampling shows both	An expressive multimodal policy (teaser for Page 2)

Done when

You can state in one sentence why MSE fails on multimodal data — it outputs 𝔼[a|s], the mean of the modes, which is itself not a valid action.
You can explain why DAgger reduces compounding error: it trains on the learner's own visited states, which shrinks the train/test distribution gap and moves the cost bound from O(εT²) toward O(εT).
You observed empirically that DAgger and the 2-mode head fix different problems (covariate shift vs multimodality) — one does not substitute for the other.
The 2-mode rollout recovers both detour routes across re-samples, not the average of them.

ETH homework pointers (runnable)

Cement the theory with the official course code: github.com/mees-robot-learning-course/ethz-course-2026. The Control / MDPs assignment (Lecture 2) and the Imitation Learning assignment (Lecture 3, including a DAgger implementation) are runnable and map one-to-one onto Sections A–C above. Do those before page 2.

Version note: the toy lab above is pinned to nothing fancy — any recent torch ≥ 2.x works. The ETH repo may pin specific versions; create a fresh venv from its requirements.txt rather than your LeRobot env to avoid clashes.

Maps to my build This 2D toy is the Kiwi's "drive up to the object" task in miniature. The obstacle is real clutter; the two detours are the multiple valid approaches a human teleoperator would naturally take; the drift is exactly what we'll see on hardware if we ship plain MSE-BC. Page 2 turns the step-4 mixture head into the real expressive policies (ACT, diffusion, flow matching) we'll deploy.