A · Framing — why imitation learning is our backbone
We want a small mobile creature that drives to an object, grasps it, and talks about it. Three families of methods could produce the driving/grasping policy: (i) hand-engineered control + planning, (ii) reinforcement learning from a reward signal, or (iii) imitation learning — learn the policy directly from human demonstrations. For a 14-day build with one robot and no simulator-to-real pipeline, RL is too sample-hungry and reward design is its own research project. Hand-engineering generalizes poorly to clutter and novel objects. So we clone teleoperation demos: a human drives the Kiwi with a leader arm / gamepad, we record (observation, action) pairs, and we fit a policy that reproduces them. This is exactly the regime LeRobot is built for, and it is the cheapest path to a working creature.
A control problem is an MDP ⟨S, A, P, r, ρ0, γ⟩. State s∈S (joint angles, base pose, camera frames), action a∈A (motor / velocity commands), transition dynamics P(s′|s,a), reward r(s,a), start distribution ρ0, discount γ. A policy πθ(a|s) maps states to a distribution over actions.
The policy plus the dynamics induce a trajectory distribution. Rolling out a policy for horizon T:
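Written out in LaTeX, under the standard assumption that the policy and the dynamics factorize step by step (the names τ and pθ are introduced here for the trajectory and its density):

```latex
\tau = (s_0, a_0, s_1, a_1, \dots, s_T), \qquad
p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)
```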
The key object for everything below is the state-visitation distribution a policy induces — the marginal over states you actually see when you run it:
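One common finite-horizon form is sketched below. It is an assumption about the convention: some texts use a γ-discounted sum instead of the uniform time average.

```latex
d^{\pi_\theta}(s) = \frac{1}{T} \sum_{t=0}^{T-1} \Pr\left( s_t = s \mid \pi_\theta, P, \rho_0 \right)
```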
The central tension of imitation learning lives in one inequality: dπθ ≠ dπ*. The learner visits different states than the expert. Hold onto that — Sections B and C are entirely about its consequences.
Lecture 1 (Intro to Robot Learning) motivates learning over hand-engineering. Lecture 2 (Robot Control & MDPs) formalizes states, actions, dynamics, and policies — the vocabulary above. We build directly on both.
B · Behavioral Cloning — supervised learning of a policy
Behavioral cloning (BC) throws away the sequential structure and treats control as plain supervised learning. Collect a dataset of expert pairs D = { (si, ai) } drawn from the expert's own visitation dπ*, then fit a policy by maximum likelihood:
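In LaTeX, writing maximum likelihood as a minimization of the negative log-likelihood (N, the dataset size, is a symbol introduced here):

```latex
\theta^\star = \arg\min_\theta \; \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ -\log \pi_\theta(a \mid s) \right]
\approx \arg\min_\theta \; \frac{1}{N} \sum_{i=1}^{N} -\log \pi_\theta(a_i \mid s_i)
```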
For a deterministic policy a = fθ(s) this is just regression — minimize squared error (the Gaussian-likelihood special case with fixed variance):
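A sketch of that regression objective in the same notation, assuming an L2 norm over the action vector:

```latex
\theta^\star = \arg\min_\theta \; \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ \lVert f_\theta(s) - a \rVert_2^2 \right]
```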
Why BC is the default — and why LeRobot ships it first:
- It is cheap. No reward, no environment interaction during training, no simulator. Just demos in, gradient steps out.
- It is stable. Supervised learning has well-understood optimization; you can diagnose overfitting or underfitting with a normal train/val split.
- It scales with data, not with engineering. More good demos ⇒ a better policy, with no new code.
The catch is hidden in that expectation: it is taken over D∼dπ*, the expert's states. At test time the policy runs in closed loop and generates dπθ, its own states. BC optimizes the wrong distribution. Section C is the price of that mismatch.
An act or diffusion policy is πθ; lerobot-train runs exactly the BC objective above. The Body policy we ship in week 2 is, at its core, this argmin.
C · Covariate Shift — the central failure (and DAgger, the fix)
Suppose BC is good: it matches the expert on the expert's states with small per-step error. Define the per-step error as the probability the learner picks a different action than the expert under the expert's distribution:
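One way to write that definition in LaTeX, assuming a deterministic expert π* so that "the expert's action" is well defined:

```latex
\varepsilon = \mathbb{E}_{s \sim d^{\pi^\star}} \left[ \Pr_{a \sim \pi_\theta(\cdot \mid s)} \left( a \neq \pi^\star(s) \right) \right]
```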
Here is the heartbreak. The first time the learner errs, it lands in a state the expert never visited, off dπ*. Training said nothing about such states, so the ε guarantee no longer holds there: the policy can do anything, drift further, and visit states even more foreign. Errors don't just add up; they compound. The classic result (Ross & Bagnell):
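In LaTeX, the bound is usually stated as follows, assuming a per-step cost bounded in [0, 1]:

```latex
J(\pi_\theta) \le J(\pi^\star) + T^2 \varepsilon
```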
where J is expected trajectory cost and T the horizon. If errors only added, as in i.i.d. supervised learning, the excess cost would be O(εT). BC pays an extra factor of T because each early mistake pushes the policy into unseen states where it can keep erring for the rest of the episode.
Why the extra T? — the argument in one breath
Walk it through. At each of T steps there is probability ≤ ε of a "first mistake." After a mistake the learner is off-distribution, and we conservatively assume it accrues cost on every remaining step, up to T more steps. Multiplying the expected number of mistakes (≈ εT) by the cost tail of each (≈ T) gives ≈ εT². The excess cost is quadratic in the horizon, so long episodes are punished hardest.
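The counting argument above can be checked numerically. This is a minimal sketch of the worst-case model only, not of any real policy; the function name `expected_excess_cost` is made up for illustration, and it assumes a unit cost on every step after the first mistake.

```python
def expected_excess_cost(eps: float, T: int) -> float:
    """Worst-case model: the first mistake lands on step t with probability
    (1 - eps)**(t - 1) * eps and then costs 1 on each of the T - t + 1
    remaining steps (including step t itself)."""
    return sum((1 - eps) ** (t - 1) * eps * (T - t + 1) for t in range(1, T + 1))

eps = 0.01
for T in (10, 50, 100, 200):
    worst = expected_excess_cost(eps, T)
    # Compare against the linear (eps*T) and quadratic (eps*T^2) reference curves.
    print(f"T={T:4d}  worst-case={worst:8.3f}  eps*T={eps * T:6.2f}  eps*T^2={eps * T * T:8.2f}")
```

For small εT the worst-case value tracks εT(T+1)/2, so doubling the horizon roughly quadruples the excess cost.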
Imagine BC driving the Kiwi straight at a cup. A tiny steering error nudges the base 2 cm off the demonstrated line. Now the camera sees the cup from an angle no demo ever showed. The policy guesses, oversteers, lands somewhere even more novel — and by the time it should reach the cup it is facing a wall. The expert never made that first error, so the expert never saw any of those states, so the dataset never taught the recovery. The failure is not a bad gradient — it is a distribution the model was never trained on.
DAgger — train on the states the learner actually visits
DAgger (Dataset Aggregation) closes the loop. Instead of only labeling the expert's states, you run the current learner, collect the states it visits, ask the expert "what would you do here?", add those new labels to the dataset, and retrain. Iterate. Over rounds the training distribution converges to dπθ — the policy's own visitation — turning the quadratic bound back into a linear one.
1. Roll out the learner. Run πθ in the environment; record the states st it actually visits (including its mistakes).
2. Query the expert. For each visited state, get the expert action at = π*(st), the correct recovery.
3. Aggregate. Append these (s,a) pairs to the dataset: D ← D ∪ { (st, π*(st)) }.
4. Retrain. Refit πθ on the aggregated D; go back to step 1.
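The loop above can be sketched on a 1-D toy problem. Everything here is an illustrative assumption rather than the course's implementation: the saturating `expert`, the cubic-feature least-squares learner, the noise level, and the round counts are all made up to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
T, NOISE = 20, 0.3                      # horizon and process noise (toy values)

def expert(s):
    # Hypothetical expert: push the state back toward 0, saturating at +-1.
    return -np.clip(s, -1.0, 1.0)

def fit(S, A):
    # Least-squares behavioral cloning on the features [s, s^3].
    X = np.stack([S, S ** 3], axis=1)
    w, *_ = np.linalg.lstsq(X, A, rcond=None)
    return w

def learner(w):
    # Learner policy, clipped to the same actuator limits as the expert.
    return lambda s: np.clip(w[0] * s + w[1] * s ** 3, -1.0, 1.0)

def rollout(act, n_episodes):
    # Run `act` in closed loop and return every state it visited.
    states = []
    for _ in range(n_episodes):
        s = rng.uniform(-0.5, 0.5)
        for _ in range(T):
            states.append(s)
            s = s + act(s) + NOISE * rng.normal()
    return np.array(states)

# Round 0: plain BC on expert rollouts.
S = rollout(expert, n_episodes=10)
A = expert(S)
w = fit(S, A)
n_initial = len(S)

ROUNDS, EPISODES_PER_ROUND = 5, 10
for _ in range(ROUNDS):
    S_new = rollout(learner(w), EPISODES_PER_ROUND)   # 1. roll out the learner
    A_new = expert(S_new)                             # 2. query the expert
    S = np.concatenate([S, S_new])                    # 3. aggregate
    A = np.concatenate([A, A_new])
    w = fit(S, A)                                     # 4. retrain

# Evaluate on fresh states visited by the final learner.
S_eval = rollout(learner(w), 20)
on_policy_mse = float(np.mean((learner(w)(S_eval) - expert(S_eval)) ** 2))
print("dataset size:", len(S), "| on-policy MSE:", round(on_policy_mse, 5))
```

The dataset only grows: every round adds learner-visited states with expert labels, which is what distinguishes this from re-collecting expert demos.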
Three results from Lecture 3 sharpen how we collect and represent demos. Each is summarized below with a build badge.
Causal Confusion in Imitation Learning
Counter-intuitively, giving a BC policy more observation information can make it worse. The model latches onto whatever feature best predicts the expert action in the dataset — even a spurious one. The famous case is the inertia / "brake light" problem: a dashboard indicator that lights up because the expert is braking becomes the model's cue for braking, so the cloned driver freezes whenever it is already stopped. The cause-effect direction is confused. Fixes: intervene to break the spurious correlation (targeted expert queries, like DAgger) or learn a disentangled causal representation, then test which graph matches reality.
The Surprising Effectiveness of Representation Learning for Visual Imitation
Splits visual imitation into what to represent vs how to act. Use a self-supervised, frozen visual encoder (e.g. BYOL features) to embed observations, then act by non-parametric nearest-neighbours in that embedding space — locate the closest demonstrated states and copy their actions, no policy network trained at all. With good representations this matches or beats end-to-end BC from a handful of demos, and it is strikingly data-efficient. The lesson: a large share of visual imitation difficulty is a representation problem, not a control problem.
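The "act by nearest neighbours" half can be sketched in a few lines. This is a minimal illustration, not the paper's code: random vectors stand in for the frozen encoder's embeddings, and the function name `nn_policy`, the value of k, and the exp(-distance) weighting are assumptions.

```python
import numpy as np

def nn_policy(query_emb, demo_embs, demo_actions, k=3):
    """Distance-weighted average of the actions of the k nearest demo states."""
    dists = np.linalg.norm(demo_embs - query_emb, axis=1)
    idx = np.argsort(dists)[:k]                 # indices of the k closest demos
    weights = np.exp(-dists[idx])               # closer demos count more (assumed weighting)
    weights /= weights.sum()
    return weights @ demo_actions[idx]

# Stand-in for frozen encoder outputs: 50 demo frames, 8-D embeddings, 2-D actions.
rng = np.random.default_rng(0)
demo_embs = rng.normal(size=(50, 8))
demo_actions = rng.normal(size=(50, 2))

# Querying exactly at a demo embedding with k=1 copies that demo's action.
a_copy = nn_policy(demo_embs[7], demo_embs, demo_actions, k=1)
a_blend = nn_policy(rng.normal(size=8), demo_embs, demo_actions, k=3)
print("copied action:", a_copy, "| blended action:", a_blend)
```

Note that averaging over k > 1 neighbours can itself blend distinct modes, so k is a real design choice here.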
Transporter Networks: Rearranging the Visual World
For pick-and-place / rearrangement, instead of regressing raw end-effector coordinates, reframe action as spatial template matching. The network predicts a pick location as a dense heat-map over the image, then scores every candidate place pose by cross-correlating (transporting) the features cropped around the pick point against the scene. This spatially equivariant formulation is dramatically sample-efficient: a handful of demos generalize to new object arrangements, because the inductive bias matches the structure of the task. A clean example of action parameterization doing the heavy lifting.
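The cross-correlation step can be illustrated with hand-made feature maps. This is only a sketch of the matching mechanics: `query_feats` and `key_feats` are hypothetical stand-ins for the outputs of learned networks, and rotation of the crop is ignored entirely.

```python
import numpy as np

def cross_correlate(scene, query):
    """Valid-mode 2-D cross-correlation: score every placement of `query` in `scene`."""
    H, W = scene.shape
    h, w = query.shape
    scores = np.empty((H - h + 1, W - w + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            scores[i, j] = np.sum(scene[i:i + h, j:j + w] * query)
    return scores

rng = np.random.default_rng(0)
query_feats = np.zeros((32, 32))       # stand-in for features the crop is taken from
key_feats = np.zeros((32, 32))         # stand-in for features the crop is matched against

# Assumed toy setup: the picked object and its target slot share one 5x5 feature pattern.
pattern = rng.uniform(0.5, 1.0, size=(5, 5))
query_feats[4:9, 6:11] = pattern       # object occupies rows 4..8, cols 6..10
key_feats[20:25, 15:20] = pattern      # target slot has top-left corner (20, 15)

pick = (6, 8)                          # pick pixel = centre of the object
crop = query_feats[pick[0] - 2:pick[0] + 3, pick[1] - 2:pick[1] + 3]
scores = cross_correlate(key_feats, crop)          # dense score for every place position
place_corner = np.unravel_index(scores.argmax(), scores.shape)
print("best place position (top-left of crop):", place_corner)
```

The place decision falls out of an argmax over a dense score map rather than a regressed coordinate, which is the action parameterization the paragraph above describes.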
D · The Multimodality Problem — why a plain regressor mode-averages
Covariate shift is about where the policy is. Multimodality is about what it predicts even on perfectly in-distribution states. Human demonstrations are multimodal: from the same state there are often several equally-good actions. Approaching an obstacle, the expert sometimes swerves left, sometimes right — both reach the goal. A unimodal model (an MSE regressor, or a single Gaussian) cannot represent "left or right." It is forced to commit to one number, and it picks the worst possible one: the average.
Minimize expected squared error of a deterministic predictor f(s) against the true action distribution p(a|s):
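In LaTeX, with the loss written as a function of the predictor f (the symbol 𝓛 is introduced here):

```latex
\mathcal{L}(f) = \mathbb{E}_{a \sim p(a \mid s)} \left[ \lVert f(s) - a \rVert_2^2 \right]
```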
Take the gradient w.r.t. the output and set it to zero (drop the state for clarity, write f for f(s)):
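A sketch of that step in LaTeX, with the expectation taken over a ∼ p(a|s):

```latex
\nabla_f \, \mathbb{E}_{a} \left[ \lVert f - a \rVert_2^2 \right]
= 2 \left( f - \mathbb{E}[a] \right) = 0
\quad \Longrightarrow \quad
f^\star(s) = \mathbb{E}[a \mid s]
```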
The unique L2 minimizer is the conditional mean. If p(a|s) is bimodal at left and right, the mean sits between them — pointing straight at the obstacle, an action neither demonstrator ever took. (The same collapse afflicts a single-Gaussian likelihood: its MLE mean is also 𝔼[a|s].)
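A quick numerical check of this claim, as a minimal sketch: the ±1 "left/right" actions and the brute-force grid are assumptions chosen to make the effect obvious.

```python
import numpy as np

# Bimodal expert at one state: half the demos steer left (-1), half steer right (+1).
actions = np.array([-1.0] * 50 + [1.0] * 50)

# Brute-force the constant prediction f that minimizes mean squared error.
candidates = np.linspace(-1.5, 1.5, 301)
mse = ((candidates[:, None] - actions[None, :]) ** 2).mean(axis=1)
f_star = candidates[mse.argmin()]

mse_at_mean = mse.min()
mse_at_mode = ((1.0 - actions) ** 2).mean()     # always predicting "right"
print("MSE-optimal action:", f_star, "| sample mean:", actions.mean())
print("MSE at the mean:", mse_at_mean, "| MSE at a real mode:", mse_at_mode)
```

The loss at the in-between action is lower than at either real mode, which is exactly why an MSE regressor prefers the action no demonstrator took.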
This mode-averaging failure is the whole reason the next page exists. To avoid it we need policies that model the full, multimodal action distribution: CVAEs / ACT (latent-conditioned action chunks), diffusion policies, and flow matching. Each can place probability mass on both left and right and then sample one, never the dead-center average. Continue to 2 · Action Models.
E · The hidden assumption — i.i.d. supervised learning vs the closed loop
Both failures above trace to one broken assumption. Supervised learning assumes train and test data are drawn i.i.d. from the same fixed distribution. In control that is false: the policy generates its own test data by acting, so the test distribution is a function of the policy you trained. Train ≠ test, by construction.
| Aspect | i.i.d. supervised learning | Closed-loop control (our reality) |
|---|---|---|
| Test distribution | Fixed, equals train | Generated by the policy; depends on θ |
| Errors over time | Independent per example | Compound: O(εT²) |
| Target structure | Often unimodal label | Multimodal expert actions |
| Fix | More i.i.d. data | DAgger-style on-policy labels + expressive policy |
Lab — Feel covariate shift and mode-averaging yourself
Pure PyTorch + NumPy + Matplotlib. No robot, no simulator, runs on CPU in seconds. The point is to see the theory above with your own eyes before we touch hardware.
```shell
# Python 3.10+; versions are forgiving here, this is a toy.
pip install torch numpy matplotlib
```
1. Generate a multimodal expert dataset. A point starts on the left and must reach a goal on the right, with an obstacle in the middle. The scripted expert picks, per episode and at random, the upper or lower detour. That coin-flip is what injects multimodality: from the same start state, the expert action is sometimes "go up," sometimes "go down."

```python
import numpy as np, torch, torch.nn as nn

np.random.seed(0); torch.manual_seed(0)    # reproducible toy runs

GOAL = np.array([9., 0.]); OBST = np.array([4.5, 0.]); R = 1.4

def expert_action(s, route):
    # route=+1 detours up, route=-1 detours down. Returns a velocity step.
    way = GOAL if s[0] > OBST[0] else OBST + np.array([0., route * 2.2])
    d = way - s; n = np.linalg.norm(d) + 1e-8
    return (d / n) * 0.6                    # fixed-speed step toward the waypoint

def rollout_expert(route, steps=40):
    s = np.array([0., np.random.uniform(-.5, .5)]); S, A = [], []
    for _ in range(steps):
        a = expert_action(s, route)
        S.append(s.copy()); A.append(a)
        s = s + a
    return np.array(S), np.array(A)

S, A = [], []
for ep in range(120):
    # Coin-flip per episode: about half go up, half go down.
    s, a = rollout_expert(route=np.random.choice([-1, 1]))
    S.append(s); A.append(a)
Sd = torch.tensor(np.concatenate(S), dtype=torch.float32)
Ad = torch.tensor(np.concatenate(A), dtype=torch.float32)
print("dataset:", Sd.shape, Ad.shape)       # 4800 (s, a) pairs, bimodal before the obstacle
```
2. Train an MSE behavioral-cloning MLP and roll it out. Standard regression. Then run the policy in closed loop from a fresh start and watch two things: near the obstacle it predicts the average of "up" and "down" (≈ straight = into the obstacle, i.e. mode-averaging), and any small deviation puts it in states the demos never covered (covariate shift).

```python
def mlp(out=2):
    return nn.Sequential(nn.Linear(2, 128), nn.ReLU(),
                         nn.Linear(128, 128), nn.ReLU(),
                         nn.Linear(128, out))

bc = mlp(); opt = torch.optim.Adam(bc.parameters(), 1e-3)
for it in range(3000):                      # pure MSE = the L2 objective from Section D
    pred = bc(Sd); loss = ((pred - Ad) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

def rollout_policy(net, steps=60):
    s = np.array([0., 0.]); traj = [s.copy()]
    for _ in range(steps):
        a = net(torch.tensor(s, dtype=torch.float32)).detach().numpy()
        s = s + a; traj.append(s.copy())
    return np.array(traj)

traj = rollout_policy(bc)
closest = float(((traj - OBST) ** 2).sum(1).min() ** .5)   # min distance to the obstacle centre
hit = closest < R
print("closest approach to obstacle:", round(closest, 2), "| COLLISION" if hit else "| clear")
# Expected: the MSE policy drives ~straight, collides, and/or drifts off into
# states no demo covered. That is mode-averaging + covariate shift, live.
```
3. One DAgger iteration. We have a scripted expert, so we can do the real thing: roll out the current policy, relabel the states it actually visited with the expert's action, aggregate, retrain. Error on those learner-visited states drops.

```python
def dagger_relabel(net, n=40):
    Sn, An = [], []
    for _ in range(n):
        s = np.array([0., np.random.uniform(-.5, .5)])
        route = np.random.choice([-1, 1])              # expert still has a preferred mode
        for _ in range(60):
            a_star = expert_action(s, route)           # label = what the EXPERT would do here
            Sn.append(s.copy()); An.append(a_star)
            a = net(torch.tensor(s, dtype=torch.float32)).detach().numpy()
            s = s + a                                  # but MOVE by the learner's action
    return (torch.tensor(np.array(Sn), dtype=torch.float32),
            torch.tensor(np.array(An), dtype=torch.float32))

def mse(net, S, A):
    with torch.no_grad():
        return float(((net(S) - A) ** 2).mean())

Sx, Ax = dagger_relabel(bc)
print("before DAgger, MSE on learner-visited states:", round(mse(bc, Sx, Ax), 4))
Sd2 = torch.cat([Sd, Sx]); Ad2 = torch.cat([Ad, Ax])   # AGGREGATE
for it in range(3000):                                 # RETRAIN on the union
    loss = ((bc(Sd2) - Ad2) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print("after 1 DAgger round, MSE on the same states:", round(mse(bc, Sx, Ax), 4))
# Expected: the error on learner-visited states drops; the model has now seen its own drift states.
# Note it still mode-averages near the obstacle: DAgger fixes covariate shift,
# NOT multimodality. Those are two different diseases. Hence step 4.
```
4. (Optional) Swap the MSE head for a 2-mode mixture head. Predict two candidate actions plus a logit choosing between them, and train with the "min-over-modes" loss so each mode specializes (this is the toy ancestor of the CVAE/diffusion idea on Page 2). At rollout, sample a mode: the policy now commits to a real detour instead of the dead-center average.

```python
K = 2
head = mlp(out=K * 2 + K)                   # K action means (2D each) + K mixing logits
opt2 = torch.optim.Adam(head.parameters(), 1e-3)
for it in range(4000):
    o = head(Sd); mus = o[:, :K * 2].reshape(-1, K, 2); logit = o[:, K * 2:]
    se = ((mus - Ad[:, None, :]) ** 2).sum(-1)         # (N, K) error to each mode
    win = se.argmin(1)                                 # assign each sample to its closest mode
    reg = se[torch.arange(len(se)), win].mean()        # "min-over-modes" regression loss
    cls = nn.functional.cross_entropy(logit, win)      # learn the mixing weights
    loss = reg + cls
    opt2.zero_grad(); loss.backward(); opt2.step()

def sample_action(s):
    with torch.no_grad():
        o = head(torch.tensor(s, dtype=torch.float32))
    mus = o[:K * 2].reshape(K, 2)
    k = torch.distributions.Categorical(logits=o[K * 2:]).sample()   # a mode is drawn at every step
    return mus[k].numpy()                   # COMMIT to one mode, never the mean

def rollout_head(steps=60):
    s = np.array([0., 0.]); traj = [s.copy()]
    for _ in range(steps):
        s = s + sample_action(s); traj.append(s.copy())
    return np.array(traj)

traj2 = rollout_head()
print("2-mode head, closest approach:", round(float(((traj2 - OBST) ** 2).sum(1).min() ** .5), 2))
# Expected: rolling out sample_action follows a clean upper OR lower detour and
# reaches the goal. Re-run it a few times and you should see BOTH routes recovered.
```
Expected observations
| Step | What you should see | Which theory it shows |
|---|---|---|
| 2 · MSE-BC | Policy drives roughly straight, grazes/hits the obstacle, then drifts into untrained regions | Mode-averaging (§D) + covariate shift (§C) |
| 3 · DAgger | On-policy MSE drops; the policy stops wandering into novel states — but still steers center near the obstacle | DAgger fixes covariate shift, not multimodality (§C) |
| 4 · 2-mode head | Rollouts pick a clean upper or lower route and reach the goal; re-sampling shows both | An expressive multimodal policy (teaser for Page 2) |
Done when
- You can state in one sentence why MSE fails on multimodal data — it outputs 𝔼[a|s], the mean of the modes, which is itself not a valid action.
- You can explain why DAgger reduces compounding error: it trains on the learner's own visited states, closing the train/test distribution gap and tightening the cost bound from O(εT²) toward O(εT).
- You observed empirically that DAgger and the 2-mode head fix different problems (covariate shift vs multimodality) — one does not substitute for the other.
- The 2-mode rollout recovers both detour routes across re-samples, not the average of them.
Cement the theory with the official course code: github.com/mees-robot-learning-course/ethz-course-2026. The Control / MDPs assignment (Lecture 2) and the Imitation Learning assignment (Lecture 3, including a DAgger implementation) are runnable and map one-to-one onto Sections A–C above. Do those before page 2.
Version note: the toy lab above pins no versions; any recent torch ≥ 2.x works. The ETH repo may pin specific versions, so create a fresh venv from its requirements.txt rather than reusing your LeRobot env, to avoid clashes.