Derivations that prove you understand the material. Each problem ties back to a theory
page and to the ETH "Foundations of Robot Learning"–style lecture material. Attempt before
you peek — the work is the point.
How to use this problem set
Treat every card as a closed-book exercise. Read the statement, grab paper, and
derive it yourself before clicking Show solution. These
problems map 1:1 onto the theory pages: when you stall, follow the cross-link back to
the page that teaches the idea, then return and finish the derivation. Badges tell you
the flavour:
core · derivation · systems
core = a result you must be able to reproduce on demand ·
derivation = pencil-and-paper math · systems =
latency/architecture reasoning about the live creature. The three numbered "core"
problems (1–3) are the load-bearing ones; the others cover the remainder of the syllabus.
Theme A
Generative action heads — why we don't just regress
Problem 1 Re-derive the flow-matching target velocity
core · derivation · theory
Take the linear (optimal-transport) interpolation path
xτ = (1−τ)x0 + τ x1 with
x0 ∼ N(0, I) pure noise, x1 the
data action (conditioned on observation o), and τ ∈ [0,1].
(a) Derive the ground-truth conditional velocity dxτ/dτ.
(b) Write the Conditional Flow Matching (CFM) loss.
(c) Give the sampling ODE and one Euler step.
(d) Explain why this is exactly what SmolVLA / π0 train.
The path is linear in τ for a fixed pair (x0, x1).
Differentiate component-wise, treating the endpoints as constants:
xτ = (1−τ)x0 + τx1
⟹ dxτ / dτ = −x0 + x1 = x1 − x0
This is the conditional velocity: it is a constant vector (independent
of τ) pointing straight from the noise sample to the data sample. The whole probability
path from N(0,I) to the data is built by averaging these straight lines over all
pairs — the field a network must learn is the expectation of this constant over
the pairs that pass through a given (xτ, τ).
(b) The Conditional Flow Matching loss
We regress a network vθ onto that target. Sample τ uniformly,
noise x0, a data action x1 with its observation
o, form xτ, and minimise the squared error:
LCFM = 𝔼τ, x0, x1, o ‖ vθ(xτ, τ, o) − (x1 − x0) ‖²
The key fact (Lipman et al., 2023) is that this conditional objective has the
same gradient as the intractable marginal flow-matching objective. Minimising the
per-pair regression therefore makes vθ learn the marginal velocity field
that transports the noise distribution onto the data distribution. We never need the
marginal velocity in closed form — the magic is that regressing the simple constant target
is enough.
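The recipe in (b) can be sketched in a few lines of NumPy. This is a minimal illustration, not the SmolVLA / π0 training code: `cfm_loss` and the `v_theta` callable are hypothetical names, and a real implementation would use an autodiff framework so the loss can be backpropagated.

```python
import numpy as np

def cfm_loss(v_theta, x1, obs, rng):
    """Monte Carlo estimate of the conditional flow-matching loss for one batch.

    v_theta: callable (x_tau, tau, obs) -> predicted velocity, same shape as x1.
    x1:      data actions, shape (batch, dim).
    obs:     observations, passed straight through to v_theta.
    """
    batch = x1.shape[0]
    x0 = rng.standard_normal(x1.shape)             # noise endpoint x0 ~ N(0, I)
    tau = rng.uniform(0.0, 1.0, size=(batch, 1))   # one tau per sample
    x_tau = (1.0 - tau) * x0 + tau * x1            # point on the straight path
    target = x1 - x0                               # constant conditional velocity
    pred = v_theta(x_tau, tau, obs)
    return np.mean(np.sum((pred - target) ** 2, axis=1))
```

A useful sanity check: if every data action equals the same vector c, the field (c − x)/(1 − τ) reproduces the target exactly, so the loss should be numerically zero.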
(c) Sampling: the probability-flow ODE and an Euler step
At inference, start from noise and integrate the learned field forward in τ:
dx / dτ = vθ(x, τ, o), x(0) ∼ N(0, I)
Discretise with N uniform steps of size Δτ = 1/N (explicit Euler):
τk+1 = τk + Δτ
xk+1 = xk + Δτ · vθ(xk, τk, o)
After N steps, xN ≈ a sample from p(action | o).
Because the conditional paths are straight lines, the marginal field is unusually
well-behaved, so a handful of Euler steps (often N = 5–10) already gives clean
samples — that is the speed advantage flow matching has over many-step diffusion.
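The Euler update in (c) is short enough to write out. The sketch below assumes a generic `v_theta(x, tau, obs)` callable; `euler_sample` and `point_mass_field` are illustrative names, and the toy field is only there so the sampler can be checked against a known answer.

```python
import numpy as np

def euler_sample(v_theta, obs, dim, n_steps, rng):
    """Integrate dx/dtau = v_theta(x, tau, obs) from tau = 0 to 1 with explicit Euler."""
    x = rng.standard_normal(dim)       # x(0) ~ N(0, I)
    d_tau = 1.0 / n_steps
    for k in range(n_steps):
        tau = k * d_tau
        x = x + d_tau * v_theta(x, tau, obs)
    return x

def point_mass_field(c):
    """Exact marginal velocity for the toy case where every data action equals c."""
    def v(x, tau, obs):
        return (c - x) / (1.0 - tau)
    return v
```

With the toy field, the last Euler step has step-size ratio Δτ/(1 − τ) = 1, so the sampler lands on c up to floating-point error regardless of the starting noise.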
(d) Why SmolVLA / π0 train exactly this
The action head must output a distribution over action chunks, not
a point — robot demonstrations are multimodal (see Problem 2). Flow matching is a
mode-covering generative objective.
The training target is a plain MSE regression onto
(x1−x0): stable, no adversarial game, no discrete codebook,
plugs straight onto a transformer's hidden state conditioned on the VLM tokens
(o = vision + language + proprioception).
Sampling is a cheap ODE with very few function evaluations, which
matters for a real-time control loop on the LeKiwi — fewer NFEs than DDPM for comparable
quality.
∎
Problem 2 Show MSE collapses a two-mode action distribution
core · derivation · theory
From some state s, the expert action is +a with probability ½
and −a with probability ½ (a symmetric bimodal target — e.g. "go left around the
obstacle" vs "go right"). (a) Prove the L2/MSE-optimal predictor is the
conditional mean 𝔼[a|s] = 0 — it aims at the gap between the modes.
(b) Prove the general fact argmin_f 𝔼‖a − f(s)‖² = 𝔼[a|s].
(c) Say why this dooms unimodal behavioral cloning and motivates
CVAE / diffusion / flow-matching heads.
The two valid modes are ±a; the L2-optimal point estimate lands exactly between
them, straight into the thing both experts were avoiding.
(a) The concrete two-mode case
Let the predictor output a scalar c. The expected squared error is
J(c) = ½(a − c)² + ½(−a − c)²
Differentiate and set to zero:
J′(c) = −(a − c) − (−a − c) = 2c = 0 ⟹ c⋆ = 0
Since J″(c) = 2 > 0, c⋆ = 0 is the
unique minimum. The optimal point prediction is 0 = 𝔼[a|s] — a value the
expert never produced. If +a and −a meant "veer left" / "veer right" around the
obstacle, the averaged action drives straight into it.
(b) The general theorem
For any state s, minimise the expected squared error over predictors f.
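Add and subtract the conditional mean μ(s) := 𝔼[a|s]:
𝔼‖a − f(s)‖² = 𝔼‖a − μ(s)‖² + 2 𝔼[(a − μ(s))ᵀ(μ(s) − f(s))] + 𝔼‖μ(s) − f(s)‖²
Condition the cross term on s. Given s, the factor
μ(s) − f(s) is constant, and 𝔼[a − μ(s) | s] = μ(s) − μ(s) = 0, so by the
tower rule the cross term vanishes:
𝔼[(a − μ(s))ᵀ(μ(s) − f(s))] = 𝔼[ 𝔼[a − μ(s) | s]ᵀ (μ(s) − f(s)) ] = 0
So the objective splits into an irreducible noise term plus a non-negative term that is
zeroed only by f = μ:
𝔼‖a − f(s)‖² = 𝔼‖a − μ(s)‖² + 𝔼‖μ(s) − f(s)‖²
argmin_f 𝔼‖a − f(s)‖² = μ(s) = 𝔼[a | s]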
The least-squares-optimal function is always the conditional mean — full stop.
Part (a) is just this theorem evaluated on the symmetric two-point distribution.
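The two-mode case is easy to confirm numerically. The snippet below is a small check under the assumptions of part (a) (scalar action, modes at ±a with equal weight); the grid and the helper name `expected_sq_error` are illustrative.

```python
import numpy as np

def expected_sq_error(c, a=1.0):
    """J(c) for a target that is +a or -a with probability 1/2 each."""
    return 0.5 * (a - c) ** 2 + 0.5 * (-a - c) ** 2

candidates = np.linspace(-2.0, 2.0, 401)   # grid of candidate point predictions
losses = expected_sq_error(candidates)
best = candidates[np.argmin(losses)]       # grid minimiser of J
```

The grid minimiser sits at c = 0 with J(0) = a², whereas committing to either real mode costs J(±a) = 2a²: MSE actively prefers the action nobody demonstrated.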
(c) Why this dooms unimodal BC, and the fix
A network trained with MSE (the default for naive behavioral cloning) is provably
driven to the conditional mean. When the demonstrations are multimodal — and human
teleop almost always is — the mean lands in a low-probability or invalid region.
Switching to a Gaussian head with learned variance does not save you: a single Gaussian
still has one mode and just inflates variance, smearing probability across the bad gap.
The cure is a head that can represent a multimodal conditional distribution
and from which we sample rather than average:
CVAE (ACT, Problem 5), diffusion (Problem 6), or
flow matching (Problem 1). Each commits to one mode per sample instead of
averaging across modes.
∎
Problem 5 The ACT / CVAE evidence lower bound
derivation · theory
ACT models p(a|s) with a conditional VAE over an action chunk.
(a) Derive the ELBO
log p(a|s) ≥ 𝔼q[log p(a|z,s)] − KL(q(z|a,s)‖p(z)).
(b) Identify the reconstruction and regularization terms and the role of the
reparameterization trick. (c) Relate the chunk to a shorter effective horizon.
The gap between LHS and RHS is exactly KL(q(z|a,s) ‖ p(z|a,s)) ≥ 0, which is why
the inequality holds and is tight when the encoder matches the true posterior.
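For part (a), the bound follows from a standard importance-weighting and Jensen argument. The sketch below assumes, as in the problem statement, that the latent prior p(z) does not depend on s, so that p(a, z | s) = p(a | z, s) p(z).

```latex
\begin{aligned}
\log p(a \mid s)
  &= \log \int p(a \mid z, s)\, p(z)\, dz \\
  &= \log \mathbb{E}_{q(z \mid a, s)}\!\left[ \frac{p(a \mid z, s)\, p(z)}{q(z \mid a, s)} \right] \\
  &\ge \mathbb{E}_{q(z \mid a, s)}\!\left[ \log p(a \mid z, s) + \log \frac{p(z)}{q(z \mid a, s)} \right]
     \quad \text{(Jensen, since } \log \text{ is concave)} \\
  &= \mathbb{E}_{q}\!\left[ \log p(a \mid z, s) \right]
     - \mathrm{KL}\!\left( q(z \mid a, s) \,\|\, p(z) \right).
\end{aligned}
```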
(b) Reading the two terms + reparameterization
Reconstruction, 𝔼q[log p(a|z,s)]: decode the action
chunk from the sampled style z and the state; under a Gaussian decoder this is an
L2 reconstruction loss on the chunk (a Laplace decoder gives the L1 loss that ACT uses).
Regularization, KL(q(z|a,s)‖p(z)): keeps the encoder's
posterior near the standard-normal prior so that, at test time, sampling
z ∼ N(0,I) yields coherent chunks. ACT scales this term by a weight β
(a tunable hyperparameter; confirm the value in your config).
Reparameterization trick: to backprop through the sampling
z ∼ q, write z = μφ(a,s) + σφ(a,s) ⊙ ε with
ε ∼ N(0,I). The randomness moves into ε, so gradients flow through
μφ, σφ deterministically.
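A minimal NumPy sketch of the two ingredients in (b), assuming a diagonal-Gaussian encoder parameterised by `mu` and `log_var` (both names are illustrative). NumPy has no autodiff, so this only shows the forward computation; in a training framework the same expressions let gradients reach `mu` and `log_var`.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)       # all randomness lives in eps
    return mu + np.exp(0.5 * log_var) * eps   # sigma = exp(log_var / 2)

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=-1)
```

The KL is zero exactly when mu = 0 and log_var = 0, which is the encoder collapsing onto the prior.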
(c) Chunking shortens the effective horizon
The decoder emits a whole chunk of k actions per latent draw. One policy
decision now advances the controller k steps, cutting the number of
autoregressive decision points over a horizon T from T to T/k —
which directly tempers compounding error (made quantitative in Problem 9). The CVAE handles
multimodality; chunking handles horizon. At inference you can discard the encoder entirely
and just sample z ∼ p(z).
∎
Problem 6 Diffusion Policy: the noise-prediction objective
derivation · theory
Start from the DDPM forward process
q(at|a0) = N(√ᾱt a0, (1−ᾱt)I).
(a) Derive/justify the noise-prediction training objective
‖ε − εθ(at,t,o)‖². (b) Describe a single
reverse denoising step. (c) Contrast NFE / inference cost with flow matching.
(a) From the closed-form forward process to the loss
The forward marginal lets us jump to any noise level t in one shot using the
reparameterization
at = √ᾱt · a0 + √(1−ᾱt) · ε, ε ∼ N(0, I)
The principled objective is the variational bound on log p(a0|o), which
for Gaussian transitions reduces to predicting the posterior mean of each reverse step.
Ho et al. (2020) show that, after dropping the t-dependent weights, this reduces to
the simple denoising-score / noise-prediction regression: have the network guess the noise
ε that was added.
Lsimple = 𝔼t, a0, ε, o ‖ ε − εθ(at, t, o) ‖²
Because ε determines a0 given at, predicting
the noise is equivalent (up to scaling) to predicting the clean action or the score
∇at log q(at). Conditioning on the observation
o makes it a conditional generative head for actions.
(b) One reverse denoising step
Given at, predict the noise, take a step toward less noise, and inject
a little fresh noise (DDPM ancestral sampling):
at−1 = (1/√αt) ( at − ((1−αt)/√(1−ᾱt)) εθ(at,t,o) ) + σtz, z ∼ N(0,I)
Iterating from t = T down to t = 1 walks pure noise back to a clean action
chunk. (A DDIM variant drops the stochastic term for a deterministic ODE-like trajectory.)
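The forward jump from (a) and the reverse step from (b) can be sketched as follows. The function names are illustrative, the schedule values are passed in explicitly rather than taken from any particular library, and `eps_hat` stands in for the network output εθ(at, t, o).

```python
import numpy as np

def forward_noise(a0, alpha_bar_t, rng):
    """Sample a_t ~ q(a_t | a_0) in one shot; also return the noise that was used."""
    eps = rng.standard_normal(a0.shape)
    a_t = np.sqrt(alpha_bar_t) * a0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return a_t, eps

def reverse_step(a_t, eps_hat, alpha_t, alpha_bar_t, sigma_t, rng):
    """One DDPM ancestral step a_t -> a_{t-1}, given the predicted noise eps_hat."""
    mean = (a_t - (1.0 - alpha_t) / np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_t)
    z = rng.standard_normal(a_t.shape)   # fresh noise; sigma_t = 0 makes the step deterministic
    return mean + sigma_t * z
```

At the first noise level (where αt = ᾱt) with a perfect noise prediction and σt = 0, the reverse step recovers a0 exactly, which is a quick way to check the coefficients.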
(c) NFE / cost vs flow matching
| Aspect | Diffusion Policy (DDPM) | Flow matching (P1) |
| --- | --- | --- |
| Target | noise ε per step | velocity x1 − x0 |
| Path geometry | curved (variance schedule) | straight lines |
| NFE at inference | tens–hundreds (DDIM cuts to ~10–50) | often 5–10 Euler steps |
| Real-time fit | heavier per action | lighter (why SmolVLA/π0 chose it) |
Both are mode-covering generative heads that fix the MSE collapse of Problem 2. Flow
matching's straight conditional paths make the marginal field easier to integrate, so it
usually hits the same sample quality in far fewer function evaluations — decisive for the
creature's control loop.
∎
Problem 9 Action chunking & compounding error
derivation · systems
Show quantitatively how predicting chunks of length k cuts the number of
decision points over a horizon T from T to T/k, and thus tempers
compounding error. Discuss the smoothness / latency trade-off and the n_action_steps
gotcha.
If the policy is queried every step, a horizon T contains T independent
chances to step off the expert distribution. Predict a chunk of k actions per query
and execute the whole chunk open-loop, and the policy is consulted at only
D = ⌈ T / k ⌉ decision points (vs T)
Why it tempers compounding error
Connect to Problem 4: BC's worst-case cost scales like the per-decision error rate ε times
the square of the number of decision points. Re-derived over the chunked horizon, the relevant
count drops from T to T/k, so the compounding penalty shrinks roughly by a
factor of k² in the worst case — fewer junctions at which a single mistake can snowball
into off-distribution states. Empirically (ACT) this is a big part of why chunking works.
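The counting argument is simple enough to tabulate. The helpers below are an illustrative sketch: they assume the per-decision error rate ε is unchanged by chunking and keep only the ε · D² scaling, dropping constants.

```python
import math

def decision_points(horizon, chunk_len):
    """Number of policy queries when each query yields chunk_len open-loop actions."""
    return math.ceil(horizon / chunk_len)

def worst_case_penalty(eps, horizon, chunk_len):
    """Order-of-magnitude eps * D^2 penalty; constants are deliberately dropped."""
    d = decision_points(horizon, chunk_len)
    return eps * d ** 2
```

For example, with T = 400 and k = 50 the policy is queried 8 times instead of 400, and the ε · D² term shrinks by k² = 2500.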
The trade-offs
Smoothness vs reactivity: long open-loop chunks are smooth but blind —
the robot can't react to a disturbance mid-chunk. Short chunks are reactive but reintroduce
decision points (and jitter at chunk boundaries). Temporal ensembling / action
overlap blends chunk boundaries to keep motion smooth.
Latency: a fresh chunk is computed only every k control ticks,
amortising the (expensive) policy forward pass — relevant to the fast/slow split in
Problem 10.
The n_action_steps gotcha
In LeRobot, the model may predict a chunk of size
chunk_size but you choose how many of those to actually execute via
n_action_steps before re-querying. Setting n_action_steps equal to
the full predicted horizon makes the policy maximally open-loop (smooth but unreactive);
setting it to 1 re-plans every tick (reactive but heavy, and discards most of the predicted
chunk). Mis-matching these two is a classic source of either jittery or sluggish behaviour.
Version-sensitive: confirm the field names against your installed LeRobot.
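To make the trade-off concrete, here is a toy executor that counts policy queries. It is a hypothetical stand-in, not LeRobot code: only the two parameter names `chunk_size` and `n_action_steps` are borrowed from the text above, and the "policy" is a placeholder list.

```python
from collections import deque

def count_policy_queries(total_ticks, chunk_size, n_action_steps):
    """Count how often a (hypothetical) chunked policy is queried over total_ticks.

    Each query predicts chunk_size actions; only the first n_action_steps are
    queued for execution, and the policy is re-queried when the queue is empty.
    """
    if not 1 <= n_action_steps <= chunk_size:
        raise ValueError("need 1 <= n_action_steps <= chunk_size")
    queue = deque()
    queries = 0
    for _ in range(total_ticks):
        if not queue:
            predicted = list(range(chunk_size))        # stand-in for a predicted chunk
            queue.extend(predicted[:n_action_steps])   # the tail of the chunk is discarded
            queries += 1
        queue.popleft()                                # execute one action this tick
    return queries
```

Over 100 ticks with a 50-step chunk, executing the full chunk needs 2 queries, while `n_action_steps = 1` needs 100 and throws away 49 of every 50 predicted actions.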
∎
Theme B
Imitation theory — covariate shift
Problem 4 Covariate shift: the O(εT²) cost of behavioral cloning
derivation · theory
Argue/derive why behavioral cloning incurs O(εT²) expected cost over a
horizon T while the expert is O(εT), and how DAgger restores ~O(εT).
Suppose BC, evaluated on the expert's own state distribution, makes a mistake
(an action the expert would not take) with probability at most ε per step. Assume a per-step
cost in [0,1] and that one mistake can cost up to 1 per remaining step.
The quadratic blow-up
The trouble is distribution shift: the moment the learner errs, it visits states
the expert never demonstrated, where the error rate is no longer bounded by ε. Bound it by
the step at which the first mistake happens. The probability the first mistake lands at step
t is ≈ ε, and from there up to T − t further steps can each incur cost ~1:
𝔼[cost] ≲ Σ_{t=1}^{T} ε · (T − t) = ε · T(T − 1)/2 = O(εT²)
(Ross & Bagnell, 2010.) On the expert's own state distribution, by contrast, the same
per-step error rate costs only O(εT): error accrues linearly, not quadratically. The
extra factor of T is the price of compounding: errors push you
off-distribution, which raises the future error rate, which pushes you further off.
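The first-mistake argument can be evaluated directly. This sketch assumes the simplified model in the text (first-mistake probability ≈ ε at each step, cost 1 on every remaining step); the function names are illustrative.

```python
def bc_cost_bound(eps, horizon):
    """Sum over first-mistake times t of eps * (remaining steps T - t)."""
    return eps * sum(horizon - t for t in range(1, horizon + 1))

def on_distribution_cost(eps, horizon):
    """Linear reference: at most eps expected cost per step, with no compounding."""
    return eps * horizon
```

With ε = 0.01, doubling the horizon from 100 to 200 roughly quadruples the behavioral-cloning bound while the on-distribution reference only doubles, which is the T² versus T gap in numbers.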
How DAgger restores ~O(εT)
Roll out the learner, collecting the states it actually visits
(including off-distribution ones).
Query the expert for the correct action on those states, and
aggregate this into the dataset (DAgger = Dataset Aggregation).
Retrain and repeat. Training and test distributions now coincide, so the
ε bound holds on the states actually encountered and the cost collapses back to the
near-linear O(εT) regime (no-regret online-learning guarantee).
Build link
DAgger needs an interactive expert, which is awkward for teleop.
The chunking trick of Problem 9 attacks the same T² term from the other side — fewer
decision points — which is why we lean on chunked generative heads rather than full DAgger
loops in the 14-day sprint.
∎
Problem 11 Symbol grounding in a VLA (bonus)
systems · theory
Explain how a VLA grounds the language token "red cup" into an action distribution,
and why end-to-end imitation sidesteps an explicit perception → symbol → plan pipeline (tie to
Brooks, "Intelligence without Representation").
A VLA is a conditional policy p(action | image, instruction, proprio). The token
sequence "pick up the red cup" enters the same transformer as the image patches and joint
states. Cross-attention lets the language tokens select the visual features that
co-occur with the action of grasping that object in the demonstrations. Nothing in the network
ever materialises a discrete symbol RED_CUP with a typed slot; "red cup" is
grounded purely as the statistical hinge that, conditioned on these pixels, shifts the
learned action distribution toward the trajectories that reached for the red cup during
training. Grounding = a conditioning effect on the output distribution, learned end-to-end.
Why end-to-end skips perception → symbol → plan
The classical pipeline (detect → build a symbolic world model → plan over symbols →
execute) has brittle seams: each stage's errors compound, and the symbol vocabulary has to
be designed up front.
End-to-end imitation collapses all stages into one differentiable map from sensors to
actions, optimised jointly. There is no explicit world model to keep consistent — the policy
only needs the regularities that the task actually exploits.
This is Brooks' "Intelligence without Representation": competent behaviour can
emerge from a tight sensorimotor coupling without a central reconstructed model of the world.
The VLA is the modern, learned incarnation — "the world is its own best model," now grounded
in pretrained vision-language features rather than hand-built behaviours.
Honest caveat
No explicit symbols also means no easy guarantees, no clean
interpretability of why "red cup" mapped where it did, and generalisation only as far
as the data + pretraining support. The creature trades legibility for end-to-end robustness.
∎
Theme C
World models — learning to imagine
Problem 7 RSSM / world-model KL: why the prior chases the posterior
derivation · theory
Write the world-model ELBO with the reconstruction term plus
KL(q(zt|ht,xt) ‖ p(zt|ht)), and
explain why making the prior match the posterior is exactly what enables latent
imagination (rolling forward with no observations).
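A deterministic recurrent state ht = f(ht−1, zt−1, at−1) carries history.
Two heads read it: the prior p(zt|ht) predicts the stochastic latent from history alone;
the posterior q(zt|ht,xt) corrects it
using the current observation xt. Summed over the sequence, the bound is
log p(x1:T | a1:T) ≥ Σt 𝔼q[ log p(xt|ht,zt) − KL(q(zt|ht,xt) ‖ p(zt|ht)) ]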
Same Jensen-bound structure as Problem 5: a reconstruction term that forces
the latent to retain enough about the observation to predict it, minus a KL
term that aligns the observation-free prior with the observation-informed posterior.
Why the KL is the whole point
Minimising KL(q ‖ p) trains the prior to predict what the posterior
would have inferred — i.e. to guess the next latent without seeing the next
observation.
Once the prior is a good stand-in for the posterior, you can drop the observation entirely
and roll the dynamics in latent space: sample zt ∼ p(zt|ht),
update ht+1, repeat. That is latent imagination (Dreamer)
— generating long synthetic rollouts to train a policy/value with no environment steps.
If the prior did not match the posterior, imagined trajectories would diverge from
what real observations would have produced, and any policy trained on them would fail on the
real robot. The KL term is precisely the contract that keeps imagination faithful.
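The imagination loop itself is only a few lines. The sketch below is a toy under stated assumptions: `step_fn` and `prior_fn` are hypothetical callables standing in for the learned recurrent update and prior head, and the prior is taken to be a diagonal Gaussian.

```python
import numpy as np

def imagine(h0, z0, actions, step_fn, prior_fn, rng):
    """Roll an RSSM-style model forward in latent space using only the prior.

    step_fn(h, z, a) -> next deterministic state   (hypothetical callable)
    prior_fn(h)      -> (mean, std) of p(z | h)     (hypothetical callable)
    """
    h, z = h0, z0
    trajectory = []
    for a in actions:
        h = step_fn(h, z, a)                               # h_t = f(h_{t-1}, z_{t-1}, a_{t-1})
        mean, std = prior_fn(h)
        z = mean + std * rng.standard_normal(mean.shape)   # z_t ~ p(z_t | h_t), no observation
        trajectory.append((h, z))
    return trajectory
```

No observation appears anywhere in the loop: every latent comes from the prior, which is why the quality of the rollout depends entirely on how well the KL term trained that prior.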
∎
Theme D
Voice & interaction — codecs and latency
Problem 3 Derive the Moshi latency budget from the frame rate
core · systems
Mimi (Moshi's neural codec) operates at 12.5 Hz.
(a) Show each frame = 1/12.5 = 80 ms. (b) Show that with one
frame of processing / lookahead the theoretical end-to-end latency ≈ 80 + 80 = 160 ms.
(c) Discuss what adds to this in practice, and how full-duplex (no turn-taking
gate) keeps the felt latency low.
The frame rate is the number of codec frames per second; the frame period is its
reciprocal:
Tframe = 1 / f = 1 / 12.5 Hz = 0.08 s = 80 ms
So Mimi emits one set of audio tokens every 80 ms, and the language model runs at this same
12.5 Hz token clock.
(b) Theoretical end-to-end latency
Two unavoidable 80 ms quanta dominate the ideal pipeline:
Te2e ≈ Tin + Tout = 80 ms (fill an input frame) + 80 ms (one frame of processing / lookahead) = 160 ms
You must accumulate ~one frame of incoming audio before there is anything to encode, and the
model needs at least one frame of compute/lookahead before it can emit the first output frame.
160 ms is the architectural floor implied purely by the 12.5 Hz clock.
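The arithmetic in (a) and (b) as a tiny helper. The function names and the `extra_frames` parameter are illustrative; the model is just "one frame to fill the input plus some number of frames of processing or lookahead".

```python
def frame_period_ms(frame_rate_hz):
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz

def latency_floor_ms(frame_rate_hz, extra_frames=1):
    """One frame to fill the input buffer plus extra_frames of processing / lookahead."""
    return (1 + extra_frames) * frame_period_ms(frame_rate_hz)
```

At 12.5 Hz this gives an 80 ms frame and a 160 ms floor; each additional lookahead frame adds another 80 ms.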
(c) What adds to it in practice, and the full-duplex win
Real additions: model forward-pass time (must beat the 80 ms budget to
run real-time), audio buffering on capture/playback, network/RTC jitter buffers, OS audio
stack, and any extra lookahead frames the codec uses. Real systems land higher than 160 ms.
Full-duplex is the differentiator: classic voice assistants insert a
turn-taking gate — VAD waits for end-of-speech, then ASR → LLM → TTS run in sequence,
stacking hundreds of ms of silence-detection and pipeline delay on top of compute. Moshi
instead models both streams (user + assistant) at the same 12.5 Hz clock
simultaneously, so it can listen and speak at once.
With no end-of-turn gate, the model can start responding, back-channel
("mm-hm"), or interrupt within roughly one frame of the user's audio. The
felt latency approaches the 160 ms floor instead of the multi-second feel of
gated cascades — which is exactly what makes the creature feel alive.
∎
Problem 8 RVQ codec: bitrate and semantic vs acoustic tokens
derivation · systems
For residual vector quantization with L codebooks, write
x ≈ Σ_i q_i with residuals r_i = r_{i−1} − q_i.
(a) Compute the bitrate for codebook size K and frame rate f.
(b) Plug in Mimi-like numbers. (c) Explain semantic vs acoustic
tokens.
Residual VQ quantizes the vector in successive refinement stages. Start with
r_0 = x; at stage i, pick the nearest entry q_i
from codebook C_i to the current residual, then subtract it:
q_i = argmin_{c ∈ C_i} ‖ r_{i−1} − c ‖, r_i = r_{i−1} − q_i, x ≈ Σ_{i=1}^{L} q_i
Each of the L stages emits one index into a codebook of size K, costing
log2(K) bits. At f frames per second:
bits/s = L · log2(K) · f
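The greedy quantization loop can be sketched in NumPy. This is an illustrative encoder with arbitrary codebooks passed in as arrays, not Mimi's implementation; the name `rvq_encode` is hypothetical and `x` is assumed to be a float vector.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual VQ: return the chosen indices, the reconstruction, and the final residual.

    codebooks: list of L arrays, each of shape (K, dim).
    """
    residual = x.copy()                        # r_0 = x
    indices, recon = [], np.zeros_like(x)
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)
        i = int(np.argmin(dists))              # nearest entry to the current residual
        indices.append(i)
        recon += cb[i]                         # running sum of q_i
        residual = residual - cb[i]            # r_i = r_{i-1} - q_i
    return indices, recon, residual
```

By construction x equals the reconstruction plus the final residual, and only the L integer indices need to be transmitted per frame.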
(b) Mimi-like numbers
Take L = 8 codebooks, K = 2048 (so log2(K) = 11 bits),
f = 12.5 Hz:
bits/s = 8 · 11 · 12.5 = 1100 bit/s ≈ 1.1 kbps
That is the headline: telephone-quality-and-up speech tokenised at roughly 1 kbps, which is
what makes it cheap enough for an LLM to model audio token-by-token at the 80 ms clock. (Exact
Mimi config is version-sensitive — confirm L, K against the release you run.)
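The bitrate formula as a one-line helper, handy for trying other configurations. The function name is illustrative and the numbers plugged in are the ones quoted above, not read from any release.

```python
import math

def rvq_bitrate(num_codebooks, codebook_size, frame_rate_hz):
    """Bits per second for an RVQ token stream: L * log2(K) * f."""
    return num_codebooks * math.log2(codebook_size) * frame_rate_hz
```

Calling `rvq_bitrate(8, 2048, 12.5)` reproduces the 1100 bit/s figure; halving the number of codebooks halves the bitrate at the cost of acoustic detail.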
(c) Semantic vs acoustic tokens
Semantic token: the first codebook is distilled to align with a
self-supervised speech model (e.g. WavLM), so it carries linguistic / content
information — what was said. This makes the token stream easy for a language model to reason
over.
Acoustic tokens: the remaining residual codebooks capture
how it sounds — speaker timbre, prosody, recording conditions — the detail needed to
reconstruct natural-sounding audio.
Splitting the two lets one model predict meaning and generate faithful audio
from the same low-bitrate stream — the foundation of Moshi's full-duplex speech (Problem 3).
∎
Theme E
Systems — making the creature feel alive
Problem 10 Fast/slow decoupling: protecting the latency budget
systems
A fast loop ticks every ~200 ms; a slow VLA loop takes ~1–3 s per call. Show why running them
synchronously destroys the conversational latency budget, and quantify the felt-latency
benefit of running them async with phase reporting.
Sync couples the conversational loop to the slowest component; async lets the fast
loop hide the VLA's latency behind continuous presence.
Why synchronous coupling is fatal
If the fast loop must wait for each VLA call to return before it can act, the effective tick
period of the whole creature becomes the slow loop's period:
Tsync = max(Tfast, Tslow) = max(200 ms, 1–3 s) = 1–3 s per response
Compare against the conversational floor from Problem 3 (~160 ms). A 1–3 s gate is
6–19× over budget: the creature freezes — no voice, no micro-motion — every
time it "thinks." Felt latency is dominated by the slowest component.
Async with phase reporting
Decouple the loops onto separate clocks. The fast loop keeps running at 200 ms (voice
back-channels, gaze, idle sway) while the VLA computes in the background; when the new
action chunk is ready it is swapped in. The user-perceived latency of the interactive
channel becomes the fast period, independent of the VLA:
Tasync ≈ Tfast = 200 ms, versus Tsync = 1–3 s (a 5–15× reduction)
Phase reporting closes the gap that remains: while the VLA runs, the fast
loop emits a "thinking…" cue (a verbal "let me look", a head tilt, a status light). The delay
becomes legible rather than dead air — perceived responsiveness rises even further
than the raw number.
Staleness vs responsiveness: the executed action chunk may be up to one
slow-loop period old. Keep chunks short enough (Problem 9) that mild staleness is harmless,
and let the fast loop handle anything reactive.
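The decoupling can be sketched with `asyncio`. This is a toy under stated assumptions: `slow_vla` is a placeholder that just sleeps, `run_creature` is a hypothetical name, and a real system would more likely run the VLA in a separate process or on a separate device.

```python
import asyncio

async def slow_vla(duration_s):
    """Stand-in for a slow VLA forward pass that returns a new action chunk."""
    await asyncio.sleep(duration_s)
    return "new_chunk"

async def run_creature(fast_period_s, slow_duration_s):
    """Keep the fast loop ticking while one slow call runs in the background."""
    task = asyncio.create_task(slow_vla(slow_duration_s))
    fast_ticks = 0
    while not task.done():
        fast_ticks += 1                      # e.g. back-channel, gaze, "thinking" cue
        await asyncio.sleep(fast_period_s)
    return fast_ticks, task.result()         # swap in the chunk once it is ready
```

Running `asyncio.run(run_creature(0.2, 2.0))` lets the fast loop tick several times while a single slow call is in flight, instead of blocking for the full two seconds.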
∎
Where to go next
If a derivation didn't click, the cross-link on each card points to
the page that builds the idea from scratch. When all eleven feel routine, you can reconstruct the
entire ANIMA-Kiwi stack — generative action heads, imitation theory, world models, the speech
codec, and the fast/slow orchestration — from first principles. That is the bar.