Page 8 · Capstone

Problem Set

Derivations that prove you understand it. Each problem ties back to a theory page and to the ETH "Foundations of Robot Learning"–style lecture material. Attempt before you peek — the work is the point.

How to use this problem set

Treat every card as a closed-book exercise. Read the statement, grab paper, and derive it yourself before clicking Show solution. These problems map 1:1 onto the theory pages: when you stall, follow the cross-link back to the page that teaches the idea, then return and finish the derivation. Badges tell you the flavour:

core derivation systems

core = a result you must be able to reproduce on demand · derivation = pencil-and-paper math · systems = latency/architecture reasoning about the live creature. The three "core" problems (1–3) are the load-bearing ones; the others sweep the remainder of the syllabus.

Theme A

Generative action heads — why we don't just regress

Problem 1 Re-derive the flow-matching target velocity

core derivation theory

Take the linear (optimal-transport) interpolation path xτ = (1−τ)x0 + τ x1 with x0 ∼ N(0, I) pure noise, x1 the data action (conditioned on observation o), and τ ∈ [0,1]. (a) Derive the ground-truth conditional velocity dxτ/dτ. (b) Write the Conditional Flow Matching (CFM) loss. (c) Give the sampling ODE and one Euler step. (d) Explain why this is exactly what SmolVLA / π0 train.

Show solution

(a) The conditional velocity

The path is linear in τ for a fixed pair (x0, x1). Differentiate component-wise, treating the endpoints as constants:

xτ = (1−τ)x0 + τx1
⟹  dxτ / dτ = −x0 + x1 = x1 − x0

This is the conditional velocity: it is a constant vector (independent of τ) pointing straight from the noise sample to the data sample. The whole probability path from N(0,I) to the data is built by averaging these straight lines over all pairs — the field a network must learn is the expectation of this constant over the pairs that pass through a given (xτ, τ).

(b) The Conditional Flow Matching loss

We regress a network vθ onto that target. Sample τ uniformly, noise x0, a data action x1 with its observation o, form xτ, and minimise the squared error:

LCFM = 𝔼τ, x0, x1, o  ‖ vθ(xτ, τ, o) − (x1 − x0) ‖²

The key fact (Lipman et al., 2023) is that this conditional objective has the same gradient as the intractable marginal flow-matching objective. Minimising the per-pair regression therefore makes vθ learn the marginal velocity field that transports the noise distribution onto the data distribution. We never need the marginal velocity in closed form — the magic is that regressing the simple constant target is enough.
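To make parts (a) and (b) concrete, here is a minimal NumPy sketch, not anyone's training code. The helper names `cfm_pair` and `cfm_loss` are made up for illustration, the "data actions" are random stand-ins, and the observation o is omitted. It builds xτ and the regression target for a small batch, then checks by finite differences that the target really is dxτ/dτ.

```python
import numpy as np

def cfm_pair(x0, x1, tau):
    """Return the interpolated point x_tau and the CFM regression target."""
    tau = np.asarray(tau)[..., None]        # broadcast tau over the action dimension
    x_tau = (1.0 - tau) * x0 + tau * x1     # linear path between noise and data
    target = x1 - x0                        # conditional velocity, constant in tau
    return x_tau, target

def cfm_loss(v_pred, target):
    """Squared L2 error per sample, averaged over the batch."""
    return np.mean(np.sum((v_pred - target) ** 2, axis=-1))

rng = np.random.default_rng(0)
batch, dim = 4, 3
x0 = rng.standard_normal((batch, dim))      # noise sample
x1 = rng.standard_normal((batch, dim))      # stand-in for data actions (assumption)
tau = rng.uniform(0.0, 1.0, size=batch)

x_tau, target = cfm_pair(x0, x1, tau)

# Finite-difference check of part (a): (x_{tau+h} - x_tau) / h should equal x1 - x0.
h = 1e-6
x_tau_h, _ = cfm_pair(x0, x1, tau + h)
fd_velocity = (x_tau_h - x_tau) / h
```

A real head would replace the `v_pred` argument of `cfm_loss` with the network output vθ(xτ, τ, o).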

(c) Sampling: the probability-flow ODE and an Euler step

At inference, start from noise and integrate the learned field forward in τ:

dx / dτ = vθ(x, τ, o),   x(0) ∼ N(0, I)

Discretise with N uniform steps of size Δτ = 1/N (explicit Euler):

τk+1 = τk + Δτ
xk+1 = xk + Δτ · vθ(xk, τk, o)

After N steps, xN ≈ a sample from p(action | o). Because the conditional paths are straight lines, the marginal field is unusually well-behaved, so a handful of Euler steps (often N = 5–10) already gives clean samples — that is the speed advantage flow matching has over many-step diffusion.
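The Euler loop in part (c) can be sketched in a few lines. The toy field below is an assumption chosen so the answer is checkable: if the data is a single point m, the marginal velocity has the closed form v(x, τ) = (m − x)/(1 − τ), and every trajectory is a straight line, so Euler lands on m for any N. The names `euler_sample` and `toy_velocity` are illustrative, and observation conditioning is left out.

```python
import numpy as np

def euler_sample(velocity, x_init, n_steps):
    """Integrate dx/dtau = velocity(x, tau) from tau = 0 to tau = 1 with explicit Euler."""
    x = np.array(x_init, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        tau = k * dt                    # tau_k; the loop never evaluates the field at tau = 1
        x = x + dt * velocity(x, tau)   # x_{k+1} = x_k + dt * v(x_k, tau_k)
    return x

# Toy assumption: the "data distribution" is a point mass at m.
m = np.array([0.5, -1.0])

def toy_velocity(x, tau):
    """Closed-form marginal velocity for point-mass data under the linear path."""
    return (m - x) / (1.0 - tau)

rng = np.random.default_rng(0)
x_start = rng.standard_normal((5, 2))   # x(0) ~ N(0, I)
samples = euler_sample(toy_velocity, x_start, n_steps=5)
```

With a learned vθ the paths are only approximately straight, which is why a handful of steps is used rather than one.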

(d) Why SmolVLA / π0 train exactly this

  • The action head must output a distribution over action chunks, not a point — robot demonstrations are multimodal (see Problem 2). Flow matching is a mode-covering generative objective.
  • The training target is a plain MSE regression onto (x1−x0): stable, no adversarial game, no discrete codebook, plugs straight onto a transformer's hidden state conditioned on the VLM tokens (o = vision + language + proprioception).
  • Sampling is a cheap ODE with very few function evaluations, which matters for a real-time control loop on the LeKiwi — fewer NFEs than DDPM for comparable quality.

Problem 2 Show MSE collapses a two-mode action distribution

core derivation theory

From some state s, the expert action is +a with probability ½ and −a with probability ½ (a symmetric bimodal target — e.g. "go left around the obstacle" vs "go right"). (a) Prove the L2/MSE-optimal predictor is the conditional mean 𝔼[a|s] = 0 — it aims at the gap between the modes. (b) Prove the general fact argminf 𝔼‖a − f(s)‖² = 𝔼[a|s]. (c) Say why this dooms unimodal behavioral cloning and motivates CVAE / diffusion / flow-matching heads.

[Figure: bimodal action density p(a|s) with modes at −a and +a; the MSE predictor 𝔼[a|s] = 0 falls in the gap between them, where the obstacle sits.]
The two valid modes are ±a; the L2-optimal point estimate lands exactly between them — straight into the thing both experts were avoiding.
Show solution

(a) The concrete two-mode case

Let the predictor output a scalar c. The expected squared error is

J(c) = ½(a − c)² + ½(−a − c)²

Differentiate and set to zero:

J′(c) = −(a − c) − (−a − c) = 2c = 0  ⟹  c = 0

Since J″(c) = 2 > 0, c = 0 is the unique minimum. The optimal point prediction is 0 = 𝔼[a|s] — a value the expert never produced. If +a and −a meant "veer left" / "veer right" around the obstacle, the averaged action drives straight into it.

(b) The general theorem

For any state s, minimise the expected squared error over predictors f. Add and subtract the conditional mean μ(s) := 𝔼[a|s]:

𝔼‖a − f(s)‖² = 𝔼‖(a − μ(s)) + (μ(s) − f(s))‖²
= 𝔼‖a − μ(s)‖² + 𝔼‖μ(s) − f(s)‖² + 2 𝔼⟨a − μ(s),  μ(s) − f(s)⟩

Condition the cross term on s. Given s, the factor μ(s) − f(s) is constant, and 𝔼[a − μ(s) | s] = μ(s) − μ(s) = 0, so by the tower rule the cross term vanishes:

𝔼⟨a−μ(s), μ(s)−f(s)⟩ = 𝔼s[ ⟨𝔼[a−μ(s)|s],  μ(s)−f(s)⟩ ] = 0

So the objective splits into an irreducible noise term plus a non-negative term that is zeroed only by f = μ:

argmin_f  𝔼‖a − f(s)‖² = μ(s) = 𝔼[a | s]

The least-squares-optimal function is always the conditional mean — full stop. Part (a) is just this theorem evaluated on the symmetric two-point distribution.
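A quick numerical sanity check of parts (a) and (b), as a sketch on synthetic data: build a balanced set of ±a expert actions, scan constant predictions c on a grid, and confirm that the empirical MSE is minimised at the mean. The value a = 1 and the grid are arbitrary choices.

```python
import numpy as np

a = 1.0
actions = np.array([+a, -a] * 500)      # balanced two-mode expert data

def mse(c):
    """Empirical squared error of the constant prediction c."""
    return np.mean((actions - c) ** 2)

# Scan candidate constant predictions and pick the one with lowest MSE.
grid = np.linspace(-1.5, 1.5, 301)
losses = np.array([mse(c) for c in grid])
best_c = grid[np.argmin(losses)]

# Analytically J(c) = a^2 + c^2, so J(0) = a^2 while committing to a mode costs J(a) = 2 a^2.
loss_at_mean = mse(0.0)
loss_at_mode = mse(a)
```

The "safe" answer c = 0 has half the squared error of either real mode, which is exactly why MSE training prefers it.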

(c) Why this dooms unimodal BC, and the fix

  • A network trained with MSE (the default for naive behavioral cloning) is provably driven to the conditional mean. When the demonstrations are multimodal — and human teleop almost always is — the mean lands in a low-probability or invalid region.
  • Switching to a Gaussian head with learned variance does not save you: a single Gaussian still has one mode and just inflates variance, smearing probability across the bad gap.
  • The cure is a head that can represent a multimodal conditional distribution and from which we sample rather than average: CVAE (ACT, Problem 5), diffusion (Problem 6), or flow matching (Problem 1). Each commits to one mode per sample instead of averaging across modes.

Problem 5 The ACT / CVAE evidence lower bound

derivation theory

ACT models p(a|s) with a conditional VAE over an action chunk. (a) Derive the ELBO log p(a|s) ≥ 𝔼q[log p(a|z,s)] − KL(q(z|a,s)‖p(z)). (b) Identify the reconstruction and regularization terms and the role of the reparameterization trick. (c) Relate the chunk to a shorter effective horizon.

Show solution

(a) Deriving the bound

Introduce a latent style variable z with prior p(z)=N(0,I) and an encoder (recognition model) q(z|a,s). Start from the log-evidence and insert q:

log p(a|s) = log ∫ p(a,z|s) dz = log 𝔼q(z|a,s) [ p(a,z|s) / q(z|a,s) ]

Apply Jensen's inequality (log is concave) to pull the expectation out:

≥ 𝔼q [ log p(a,z|s) − log q(z|a,s) ]
= 𝔼q[ log p(a|z,s) ] + 𝔼q[ log p(z) − log q(z|a,s) ]
= 𝔼q[ log p(a|z,s) ] − KL( q(z|a,s) ‖ p(z) )

The gap between LHS and RHS is exactly KL(q(z|a,s) ‖ p(z|a,s)) ≥ 0, which is why the inequality holds and is tight when the encoder matches the true posterior.

(b) Reading the two terms + reparameterization

  • Reconstruction 𝔼q[log p(a|z,s)]: decode the action chunk from the sampled style z and the state; under a Gaussian decoder this is the familiar L2 reconstruction loss on the chunk, and under a Laplace decoder it is L1 (the choice ACT makes).
  • Regularization KL(q(z|a,s)‖p(z)): keeps the encoder's posterior near the standard-normal prior so that, at test time, sampling z ∼ N(0,I) yields coherent chunks. ACT typically down-weights this term with a small β.
  • Reparameterization trick: to backprop through the sampling z ∼ q, write z = μφ(a,s) + σφ(a,s) ⊙ ε with ε ∼ N(0,I). The randomness moves into ε, so gradients flow through μφ, σφ deterministically.
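The three pieces above fit in a short NumPy sketch. Assumptions: a diagonal-Gaussian encoder parameterised by `mu` and `log_var`, a unit-variance Gaussian decoder (so reconstruction is ½‖a − â‖² up to a constant), and a weight `beta` on the KL term. All function names are illustrative, not ACT's code.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I); randomness lives only in eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def negative_elbo(a, a_recon, mu, log_var, beta=1.0):
    """Reconstruction (unit-variance Gaussian, constant dropped) plus beta * KL."""
    recon = 0.5 * np.sum((a - a_recon) ** 2, axis=-1)
    return recon + beta * kl_to_standard_normal(mu, log_var)

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.0])
log_var = np.zeros(2)                        # sigma = 1 in both dims
z = reparameterize(mu, log_var, rng)         # one latent draw
kl_shifted = kl_to_standard_normal(mu, log_var)           # 0.5 * (1^2) = 0.5
kl_matched = kl_to_standard_normal(np.zeros(2), log_var)  # posterior equals prior
```

In an autodiff framework the same `reparameterize` expression is what lets gradients reach `mu` and `log_var`.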

(c) Chunking shortens the effective horizon

The decoder emits a whole chunk of k actions per latent draw. One policy decision now advances the controller k steps, cutting the number of autoregressive decision points over a horizon T from T to T/k — which directly tempers compounding error (made quantitative in Problem 9). The CVAE handles multimodality; chunking handles horizon. At inference you can discard the encoder entirely and just sample z ∼ p(z).

Problem 6 Diffusion Policy: the noise-prediction objective

derivation theory

Start from the DDPM forward process q(at|a0) = N(√ᾱt a0, (1−ᾱt)I). (a) Derive/justify the noise-prediction training objective ‖ε − εθ(at,t,o)‖². (b) Describe a single reverse denoising step. (c) Contrast NFE / inference cost with flow matching.

Show solution

(a) From the closed-form forward process to the loss

The forward marginal lets us jump to any noise level t in one shot using the reparameterization

at = √ᾱt a0 + √(1−ᾱt) ε,   ε ∼ N(0, I)

The principled objective is the variational bound on log p(a0|o), which for Gaussian transitions reduces to predicting the posterior mean of each reverse step. Ho et al. (2020) show that, after dropping t-dependent weights, this is equivalent to the simple denoising-score / noise-prediction regression: have the network guess the noise ε that was added.

Lsimple = 𝔼t, a0, ε, o  ‖ ε − εθ(at, t, o) ‖²

Because ε determines a0 given at, predicting the noise is equivalent (up to scaling) to predicting the clean action or the score ∇_{at} log q(at). Conditioning on the observation o makes it a conditional generative head for actions.

(b) One reverse denoising step

Given at, predict the noise, take a step toward less noise, and inject a little fresh noise (DDPM ancestral sampling):

at−1 = (1/√αt) ( at − ((1−αt)/√(1−ᾱt))  εθ(at,t,o) ) + σt z,   z ∼ N(0,I)

Iterating from t = T down to t = 1 walks pure noise back to a clean action chunk a0. (A DDIM variant drops the stochastic term for a deterministic ODE-like trajectory.)
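Here is a small NumPy sketch of the forward jump and one reverse step, using an oracle in place of the network so the algebra can be checked. Assumptions: a linear β schedule with T = 50 (illustrative only), 0-indexed arrays so that index 0 is the document's t = 1, and the true ε fed in where εθ would go.

```python
import numpy as np

T = 50
betas = np.linspace(1e-4, 0.02, T)      # assumed linear schedule, illustrative values
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # alpha_bar_t = product of alphas up to t

def q_sample(a0, t, eps):
    """Jump straight to noise level t: a_t = sqrt(ab_t) a0 + sqrt(1 - ab_t) eps."""
    return np.sqrt(alpha_bars[t]) * a0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def predict_a0(a_t, t, eps_hat):
    """Invert the forward jump: the noise estimate determines the clean action."""
    return (a_t - np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])

def reverse_step(a_t, t, eps_hat, z, sigma_t):
    """One DDPM ancestral step from level t to level t - 1."""
    coef = (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t])
    mean = (a_t - coef * eps_hat) / np.sqrt(alphas[t])
    return mean + sigma_t * z

rng = np.random.default_rng(0)
a0 = rng.standard_normal(4)             # stand-in for a clean action chunk
eps = rng.standard_normal(4)

a_mid = q_sample(a0, 30, eps)
a0_recovered = predict_a0(a_mid, 30, eps)          # oracle noise recovers a0

# At the first noise level alpha_bar equals alpha, so a noiseless oracle step returns a0.
a_first = q_sample(a0, 0, eps)
a0_from_step = reverse_step(a_first, 0, eps, z=np.zeros(4), sigma_t=0.0)
```

With a trained εθ the recovery is only approximate, which is why the sampler iterates over all levels instead of jumping once.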

(c) NFE / cost vs flow matching

| Aspect | Diffusion Policy (DDPM) | Flow matching (P1) |
|---|---|---|
| Target | noise ε per step | velocity x1 − x0 |
| Path geometry | curved (variance schedule) | straight lines |
| NFE at inference | tens–hundreds (DDIM cuts to ~10–50) | often 5–10 Euler steps |
| Real-time fit | heavier per action | lighter (why SmolVLA/π0 chose it) |

Both are mode-covering generative heads that fix the MSE collapse of Problem 2. Flow matching's straight conditional paths make the marginal field easier to integrate, so it usually hits the same sample quality in far fewer function evaluations — decisive for the creature's control loop.

Problem 9 Action chunking & compounding error

derivation systems

Show quantitatively how predicting chunks of length k cuts the number of decision points over a horizon T from T to T/k, and thus tempers compounding error. Discuss the smoothness / latency trade-off and the n_action_steps gotcha.

Show solution

Decision points

If the policy is queried every step, a horizon T contains T independent chances to step off the expert distribution. Predict a chunk of k actions per query and execute the whole chunk open-loop, and the policy is only consulted

D = ⌈ T / k ⌉  decision points (vs T)

Why it tempers compounding error

Connect to Problem 4: BC's worst-case cost scales like the per-decision error rate ε times the square of the number of decision points. Re-derived over the chunked horizon, that count drops from T to T/k. Counted in decisions, the ε·D² penalty shrinks by a factor of k²; counted in per-step cost as in Problem 4 (a mistake at decision j can waste up to T − jk remaining steps), the sum is ≈ εT²/(2k), a factor-of-k saving. Both figures assume the per-decision error rate stays near ε. Either way there are fewer junctions at which a single mistake can snowball into off-distribution states. Empirically (ACT) this is a big part of why chunking works.

The trade-offs

  • Smoothness vs reactivity: long open-loop chunks are smooth but blind — the robot can't react to a disturbance mid-chunk. Short chunks are reactive but reintroduce decision points (and jitter at chunk boundaries). Temporal ensembling / action overlap blends chunk boundaries to keep motion smooth.
  • Latency: a fresh chunk is computed only every k control ticks, amortising the (expensive) policy forward pass — relevant to the fast/slow split in Problem 10.
The n_action_steps gotcha

In LeRobot, the model may predict a chunk of size chunk_size but you choose how many of those to actually execute via n_action_steps before re-querying. Setting n_action_steps equal to the full predicted horizon makes the policy maximally open-loop (smooth but unreactive); setting it to 1 re-plans every tick (reactive but heavy, and discards most of the predicted chunk). Mis-matching these two is a classic source of either jittery or sluggish behaviour. Version-sensitive: confirm the field names against your installed LeRobot.
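To see how the two knobs interact, here is a toy receding-horizon executor that only counts policy queries. It is a hypothetical sketch, not LeRobot's implementation: the parameter names `chunk_size` and `n_action_steps` are borrowed from the text, and the "policy" is a stand-in that returns dummy actions.

```python
import math

def decision_points(horizon, k):
    """D = ceil(T / k): policy queries when each chunk of k actions runs open-loop."""
    return math.ceil(horizon / k)

def count_policy_queries(horizon, chunk_size, n_action_steps):
    """Run a toy executor for `horizon` ticks and count how often the policy is queried."""
    if not 1 <= n_action_steps <= chunk_size:
        raise ValueError("n_action_steps must lie in [1, chunk_size]")
    queries = 0
    queue = []
    for _ in range(horizon):
        if not queue:                              # queue drained: re-query the policy
            chunk = list(range(chunk_size))        # stand-in for a predicted chunk
            queue = chunk[:n_action_steps]         # keep only the steps we will execute
            queries += 1
        queue.pop(0)                               # execute one action this tick
    return queries

open_loop = count_policy_queries(100, chunk_size=50, n_action_steps=50)
replan_every_tick = count_policy_queries(100, chunk_size=50, n_action_steps=1)
middle_ground = count_policy_queries(100, chunk_size=50, n_action_steps=10)
```

The counts match ⌈T / n_action_steps⌉, so the executed length, not the predicted length, sets the number of decision points.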

Theme B

Imitation theory — covariate shift

Problem 4 Covariate shift: the O(εT²) cost of behavioral cloning

derivation theory

Argue/derive why behavioral cloning incurs O(εT²) expected cost over a horizon T, whereas a learner that stayed on the expert's state distribution would incur only O(εT), and how DAgger restores ~O(εT).

Show solution

Setup

Suppose BC, evaluated on the expert's own state distribution, makes a mistake (an action the expert would not take) with probability at most ε per step. Assume a per-step cost in [0,1] and that one mistake can cost up to 1 per remaining step.

The quadratic blow-up

The trouble is distribution shift: the moment the learner errs, it visits states the expert never demonstrated, where the error rate is no longer bounded by ε. Bound it by the step at which the first mistake happens. The probability the first mistake lands at step t is ≈ ε, and from there up to T − t further steps can each incur cost ~1:

J(πBC) − J(π*) ≲ Σ_{t=1}^{T} ε·(T − t) ≈ ε·T²/2 = O(εT²)

(Ross & Bagnell, 2010.) By contrast, a learner that stayed on the expert's state distribution would err with probability at most ε per step, so its extra cost over the horizon would be at most εT = O(εT): error would accrue linearly, not quadratically. The extra factor of T is the price of compounding: errors push you off-distribution, which raises the future error rate, which pushes you further off.
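Plugging numbers into the bound makes the gap tangible. The sketch below simply evaluates the sum Σ ε·(T − t) and the linear on-distribution figure εT; ε = 0.01 and T = 100 are arbitrary illustrative values, and the function names are made up.

```python
def bc_worst_case_bound(eps, horizon):
    """Sum over first-mistake times t of eps * (remaining steps T - t)."""
    return sum(eps * (horizon - t) for t in range(1, horizon + 1))

def on_distribution_bound(eps, horizon):
    """Linear accumulation when every step is judged on the expert's distribution."""
    return eps * horizon

eps, horizon = 0.01, 100
quadratic = bc_worst_case_bound(eps, horizon)       # eps * T * (T - 1) / 2
linear = on_distribution_bound(eps, horizon)        # eps * T
ratio = quadratic / linear                          # grows like (T - 1) / 2
```

Doubling T roughly doubles the ratio, which is the extra factor of T in the bound.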

How DAgger restores ~O(εT)

  • Roll out the learner, collecting the states it actually visits (including off-distribution ones).
  • Query the expert for the correct action on those states, and aggregate this into the dataset (DAgger = Dataset Aggregation).
  • Retrain and repeat. Training and test distributions now coincide, so the ε bound holds on the states actually encountered and the cost collapses back to the near-linear O(εT) regime (no-regret online-learning guarantee).
Build link

DAgger needs an interactive expert, which is awkward for teleop. The chunking trick of Problem 9 attacks the same T² term from the other side — fewer decision points — which is why we lean on chunked generative heads rather than full DAgger loops in the 14-day sprint.

Problem 11 Symbol grounding in a VLA bonus

systems theory

Explain how a VLA grounds the language token "red cup" into an action distribution, and why end-to-end imitation sidesteps an explicit perception → symbol → plan pipeline (tie to Brooks, "Intelligence without Representation").

Show solution

Grounding "red cup" → actions

A VLA is a conditional policy p(action | image, instruction, proprio). The token sequence "pick up the red cup" enters the same transformer as the image patches and joint states. Cross-attention lets the language tokens select the visual features that co-occur with the action of grasping that object in the demonstrations. Nothing in the network ever materialises a discrete symbol RED_CUP with a typed slot; "red cup" is grounded purely as the statistical hinge that, conditioned on these pixels, shifts the learned action distribution toward the trajectories that reached for the red cup during training. Grounding = a conditioning effect on the output distribution, learned end-to-end.

Why end-to-end skips perception → symbol → plan

  • The classical pipeline (detect → build a symbolic world model → plan over symbols → execute) has brittle seams: each stage's errors compound, and the symbol vocabulary has to be designed up front.
  • End-to-end imitation collapses all stages into one differentiable map from sensors to actions, optimised jointly. There is no explicit world model to keep consistent — the policy only needs the regularities that the task actually exploits.
  • This is Brooks' "Intelligence without Representation": competent behaviour can emerge from a tight sensorimotor coupling without a central reconstructed model of the world. The VLA is the modern, learned incarnation — "the world is its own best model," now grounded in pretrained vision-language features rather than hand-built behaviours.
Honest caveat

No explicit symbols also means no easy guarantees, no clean interpretability of why "red cup" mapped where it did, and generalisation only as far as the data + pretraining support. The creature trades legibility for end-to-end robustness.

Theme C

World models — learning to imagine

Problem 7 RSSM / world-model KL: why the prior chases the posterior

derivation theory

Write the world-model ELBO with the reconstruction term plus KL(q(zt|ht,xt) ‖ p(zt|ht)), and explain why making the prior match the posterior is exactly what enables latent imagination (rolling forward with no observations).

Show solution

The recurrent state-space model

A deterministic recurrent state ht = f(ht−1, zt−1, at−1) carries history. Two heads read it: the prior p(zt|ht) predicts the stochastic latent from history alone; the posterior q(zt|ht,xt) corrects it using the current observation xt.

The ELBO (per step)

L = Σt [  𝔼q[ log p(xt | ht, zt) ]  −  KL( q(zt|ht,xt) ‖ p(zt|ht) ) ]

Same Jensen-bound structure as Problem 5: a reconstruction term that forces the latent to retain enough about the observation to predict it, minus a KL term that aligns the observation-free prior with the observation-informed posterior.

Why the KL is the whole point

  • Minimising KL(q ‖ p) trains the prior to predict what the posterior would have inferred — i.e. to guess the next latent without seeing the next observation.
  • Once the prior is a good stand-in for the posterior, you can drop the observation entirely and roll the dynamics in latent space: sample zt ∼ p(zt|ht), update ht+1, repeat. That is latent imagination (Dreamer) — generating long synthetic rollouts to train a policy/value with no environment steps.
  • If the prior did not match the posterior, imagined trajectories would diverge from what real observations would have produced, and any policy trained on them would fail on the real robot. The KL term is precisely the contract that keeps imagination faithful.
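The bullets above can be sketched as two small functions: the KL between diagonal Gaussians that the loss penalises, and an imagination loop that only ever samples from the prior. Everything here is a toy under stated assumptions: `step_fn` and `prior_fn` are hand-written stand-ins for the learned f and p(zt|ht), and all dimensions are equal for simplicity.

```python
import numpy as np

def kl_diag_gaussians(mu_q, std_q, mu_p, std_p):
    """KL( N(mu_q, std_q^2) || N(mu_p, std_p^2) ) for diagonal Gaussians, summed over dims."""
    return np.sum(
        np.log(std_p / std_q)
        + (std_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * std_p ** 2)
        - 0.5,
        axis=-1,
    )

def imagine(h0, z0, actions, step_fn, prior_fn, rng):
    """Roll the model forward in latent space using only the prior (no observations)."""
    h, z, trajectory = h0, z0, []
    for a in actions:
        h = step_fn(h, z, a)                 # h_t = f(h_{t-1}, z_{t-1}, a_{t-1})
        mu, std = prior_fn(h)                # p(z_t | h_t)
        z = mu + std * rng.standard_normal(mu.shape)
        trajectory.append((h, z))
    return trajectory

# Hypothetical stand-ins for the learned networks.
step_fn = lambda h, z, a: np.tanh(0.5 * h + 0.3 * z + 0.2 * a)
prior_fn = lambda h: (0.9 * h, np.full_like(h, 0.1))

rng = np.random.default_rng(0)
dim = 3
actions = [rng.standard_normal(dim) for _ in range(5)]
rollout = imagine(np.zeros(dim), np.zeros(dim), actions, step_fn, prior_fn, rng)

kl_matched = kl_diag_gaussians(np.zeros(dim), np.ones(dim), np.zeros(dim), np.ones(dim))
kl_shifted = kl_diag_gaussians(np.ones(dim), np.ones(dim), np.zeros(dim), np.ones(dim))
```

Training drives the KL toward zero on real data; `imagine` is only trustworthy to the extent that it got there.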

Theme D

Voice & interaction — codecs and latency

Problem 3 Derive the Moshi latency budget from the frame rate

core systems

Mimi (Moshi's neural codec) operates at 12.5 Hz. (a) Show each frame = 1/12.5 = 80 ms. (b) Show that with one frame of processing / lookahead the theoretical end-to-end latency ≈ 80 + 80 = 160 ms. (c) Discuss what adds to this in practice, and how full-duplex (no turn-taking gate) keeps the felt latency low.

Show solution

(a) Frame duration

The frame rate is the number of codec frames per second; the frame period is its reciprocal:

Tframe = 1 / f = 1 / 12.5 Hz = 0.08 s = 80 ms

So Mimi emits one set of audio tokens every 80 ms, and the language model runs at this same 12.5 Hz token clock.

(b) Theoretical end-to-end latency

Two unavoidable 80 ms quanta dominate the ideal pipeline:

Te2e ≈ Tin + Tout = 80 ms (fill an input frame) + 80 ms (one frame of processing / lookahead) = 160 ms

You must accumulate ~one frame of incoming audio before there is anything to encode, and the model needs at least one frame of compute/lookahead before it can emit the first output frame. 160 ms is the architectural floor implied purely by the 12.5 Hz clock.
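The arithmetic of parts (a) and (b) as a tiny helper, mainly so you can vary the frame rate or lookahead and watch the floor move. The function names and the `lookahead_frames` parameter are illustrative.

```python
def frame_period_ms(frame_rate_hz):
    """Frame period in milliseconds: the reciprocal of the frame rate."""
    return 1000.0 / frame_rate_hz

def latency_floor_ms(frame_rate_hz, lookahead_frames=1):
    """One frame to fill the input plus `lookahead_frames` frames of processing."""
    period = frame_period_ms(frame_rate_hz)
    return period + lookahead_frames * period

mimi_period = frame_period_ms(12.5)
mimi_floor = latency_floor_ms(12.5)
```

Halving the frame rate would double both numbers, which is the price of a cheaper token stream.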

(c) What adds to it in practice, and the full-duplex win

  • Real additions: model forward-pass time (must beat the 80 ms budget to run real-time), audio buffering on capture/playback, network/RTC jitter buffers, OS audio stack, and any extra lookahead frames the codec uses. Real systems land higher than 160 ms.
  • Full-duplex is the differentiator: classic voice assistants insert a turn-taking gate — VAD waits for end-of-speech, then ASR → LLM → TTS run in sequence, stacking hundreds of ms of silence-detection and pipeline delay on top of compute. Moshi instead models both streams (user + assistant) at the same 12.5 Hz clock simultaneously, so it can listen and speak at once.
  • With no end-of-turn gate, the model can start responding, back-channel ("mm-hm"), or interrupt within roughly one frame of the user's audio. The felt latency approaches the 160 ms floor instead of the multi-second feel of gated cascades — which is exactly what makes the creature feel alive.

Problem 8 RVQ codec: bitrate and semantic vs acoustic tokens

derivation systems

For residual vector quantization with L codebooks, write x ≈ Σ qi with residuals ri = ri−1 − qi. (a) Compute the bitrate for codebook size K and frame rate f. (b) Plug in Mimi-like numbers. (c) Explain semantic vs acoustic tokens.

Show solution

(a) RVQ and its bitrate

Residual VQ quantizes the vector in successive refinement stages. Start with r0 = x; at stage i, pick the nearest entry qi from codebook i to the current residual, then subtract it:

qi = argmin_{c ∈ Ci} ‖ri−1 − c‖,   ri = ri−1 − qi,   x ≈ Σ_{i=1}^{L} qi

Each of the L stages emits one index into a codebook of size K, costing log2K bits. At f frames per second:

bits/s = L · log2(K) · f

(b) Mimi-like numbers

Take L = 8 codebooks, K = 2048 (so log2K = 11 bits), f = 12.5 Hz:

bits/s = 8 · 11 · 12.5 = 1100 bit/s ≈ 1.1 kbps

That is the headline: telephone-quality-and-up speech tokenised at roughly 1 kbps, which is what makes it cheap enough for an LLM to model audio token-by-token at the 80 ms clock. (Exact Mimi config is version-sensitive — confirm L, K against the release you run.)
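A toy RVQ encoder/decoder plus the bitrate formula, as a sketch. Assumptions: random codebooks with tiny sizes (K = 16, three stages, 4-dimensional vectors) chosen only to keep the example small, and helper names invented for illustration. A real codec learns its codebooks.

```python
import math
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual VQ: at each stage pick the entry nearest to the current residual."""
    residual = np.array(x, dtype=float)          # r_0 = x
    indices = []
    for C in codebooks:                          # C has shape (K, dim)
        dists = np.linalg.norm(C - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        residual = residual - C[idx]             # r_i = r_{i-1} - q_i
    return indices, residual

def rvq_decode(indices, codebooks):
    """Reconstruct x as the sum of the selected entries."""
    return sum(C[i] for C, i in zip(codebooks, indices))

def bitrate_bps(n_codebooks, codebook_size, frame_rate_hz):
    """bits/s = L * log2(K) * f."""
    return n_codebooks * math.log2(codebook_size) * frame_rate_hz

rng = np.random.default_rng(0)
dim, K, L = 4, 16, 3
# Later stages get smaller-scale entries, mimicking successive refinement.
codebooks = [rng.standard_normal((K, dim)) * 0.5 ** i for i in range(L)]
x = rng.standard_normal(dim)

indices, final_residual = rvq_encode(x, codebooks)
x_hat = rvq_decode(indices, codebooks)
mimi_like_bps = bitrate_bps(8, 2048, 12.5)
```

By construction the reconstruction error x − x̂ equals the final residual, so adding stages is what buys fidelity, at log2(K)·f extra bits per second each.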

(c) Semantic vs acoustic tokens

  • Semantic token: the first codebook is distilled to align with a self-supervised speech model (e.g. WavLM), so it carries linguistic / content information — what was said. This makes the token stream easy for a language model to reason over.
  • Acoustic tokens: the remaining residual codebooks capture how it sounds — speaker timbre, prosody, recording conditions — the detail needed to reconstruct natural-sounding audio.
  • Splitting the two lets one model predict meaning and generate faithful audio from the same low-bitrate stream — the foundation of Moshi's full-duplex speech (Problem 3).

Theme E

Systems — making the creature feel alive

Problem 10 Fast/slow decoupling: protecting the latency budget

systems

A fast loop runs at ~200 ms; a slow VLA loop takes ~1–3 s. Show why running them synchronously destroys the conversational latency budget, and quantify the felt-latency benefit of running them async with phase reporting.

[Figure: timelines for the two designs. Synchronous (coupled): the VLA forward pass takes 1–3 s and everything waits, so the fast loop is stalled with no response and no motion. Asynchronous (decoupled): the fast loop keeps ticking at 200 ms (voice + idle motion) while the VLA computes in the background, a "thinking…" phase is reported, and the result is swapped in when ready with no stall.]
Sync couples the conversational loop to the slowest component; async lets the fast loop hide the VLA's latency behind continuous presence.
Show solution

Why synchronous coupling is fatal

If the fast loop must wait for each VLA call to return before it can act, the effective tick period of the whole creature becomes the slow loop's period:

Tsync = max(Tfast, Tslow) = max(200 ms, 1–3 s) = 1–3 s per response

Compare against the conversational floor from Problem 3 (~160 ms). A 1–3 s gate is 6–19× over budget: the creature freezes — no voice, no micro-motion — every time it "thinks." Felt latency is dominated by the slowest component.

Async with phase reporting

Decouple the loops onto separate clocks. The fast loop keeps running at 200 ms (voice back-channels, gaze, idle sway) while the VLA computes in the background; when the new action chunk is ready it is swapped in. The user-perceived latency of the interactive channel becomes the fast period, independent of the VLA:

Tfelt ≈ Tfast = 200 ms  (vs 1–3 s)
speed-up ≈ Tslow / Tfast ≈ (1000–3000)/200 = 5×–15× lower felt latency
  • Phase reporting closes the gap that remains: while the VLA runs, the fast loop emits a "thinking…" cue (a verbal "let me look", a head tilt, a status light). The delay becomes legible rather than dead air — perceived responsiveness rises even further than the raw number.
  • Staleness vs responsiveness: the executed action chunk may be up to one slow-loop period old. Keep chunks short enough (Problem 9) that mild staleness is harmless, and let the fast loop handle anything reactive.
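A deterministic toy timeline shows the difference without any real threads. This is a sketch under simple assumptions: one VLA call starts at t = 0 and takes `slow_ms`, the fast loop ticks every `fast_ms`, and the phase labels "thinking" and "acting" are invented for illustration.

```python
def sync_period_ms(fast_ms, slow_ms):
    """Coupled loops: the whole creature ticks at the slowest component's period."""
    return max(fast_ms, slow_ms)

def simulate_async(fast_ms, slow_ms, total_ms):
    """Return (time_ms, phase) for every fast tick while one VLA call runs in the background."""
    timeline = []
    for t in range(0, total_ms, fast_ms):
        # Until the background call returns, the fast loop reports a thinking phase.
        phase = "thinking" if t < slow_ms else "acting"
        timeline.append((t, phase))
    return timeline

fast_ms, slow_ms = 200, 2000
timeline = simulate_async(fast_ms, slow_ms, total_ms=3000)
thinking_ticks = sum(1 for _, phase in timeline if phase == "thinking")
felt_speedup = sync_period_ms(fast_ms, slow_ms) / fast_ms
```

In the coupled design those ten "thinking" ticks would be ten missed chances to respond; here each one is a moment the creature can still speak or move.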

Where to go next

If a derivation didn't click, the cross-link on each card points to the page that builds the idea from scratch. When all eleven feel routine, you can reconstruct the entire ANIMA-Kiwi stack — generative action heads, imitation theory, world models, the speech codec, and the fast/slow orchestration — from first principles. That is the bar.