Day 3–5 · Track 1 — Body

Action Models

Page 1 ended with a problem: behavior cloning with an MSE/Gaussian head averages the demonstrations, so at a fork in the road the robot drives straight into the wall between the two good options. The fix is an expressive, multimodal action head. Three families do this — latent-variable (ACT/CVAE), denoising (Diffusion Policy), and continuous-time transport (Flow Matching) — and the last one is what our VLA will ship.

A · Recap & roadmap

Recall the compounding-error result from page 1: a behavior-cloning policy with per-step error ε drifts off the expert distribution at rate $O(\epsilon T^2)$ over a horizon T. But there is a deeper failure even with zero noise: a unimodal head trained by MSE collapses two valid trajectories into their average, which is often invalid. The cure is a head that can represent a distribution over action chunks, not a single mean. Below are the three families we cover, all conditioned on observation o (vision + state, and later language).

[Figure: an observation o (vision + state, plus language later) feeds three heads: latent-variable (CVAE → ACT: sample z ∼ prior, decode a chunk), denoising (Diffusion Policy: iteratively denoise noise → action), and transport (Flow Matching: integrate an ODE from noise to action in a few steps). Each outputs an action chunk $a_{t:t+k}$ with k≈50; SmolVLA and π0 (page 3) use flow matching.]
Three expressive heads, one job: turn an observation into a distribution over the next ~50 actions, then sample one chunk.

Foreshadow: by page 3 we converge on flow matching, because both SmolVLA and π0 attach a flow-matching action expert to a VLM backbone. The math here is the foundation for that.

B · Action chunking & ACT

Maps to ETH Lecture 7 · Sequence Modeling — Zhao et al., "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware" (ALOHA / ACT)

B.1 · Action chunking — predict k actions at once

Instead of predicting one action per observation and re-querying every step, ACT predicts a chunk of the next k actions in a single forward pass. This is the single most important trick for cheap-hardware imitation learning, and it attacks compounding error directly.

standard BC: $a_t = \pi(o_t)$  →  decide every step, horizon = T
chunked: $a_{t:t+k} = \pi(o_t)$  →  decide every k steps, effective horizon = T/k

If error compounds like $O(\epsilon T^2)$ in the number of decision points, then cutting decision points by a factor of k turns $T^2$ into $(T/k)^2$ and cuts the compounding term by roughly $k^2$. Chunking also lets the policy commit to a coherent motion (e.g. a full grasp) rather than re-deciding mid-reach, which is where unimodal indecision causes pauses and jitter.
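As a quick sanity check on that scaling, the arithmetic can be sketched with illustrative numbers. T = 500 and k = 50 are assumptions chosen for the example, not values from the lecture.

```python
# Illustrative numbers only: an episode of T control steps and a chunk length k.
T = 500   # assumed episode length in control steps
k = 50    # assumed chunk length

decisions_per_step = T        # standard BC: one policy decision per step
decisions_chunked = T // k    # chunked: one policy decision per k steps

# If the compounding term scales like eps * (number of decisions)^2,
# the ratio between the two regimes does not depend on eps.
ratio = decisions_per_step ** 2 / decisions_chunked ** 2

print(decisions_per_step, decisions_chunked, ratio)  # 500 10 2500.0
```

The ratio equals k², which is the "roughly k²" reduction claimed above; it is a bound-level argument, not a guarantee about any particular rollout.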

Temporal ensembling

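Overlapping chunks predicted at consecutive timesteps give several predictions for the same future action. ACT averages them with an exponential weight, which smooths the executed trajectory without adding latency:

$$\hat a_t = \frac{\sum_i w_i\, a_t^{(i)}}{\sum_i w_i}, \qquad w_i = \exp(-m \cdot i)$$

where i indexes the predictions for timestep t, with i = 0 the oldest, and m sets how quickly newer predictions are incorporated. With m > 0 the older predictions carry more weight; a smaller m brings new observations in faster, and m = 0 is a plain average.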
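A minimal sketch of that weighted average for a single scalar action, assuming the ACT paper's convention that i = 0 is the oldest prediction. The function name and the default m = 0.01 are illustrative choices, and real actions would be vectors averaged per dimension.

```python
import math

def temporal_ensemble(predictions, m=0.01):
    """Weighted average of several predictions for the same timestep.

    predictions[0] is the oldest prediction, predictions[-1] the newest.
    Weights follow w_i = exp(-m * i).
    """
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total

# Three overlapping chunks each proposed a value for the same joint at time t.
preds = [0.10, 0.14, 0.12]
print(temporal_ensemble(preds, m=0.0))  # m = 0 reduces to a plain mean
print(temporal_ensemble(preds, m=0.5))  # m > 0 leans toward the older predictions
```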

Practical gotcha

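At deploy, the runner's n_action_steps (how many actions you actually execute before re-querying the policy) should match the chunk length (about 50 in this build), not 1. If you re-query every step without temporal ensembling, you throw away chunking, the policy chatters between modes, and the arm moves in jerky stutters. The one exception is temporal ensembling itself, which by construction re-queries every step and averages the overlapping predictions; LeRobot's ACT requires n_action_steps = 1 when its temporal-ensembling coefficient is set. Otherwise, always confirm n_action_steps ≈ chunk_size in the checkpoint config (see the lab).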

B.2 · ACT is a conditional VAE

A chunk of demonstrated actions is multimodal: the same observation can correspond to several valid motions, and within a chunk there is natural variation in style. ACT models $p(a_{\text{chunk}} \mid o)$ with a latent variable z (a conditional VAE), so the decoder can produce different coherent chunks for the same observation by varying z.

[Figure: ACT as a CVAE. Train: the action chunk $a_{1:k}$ and obs o enter an encoder transformer that outputs μ, σ; z is drawn with the reparameterization trick; a decoder transformer, always conditioned on o, outputs the reconstructed chunk $\hat a_{1:k}$. Test: drop the encoder, set z = 0 (prior mean) or sample z ∼ N(0,I), then decode. The encoder path is train only; the decoder path is used at train and test.]
At training the encoder sees the true chunk; at test it is discarded and z comes from the prior. The decoder is always conditioned on o.

The ELBO, derived

We want to maximize the conditional log-likelihood of the demonstrated chunk. It is intractable, so we lower-bound it. Introduce a variational posterior q(z|a,o) and apply Jensen's inequality:

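$$\begin{aligned}
\log p(a \mid o) &= \log \int p(a \mid z, o)\, p(z)\, dz \\
&= \log \mathbb{E}_{q(z \mid a, o)}\left[ \frac{p(a \mid z, o)\, p(z)}{q(z \mid a, o)} \right] \\
&\ge \mathbb{E}_{q(z \mid a, o)}\left[ \log p(a \mid z, o) \right] - \mathrm{KL}\left( q(z \mid a, o) \,\|\, p(z) \right)
\end{aligned}$$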

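This is the ELBO, the objective ACT maximizes (equivalently, it minimizes the negative ELBO). It has two terms: a reconstruction term, which ACT implements as an L1 loss between the demonstrated chunk and the decoded chunk, and a KL term, weighted by β, that pulls the posterior toward the prior N(0,I):

$$L_{\mathrm{ACT}} = \| a - \hat a(z, o) \|_1 + \beta \cdot \mathrm{KL}\left( q(z \mid a, o) \,\|\, N(0, I) \right)$$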
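To make the two terms concrete, here is a minimal sketch that evaluates the loss for one flattened chunk, assuming the encoder outputs a diagonal Gaussian (mean and log-variance per latent dimension). The function name act_loss and the value beta = 10.0 are illustrative assumptions; check the KL weight in your own config.

```python
import math

def act_loss(a, a_hat, mu, log_var, beta=10.0):
    """L1 reconstruction plus beta-weighted KL(q || N(0, I)) for a diagonal Gaussian q."""
    # Reconstruction term: L1 distance between demonstrated and decoded chunk.
    recon = sum(abs(x - y) for x, y in zip(a, a_hat))
    # Closed-form KL between N(mu, diag(exp(log_var))) and N(0, I).
    kl = 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))
    return recon + beta * kl, recon, kl

# Perfect reconstruction with a posterior equal to the prior gives zero loss.
print(act_loss([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [0.0, 0.0]))
# Shifting the posterior mean away from 0 is penalised by the KL term only.
print(act_loss([1.0, 2.0], [1.0, 2.0], [1.0, 0.0], [0.0, 0.0]))
```

Many implementations average the L1 term over elements rather than summing it; that only rescales the effective β.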

Reparameterization trick

You cannot backprop through a random sample. So instead of sampling z ∼ N(μ,σ²) directly, write it as a deterministic function of a fixed-noise variable:

z = μ + σ ⊙ ε,    ε ∼ N(0, I)

Now the randomness lives in ε, and gradients flow cleanly into μ and σ.
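A minimal sketch of the trick using only the standard library (the helper name reparameterize is illustrative). It shows the sampling form only; in an autodiff framework the same expression is differentiable in μ and σ because ε is treated as a constant input.

```python
import random

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), elementwise."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

rng = random.Random(0)
mu, sigma = [0.5, -1.0], [0.1, 2.0]

# Many draws should average back to mu, since eps has zero mean.
samples = [reparameterize(mu, sigma, rng) for _ in range(20000)]
mean0 = sum(s[0] for s in samples) / len(samples)
print(round(mean0, 2))
```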

At test time

Discard the encoder entirely (it needed the ground-truth chunk, which you don't have at deploy). Either sample z ∼ N(0,I) for stochastic, multimodal chunks, or set z = 0 (the prior mean) for the deterministic "most-likely" chunk; LeRobot's ACT defaults to the latter for stable rollouts. Decode with o to get $\hat a_{1:k}$, execute the chunk, slide forward, repeat.

[Figure: a timeline of rolling chunks. A chunk predicted at t₀ executes k≈50 actions, the next chunk is predicted at t₁ (overlap → temporal ensemble), then another at t₂. n_action_steps is the number of actions executed per prediction (= chunk length, not 1).]
Rolling chunks. Overlap is what temporal ensembling averages over; the execution stride is n_action_steps.

How this maps to my build: ACT is the simplest expressive head and LeRobot's most battle-tested for cheap arms — it is my baseline. I'll keep chunk_size≈50, set n_action_steps to match, and default z = 0 for repeatable pick-place rollouts on the LeKiwi.

C · Diffusion Policy

Maps to ETH Lecture 6 · Generative Models — Janner/Du "Diffuser / Planning with Diffusion", Florence et al. "Implicit Behavioral Cloning", Wagenmaker et al. "Steering Diffusion Policies with Latent-Space RL"

Diffusion Policy treats the action chunk as something to be denoised into existence. Start from pure Gaussian noise and iteratively clean it, conditioned on the observation, until a coherent action sequence emerges. Because it samples from the learned distribution instead of regressing a mean, it represents multimodality natively.

C.1 · Forward process (fixed noising)

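Take a clean action chunk $a_0$ from the dataset and gradually add Gaussian noise over diffusion steps t = 1…N. The closed form for any step uses the cumulative product $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$ of the noise schedule:

$$q(a_t \mid a_0) = N\left( a_t;\ \sqrt{\bar\alpha_t}\, a_0,\ (1 - \bar\alpha_t)\, I \right)$$
equivalently   $$a_t = \sqrt{\bar\alpha_t}\, a_0 + \sqrt{1 - \bar\alpha_t}\, \varepsilon, \qquad \varepsilon \sim N(0, I)$$

As t→N, $\bar\alpha_t \to 0$ and $a_t$ becomes pure noise. This process is not learned; it just defines the corruption we will reverse.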
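The closed-form noising can be sketched in a few lines. The linear β schedule with 1000 steps below is an assumption borrowed from common DDPM setups, picked because it visibly drives the cumulative product toward zero; the helper names are illustrative, and a real policy may use a different schedule and step count.

```python
import math
import random

def alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of alpha_t = 1 - beta_t for a linear beta schedule."""
    bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        bars.append(prod)
    return bars

def q_sample(a0, alpha_bar_t, rng):
    """Sample a_t from q(a_t | a_0) in closed form; also return the noise used."""
    eps = [rng.gauss(0.0, 1.0) for _ in a0]
    a_t = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
           for x, e in zip(a0, eps)]
    return a_t, eps

bars = alpha_bars(1000)
print(bars[0], bars[-1])   # close to 1 at the first step, close to 0 at the last

a0 = [0.2, -0.5, 0.9]      # hypothetical flattened action chunk
a_mid, _ = q_sample(a0, bars[499], random.Random(0))
print(a_mid)
```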

C.2 · Training objective (predict the noise)

Train a network εθ to predict the noise that was added, given the noisy chunk, the step index, and the observation conditioning o. This is the DDPM simple loss:

$$L_{\mathrm{DDPM}}(\theta) = \mathbb{E}_{t,\, a_0,\, \varepsilon,\, o} \left\| \varepsilon - \varepsilon_\theta(a_t, t, o) \right\|^2$$

C.3 · Reverse process (sampling)

To act, sample $a_N \sim N(0, I)$ and walk the chain backward, subtracting predicted noise each step (DDPM form):

$$a_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( a_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\, \varepsilon_\theta(a_t, t, o) \right) + \sigma_t z, \qquad z \sim N(0, I)$$

After N steps you arrive at a clean action chunk $a_0$. The stochastic term $\sigma_t z$ is exactly what lets two runs from the same observation land on different valid modes.
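One reverse step can be written directly from that update. The sketch below is a plain-Python illustration with a hypothetical helper name; the noise prediction is passed in as a list rather than produced by a network. The check at the end uses a one-step chain, where feeding the true noise and setting σ to 0 should recover the clean chunk.

```python
import math
import random

def ddpm_reverse_step(a_t, eps_pred, alpha_t, alpha_bar_t, sigma_t, rng):
    """One DDPM ancestral step a_t -> a_{t-1}, elementwise over the chunk."""
    coef = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_bar_t)
    out = []
    for x, e in zip(a_t, eps_pred):
        mean = (x - coef * e) / math.sqrt(alpha_t)       # denoised mean
        out.append(mean + sigma_t * rng.gauss(0.0, 1.0))  # fresh noise z
    return out

# One-step chain: alpha_bar_1 == alpha_1, so noising then denoising is invertible.
alpha = 0.9
a0 = [0.3, -0.7]
eps = [1.2, -0.4]
a1 = [math.sqrt(alpha) * x + math.sqrt(1.0 - alpha) * e for x, e in zip(a0, eps)]

recovered = ddpm_reverse_step(a1, eps, alpha, alpha, 0.0, random.Random(0))
print(recovered)   # approximately a0
```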

Why diffusion beats MSE on multimodality

An MSE head minimizes $\mathbb{E}\|a - \hat a\|^2$, whose optimum is the conditional mean $\mathbb{E}[a \mid o]$: the invalid average between two modes. Diffusion instead learns a score/noise field whose reverse process draws samples from the full p(a|o). Each rollout commits to one mode; it never blends "go left" and "go right" into "crash."
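The averaging failure is easy to reproduce on a toy fork. In this sketch (illustrative numbers, a single scalar steering action), half the demonstrations steer left and half steer right, and a brute-force scan finds the constant prediction with the lowest MSE.

```python
# Two equally common expert actions at a fork: steer left (-1) or right (+1).
demos = [-1.0] * 50 + [1.0] * 50

def mse(pred, targets):
    """Mean squared error of one constant prediction against all demos."""
    return sum((pred - a) ** 2 for a in targets) / len(targets)

# Scan candidate constant predictions and keep the one with the lowest MSE.
candidates = [i / 100.0 for i in range(-100, 101)]
best = min(candidates, key=lambda c: mse(c, demos))

print(best, mse(best, demos), mse(1.0, demos))  # 0.0 1.0 2.0
```

The MSE-optimal action is 0.0 (straight ahead), which no demonstrator ever chose, while committing to either real mode scores worse under MSE.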

[Figure: reverse denoising from noise $a_N$ to the clean action $a_0$, with snapshots at t = N (pure noise), t = 2N/3, t = N/3, and t = 0 (clean chunk); $\varepsilon_\theta$ is applied at each step (N≈50–100). Forward q (fixed noising): a₀ → a_N. Reverse pθ (learned denoising): a_N → a₀.]
Each reverse step subtracts predicted noise; the action curve sharpens from noise into a committed motion.

How this maps to my build: Diffusion Policy is my strongest multimodal head and a great quality bar — but N≈50–100 network evals per chunk is heavy for the LeKiwi's async runner. I'll train it as a quality reference, then check whether flow matching gets me the same multimodality at a fraction of the inference cost.

D · Flow Matching (heart of the page)

This is what SmolVLA & π0 use — page 3

Flow matching keeps diffusion's expressivity but replaces the noisy, many-step stochastic chain with a deterministic transport: learn a velocity field that carries noise to data along a smooth path, then integrate it with a handful of ODE steps. Far fewer function evaluations means far faster inference — which is exactly what our async runner needs.

D.1 · Continuous normalizing flows & probability paths

Imagine a continuous time τ ∈ [0,1]. At τ=0 we have an easy distribution x0 ∼ N(0,I); at τ=1 we want the data distribution x1 (an action chunk). A probability path is a continuum of distributions connecting them, and a velocity field v(x,τ) is the flow that pushes samples along that path. If we know the right velocity field, we can transport any noise sample to a data sample by following it.

D.2 · Conditional Flow Matching with the OT (linear) path

The trick that makes this trainable: instead of matching the intractable marginal velocity, match a conditional one defined per data point. Pick a pair (noise x0, data x1) and connect them with the simplest possible path — a straight line (the optimal-transport interpolant):

$$x_\tau = (1 - \tau)\, x_0 + \tau\, x_1, \qquad \tau \in [0, 1]$$

Now derive the target velocity. Since xτ is linear in τ, differentiate:

$$\frac{dx_\tau}{d\tau} = \frac{d}{d\tau}\left[ (1 - \tau)\, x_0 + \tau\, x_1 \right] = -x_0 + x_1 = x_1 - x_0$$

The target velocity is constant along the path: it is just the straight-line displacement $x_1 - x_0$. So we train a network $v_\theta$ to regress it. This is the centerpiece:

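$$L_{\mathrm{CFM}}(\theta) = \mathbb{E}_{\tau,\, x_0,\, x_1,\, o} \left\| v_\theta(x_\tau, \tau, o) - (x_1 - x_0) \right\|^2$$

Every symbol:

  • τ: the flow time in [0,1], drawn at random for each training example.
  • $x_0$: a noise sample, $x_0 \sim N(0, I)$, with the same shape as an action chunk.
  • $x_1$: a demonstrated action chunk from the dataset.
  • o: the observation paired with that chunk (vision + state, and later language).
  • $x_\tau$: the point on the straight line between them, $(1 - \tau)\, x_0 + \tau\, x_1$.
  • $v_\theta$: the network, which predicts a velocity from $x_\tau$, τ, and o.
  • $x_1 - x_0$: the regression target, the constant velocity of the straight path.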

Note the loss is a plain regression — no KL, no noise schedule, no variational posterior. That is why flow matching is famously stable to train.
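Building one training example for this regression takes only a few lines. The sketch below assumes τ is drawn uniformly from [0, 1), which is the simplest choice and not necessarily what a given VLA uses; the helper names are illustrative, and the network itself is left out.

```python
import random

def cfm_training_pair(x1, rng):
    """Build (x_tau, tau, target, x0) for one data chunk x1 under the linear path."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]            # noise endpoint
    tau = rng.random()                                # assumed: tau ~ U[0, 1)
    x_tau = [(1.0 - tau) * n + tau * d for n, d in zip(x0, x1)]
    target = [d - n for n, d in zip(x0, x1)]          # constant velocity x1 - x0
    return x_tau, tau, target, x0

def cfm_loss(v_pred, target):
    """Squared error between predicted and target velocity for one example."""
    return sum((v - t) ** 2 for v, t in zip(v_pred, target))

x1 = [0.2, -0.5, 0.9]   # hypothetical flattened action chunk
x_tau, tau, target, x0 = cfm_training_pair(x1, random.Random(0))

# Following the target velocity for the remaining time (1 - tau) lands on x1.
landed = [p + (1.0 - tau) * v for p, v in zip(x_tau, target)]
print(landed)
```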

D.3 · Sampling — integrate the ODE

At inference, draw noise x0 ∼ N(0,I) and integrate the learned ODE from τ=0 to τ=1. A few Euler steps suffice because the OT paths are nearly straight:

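$$\frac{dx_\tau}{d\tau} = v_\theta(x_\tau, \tau, o), \qquad \text{Euler:}\quad x_{\tau + \Delta} = x_\tau + \Delta \cdot v_\theta(x_\tau, \tau, o)$$

# Flow-matching sampling: noise -> action chunk (PyTorch assumed)
import torch

@torch.no_grad()
def sample_chunk(v_theta, o, chunk_shape, n=10):
    """Euler-integrate dx/dtau = v_theta(x, tau, o) from tau = 0 to tau = 1."""
    x = torch.randn(chunk_shape)       # x0 ~ N(0, I)
    dt = 1.0 / n                       # n integration steps, often 4..10
    for i in range(n):
        tau = i * dt
        v = v_theta(x, tau, o)         # one network eval per step
        x = x + dt * v                 # Euler step along the flow
    return x                           # x ~ p(action_chunk | o), up to integration error

# o = encode_observation(images, state, lang)   # observation encoder, defined elsewhere
# chunk = sample_chunk(v_theta, o, chunk_shape)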

Diffusion typically needs N≈50–100 evals; flow matching needs ~4–10. For our LeKiwi async runner — where the policy must emit a fresh chunk faster than the arm exhausts the last one — that latency gap is decisive.
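To see why straight paths need so few steps, here is a toy check that swaps the learned network for a closed-form field: with a single target chunk, the velocity (x1 − x)/(1 − τ) is the conditional field of the linear path. This is an illustration under that assumption, not a trained policy, and the names are hypothetical.

```python
import random

def euler_sample(v, x0, n_steps):
    """Integrate dx/dtau = v(x, tau) from tau = 0 to 1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / n_steps
    for i in range(n_steps):
        tau = i * dt
        vel = v(x, tau)
        x = [xi + dt * vi for xi, vi in zip(x, vel)]
    return x

target = [0.2, -0.5, 0.9]   # hypothetical flattened action chunk

def v_single_target(x, tau):
    """Conditional velocity toward one data point on the linear path."""
    return [(t - xi) / (1.0 - tau) for t, xi in zip(target, x)]

rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in target]
for n in (1, 4, 10):
    print(n, euler_sample(v_single_target, x0, n))
```

Because the trajectory is a straight line, Euler reaches the target for any step count here. A learned marginal field over many demonstrations is only approximately straight, which is why a handful of steps is used in practice rather than one.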

[Figure: transport from noise N(0,I) at τ = 0 to the action manifold p(a | o) at τ = 1. Flow matching follows straight OT paths in ~4–10 Euler steps; diffusion follows a curved, iterative path over ~50–100 steps.]
Samples ride straight lines from the noise blob to the action manifold. Diffusion's path wanders and needs far more steps.

D.4 · Discrete-token vs continuous-expert action heads

When we bolt an action head onto a VLM (page 3), there are two camps. Discrete-token heads (e.g. RT-2-style) tokenize actions into bins and let the language model autoregress them: simple to plug in, but binning loses precision and decoding is sequential. Continuous-expert heads (SmolVLA, π0) attach a small flow-matching "action expert" that outputs the whole continuous chunk in parallel at each integration step, so a chunk costs a few passes rather than one pass per token: precise, multimodal, and fast. Page 3 builds the continuous-expert path on top of exactly the CFM loss above.
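The precision cost of binning can be quantified with a small round-trip sketch. The 256 uniform bins over [-1, 1] and the helper names are assumptions for illustration; real tokenizers choose their own ranges and bin counts.

```python
def tokenize(a, n_bins=256, lo=-1.0, hi=1.0):
    """Map a continuous action in [lo, hi] to a bin index."""
    a = min(max(a, lo), hi)                       # clip to the supported range
    idx = int((a - lo) / (hi - lo) * n_bins)
    return min(idx, n_bins - 1)                   # a == hi falls in the last bin

def detokenize(idx, n_bins=256, lo=-1.0, hi=1.0):
    """Map a bin index back to the centre of its bin."""
    width = (hi - lo) / n_bins
    return lo + (idx + 0.5) * width

# Round-trip a dense grid of actions and record the worst reconstruction error.
grid = [-1.0 + 2.0 * i / 10000 for i in range(10001)]
worst = max(abs(detokenize(tokenize(a)) - a) for a in grid)
print(worst)   # bounded by half a bin width, (2 / 256) / 2
```

A continuous head has no such floor on resolution, which is one reason the continuous-expert route is attractive for fine manipulation.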

How this maps to my build: Flow matching is my shipping action head. Stable regression training, multimodal samples, and ~4–10 evals per chunk — that latency is what lets the LeKiwi's async runner keep the arm fed. On page 3 I attach it to SmolVLA as a continuous action expert.

E · Comparison matrix

Four heads, the same question each: can it represent more than one valid future, and what does it cost?

Action head | Multimodal? | Inference cost | Training stability | Used by | When to pick
MSE / Gaussian BC | No: collapses to the mean | 1 eval (cheapest) | Very stable | Vanilla BC baselines | Only when demos are truly unimodal; never for forks.
ACT (CVAE) | Partial: latent z gives some modes | 1 eval per chunk (fast) | Stable (β tuning) | ALOHA, LeRobot ACT | Best simple baseline on cheap arms; fast and reliable.
Diffusion Policy | Yes: strong, native | High: N≈50–100 evals | Stable; schedule-sensitive | Diffusion Policy, many RL/robotics works | When you need max multimodal quality and can afford latency.
Flow Matching | Yes: strong, native | Low: ~4–10 evals | Very stable (plain regression) | SmolVLA, π0 | VLAs & fast async control; my default ship.

Recommendation

For the ANIMA-Kiwi: start with ACT (fastest path to a working pick-place), benchmark Diffusion Policy as a quality ceiling, and ship Flow Matching via SmolVLA. It gives diffusion-level multimodality at roughly a tenth of diffusion's network evaluations per chunk (~4–10 against ~50–100), which is what the async runner demands; ACT's single eval per chunk remains the latency floor.
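The latency argument can be turned into a rough budget check: generating a chunk must take less time than executing one. Every number below (30 ms per network eval, 50 actions per chunk, a 30 Hz control loop) is a hypothetical placeholder, and per-eval cost differs between architectures, so measure your own values before drawing conclusions.

```python
def keeps_arm_fed(evals_per_chunk, ms_per_eval, n_action_steps, control_hz):
    """Return (ok, inference_ms, execution_ms) for one chunk."""
    inference_ms = evals_per_chunk * ms_per_eval          # time to produce a chunk
    execution_ms = 1000.0 * n_action_steps / control_hz   # time to consume a chunk
    return inference_ms < execution_ms, inference_ms, execution_ms

# Hypothetical per-eval cost and control rate, identical across heads for simplicity.
for name, evals in [("ACT", 1), ("Diffusion", 100), ("Flow", 10)]:
    print(name, keeps_arm_fed(evals, ms_per_eval=30.0, n_action_steps=50, control_hz=30.0))
```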

Lab · Action-head bake-off on the LeKiwi dataset

Lab · Train ACT vs Diffusion (and cross-link Flow/SmolVLA), then decide what to ship

Goal: train two expressive heads on your recorded pick-place dataset, compare success rate + smoothness, and verify the chunking config. The flow-matching head (SmolVLA) is trained in full on page 3 — here we set up the comparison and the decision criterion.

Verify before you run

LeRobot's CLI is version-sensitive. On 0.4.x, flags can differ between releases. Run lerobot-train --help FIRST and confirm the exact names for --policy.type, --dataset.repo_id, --steps, --batch_size, --output_dir, and --policy.device before copy-pasting anything below.

  1. Confirm the CLI surface.
    # verify flag names on YOUR installed version first
    pip show lerobot | grep -i version
    lerobot-train --help | less
  2. Set your dataset handle (the pick-place set recorded on page 1).
    export HF_USER="your-hf-username"
    # dataset repo: $HF_USER/my_lekiwi_pickplace
  3. Train ACT (CVAE baseline). Keep the default chunk size (~50).
    lerobot-train \
      --policy.type=act \
      --dataset.repo_id=$HF_USER/my_lekiwi_pickplace \
      --batch_size=8 --steps=40000 \
      --output_dir=outputs/train/lekiwi_act \
      --policy.device=cuda
  4. Train Diffusion Policy (quality reference; expect slower inference).
    lerobot-train \
      --policy.type=diffusion \
      --dataset.repo_id=$HF_USER/my_lekiwi_pickplace \
      --batch_size=8 --steps=40000 \
      --output_dir=outputs/train/lekiwi_diffusion \
      --policy.device=cuda
  5. Inspect the chunking config in each checkpoint. This is the gotcha from §B.
    # n_action_steps MUST equal the chunk length (~50), not 1
    python -c "import json; c=json.load(open('outputs/train/lekiwi_act/checkpoints/last/pretrained_model/config.json')); print('chunk_size', c.get('chunk_size'), '| n_action_steps', c.get('n_action_steps'))"

    If n_action_steps is 1, expect jerky stuttering at deploy — fix it before evaluating.

  6. Roll out both on the LeKiwi (or in eval) and log two numbers per head: success rate (fraction of clean pick-places) and smoothness (mean |Δ²a|, the mean absolute second difference of the action sequence, used here as a jerk proxy). Lower is smoother.
  7. Cross-link Flow Matching. The flow-matching head is SmolVLA — trained end-to-end on page 3 · VLAs & Body. Note its per-chunk inference latency there and slot it into the same success/smoothness table.
  8. Decide. Fill the table and pick your ship head.
    head        | success | smoothness(jerk) | evals/chunk | latency
    ACT         |   __%   |       __         |     1       |   __ ms
    Diffusion   |   __%   |       __         |   50-100    |   __ ms
    Flow (pg.3) |   __%   |       __         |    4-10     |   __ ms
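The two numbers from step 6 can be computed with a short helper. This is a minimal sketch for a single action dimension, with hypothetical function names; for a multi-joint arm you would average the smoothness value over dimensions, and the logged episode outcomes below are made up for the example.

```python
def mean_abs_second_diff(actions):
    """Mean |a[t+1] - 2*a[t] + a[t-1]| over a 1-D action trajectory."""
    if len(actions) < 3:
        raise ValueError("need at least 3 actions")
    diffs = [abs(actions[t + 1] - 2.0 * actions[t] + actions[t - 1])
             for t in range(1, len(actions) - 1)]
    return sum(diffs) / len(diffs)

def success_rate(outcomes):
    """Fraction of episodes marked as a clean pick-place (True)."""
    return sum(1 for ok in outcomes if ok) / len(outcomes)

smooth = [0.0, 1.0, 2.0, 3.0, 4.0]    # steady ramp
jerky = [0.0, 1.0, 0.0, 1.0, 0.0]     # stuttering back and forth
print(mean_abs_second_diff(smooth), mean_abs_second_diff(jerky))  # 0.0 2.0
print(success_rate([True, True, False, True]))                    # 0.75
```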

Done when

  • You ran lerobot-train --help and confirmed the flags on your installed version.
  • ACT and Diffusion checkpoints exist under outputs/train/....
  • You verified n_action_steps ≈ chunk_size (~50) in each config.json — not 1.
  • You logged success rate and smoothness (jerk) for each head.
  • You can state, in one sentence, which action head you'll ship and WHY — explicitly trading off multimodality against inference latency for the async runner.

ETH papers — the source material

Short reads behind this page, with how each one lands in the build.

Decision Transformer Lecture 7 · Sequence Modeling

Chen et al. Reframes RL as sequence modeling: a causal transformer is conditioned on a desired return-to-go plus past states and actions, and simply predicts the next action autoregressively — no value function, no bootstrapping. Offline trajectories become "(return, state, action)" tokens, and at test time you prompt with a target return to steer behavior. Its lasting lesson for us is architectural: treat control as predicting tokens over a context window, which is exactly the mindset that makes action chunking (predict many future actions at once) natural, and that later VLAs inherit when they autoregress or transport action tokens.

Maps to my build

Justifies the sequence/chunk view of control; context for ACT's transformer and for VLA action heads on page 3.

ALOHA / ACT Lecture 7 · Sequence Modeling

Zhao et al., "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware." Two contributions we use directly: cheap teleoperated bimanual hardware (ALOHA), and ACT — Action Chunking with Transformers. ACT predicts a chunk of ~50 future actions in one pass and is trained as a conditional VAE so it captures the multimodality of human demos; temporal ensembling averages overlapping chunks for smooth motion. The result: precise, contact-rich manipulation learned from a handful of demos on inexpensive arms — precisely the regime of the LeKiwi.

Maps to my build

My §B baseline head — --policy.type=act in the lab, chunk≈50, z=0 at deploy.

Diffuser / Planning with Diffusion Lecture 6 · Generative Models

Janner & Du et al. Treats planning as generation: a diffusion model is trained over whole trajectories and sampled to produce plans, with guidance (e.g. reward gradients) steering the denoising toward high-value, feasible behavior. The key idea inherited by Diffusion Policy is that denoising a full action/trajectory sequence yields coherent, multimodal plans rather than greedy single steps — and that conditioning/guidance during sampling lets you bias which mode you land in.

Maps to my build

Foundation for §C Diffusion Policy — the quality-ceiling head I benchmark.

Implicit Behavioral Cloning Lecture 6 · Generative Models

Florence et al. Argues that explicit (regression) policies are fundamentally limited on multimodal, discontinuous demos because they fit a single output. The fix is an implicit, energy-based policy: learn an energy E(o,a) and act by minimizing it, which can represent sharp, multi-valued action distributions. It is the cleanest articulation of why MSE BC fails at forks — and motivates the whole jump to expressive heads (CVAE, diffusion, flow) on this page.

Maps to my build

The "why MSE fails" argument that justifies this entire page.
