Action Models — ANIMA-Kiwi

A · Recap & roadmap

Recall the compounding-error result from page 1: a behavior-cloning policy with per-step error ε drifts off the expert distribution at rate O(εT²) over a horizon T. But there is a deeper failure even with zero noise: a unimodal head trained by MSE collapses two valid trajectories into their average, which is often invalid. The cure is a head that can represent a distribution over action chunks, not a single mean. Below are the three families we cover, all conditioned on observation o (vision + state, and later language).

Three expressive heads, one job: turn an observation into a distribution over the next ~50 actions, then sample one chunk.

Figure panels: latent-variable (CVAE/ACT) · denoising (Diffusion) · transport (Flow Matching)

Foreshadow: by page 3 we converge on flow matching, because both SmolVLA and π0 attach a flow-matching action expert to a VLM backbone. The math here is the foundation for that.

B · Action chunking & ACT

Maps to ETH Lecture 7 · Sequence Modeling — Zhao et al., "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware" (ALOHA / ACT)

B.1 · Action chunking — predict k actions at once

Instead of predicting one action per observation and re-querying every step, ACT predicts a chunk of the next k actions in a single forward pass. This is the single most important trick for cheap-hardware imitation learning, and it attacks compounding error directly.

standard BC: a_t = π(o_t) → decide every step, horizon = T

chunked: a_{t:t+k} = π(o_t) → decide every k steps, effective horizon = T/k

If error compounds like O(εT²) in the number of decision points, then cutting decision points by a factor of k cuts the compounding term by roughly k². Chunking also lets the policy commit to a coherent motion (e.g. a full grasp) rather than re-deciding mid-reach, which is where unimodal indecision causes pauses and jitter.
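As a rough worked example of that scaling, the sketch below plugs in illustrative numbers. T = 500, k = 50 and ε = 0.01 are assumptions chosen for the arithmetic, and the O(εT²) bound is only an order-of-magnitude guide, not a prediction of real drift.

```python
# Illustrative only: how chunking shrinks the number of decision points
# and the quadratic compounding term. T, k and eps are assumed values.
def compounding_term(eps: float, decisions: int) -> float:
    """Order-of-magnitude proxy for the O(eps * T^2) drift bound."""
    return eps * decisions ** 2

T, k, eps = 500, 50, 0.01
per_step = compounding_term(eps, T)        # decide every step: T decisions
chunked = compounding_term(eps, T // k)    # decide every k steps: T/k decisions
reduction = per_step / chunked             # equals k^2 when k divides T
print(per_step, chunked, reduction)
```

The ratio depends only on k, which is why the chunk length matters more than the exact value of ε.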

Temporal ensembling

Overlapping chunks predicted at consecutive timesteps give several predictions for the same future action. ACT averages them with an exponential weight, which smooths the executed trajectory without adding latency:

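â_t = Σ_i w_i a_t^(i) / Σ_i w_i,  w_i = exp(−m·i)

where a_t^(i) is the i-th available prediction for timestep t, with i = 0 the oldest (the ACT paper's convention). For m > 0 older predictions therefore carry more weight, and a smaller m folds new predictions in faster.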
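A minimal sketch of that weighted average for a single timestep, in plain Python. It assumes the predictions are ordered oldest first (i = 0 is the oldest, following the ACT paper) and uses m = 0.01 purely as an example value; `ensemble_action` is a hypothetical helper, not a LeRobot function.

```python
import math

def ensemble_action(predictions, m=0.01):
    """Exponentially weighted average of overlapping predictions for one timestep.

    predictions: list of action vectors (lists of floats), ordered oldest first,
                 so index i = 0 is the oldest prediction.
    m:           decay coefficient; weights are w_i = exp(-m * i).
    """
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    dim = len(predictions[0])
    return [
        sum(w * p[d] for w, p in zip(weights, predictions)) / total
        for d in range(dim)
    ]

# Three chunks predicted at consecutive steps all cover the same timestep t.
preds_for_t = [[0.10, 0.50], [0.12, 0.48], [0.11, 0.52]]
a_hat = ensemble_action(preds_for_t, m=0.01)
print(a_hat)
```

With m = 0 this reduces to a plain mean; larger m leans harder on whichever end of the list holds index 0.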

Practical gotcha

At deploy, the runner's n_action_steps (how many actions you actually execute before re-querying the policy) must match the chunk length (about 50 on LeRobot ACT), not 1, unless you are deliberately running temporal ensembling, which re-queries every step and averages the overlapping chunks. If you re-query every step without ensembling, you throw away chunking, the policy chatters between modes, and the arm moves in jerky stutters. Always confirm n_action_steps ≈ chunk_size in the checkpoint config (see the lab).

B.2 · ACT is a conditional VAE

A chunk of demonstrated actions is multimodal: the same observation can correspond to several valid motions, and within a chunk there is natural variation in style. ACT models p(a_chunk | o) with a latent variable z (a conditional VAE), so the decoder can produce different coherent chunks for the same observation by varying z.

At training the encoder sees the true chunk; at test it is discarded and z comes from the prior. The decoder is always conditioned on o.

Figure legend: encoder path (train only) · decoder path (train + test)

The ELBO, derived

We want to maximize the conditional log-likelihood of the demonstrated chunk. It is intractable, so we lower-bound it. Introduce a variational posterior q(z|a,o) and apply Jensen's inequality:

log p(a|o) = log ∫ p(a|z,o) p(z) dz

= log 𝔼_{q(z|a,o)}[ p(a|z,o) p(z) / q(z|a,o) ]

≥ 𝔼_{q(z|a,o)}[ log p(a|z,o) ] − KL( q(z|a,o) ‖ p(z) )

This is the ELBO, the objective ACT maximizes (equivalently, it minimizes the negative ELBO). Two terms:

Reconstruction 𝔼_q[log p(a|z,o)]: decode z and o back into the chunk. With a Gaussian decoder this reduces to an L2 reconstruction loss on the predicted actions; with a Laplace decoder, to an L1 loss (the one ACT uses).
KL regularization KL(q‖p): pulls the posterior toward the prior p(z) = N(0,I) so the latent space is smooth and samplable at test time. ACT weights it with a coefficient β (β-VAE style; the ACT paper sets β = 10).

L_ACT = ‖a − â(z,o)‖_1 + β·KL( q(z|a,o) ‖ N(0,I) )
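To make the two terms concrete, here is a small sketch of the loss for one flattened chunk. It assumes a diagonal-Gaussian posterior parameterized by a mean and a log-variance, for which the KL to N(0, I) has the closed form 0.5 · Σ(μ² + σ² − 1 − log σ²). The function names and the β value passed in are illustrative, not taken from any library.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )

def act_loss(actions, predicted, mu, log_var, beta):
    """L1 reconstruction (summed over elements) plus beta-weighted KL."""
    l1 = sum(abs(a - p) for a, p in zip(actions, predicted))
    return l1 + beta * kl_to_standard_normal(mu, log_var)

# Toy values: a 2-element "chunk" and a 1-dimensional latent.
loss = act_loss(
    actions=[1.0, 2.0],
    predicted=[1.5, 2.0],
    mu=[0.0],
    log_var=[0.0],   # posterior equals the prior, so the KL term is zero
    beta=10.0,       # assumed weight for illustration
)
print(loss)
```

When the posterior matches the prior the KL term vanishes and only the reconstruction error remains, which is the regime the z = 0 deployment mode relies on.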

Reparameterization trick

You cannot backprop through a random sample. So instead of sampling z ∼ N(μ,σ²) directly, write it as a deterministic function of a fixed-noise variable:

z = μ + σ ⊙ ε, ε ∼ N(0, I)

Now the randomness lives in ε, and gradients flow cleanly into μ and σ.
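The sketch below only illustrates the sampling side of the trick: drawing ε from a fixed N(0, I) and transforming it gives samples with the requested mean and standard deviation. Plain Python has no autodiff, so the gradient flow itself is not shown; in a deep-learning framework μ and σ would be tensors produced by the encoder. The values of μ, σ and the sample count are assumptions.

```python
import random
import statistics

def reparameterize(mu, sigma, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1), elementwise."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

rng = random.Random(0)                 # fixed seed for repeatability
mu, sigma = [1.0, -2.0], [0.5, 0.1]    # illustrative posterior parameters
samples = [reparameterize(mu, sigma, rng) for _ in range(20000)]

# Empirical statistics of the first latent dimension.
mean_0 = statistics.fmean(z[0] for z in samples)
std_0 = statistics.pstdev([z[0] for z in samples])
print(mean_0, std_0)
```

Because z is an affine function of μ and σ for a given ε, its derivatives with respect to both are well defined, which is what lets the encoder be trained by backprop.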

At test time

Discard the encoder entirely (it needed the ground-truth chunk, which you don't have at deploy). Either sample z ∼ N(0,I) for stochastic, multimodal chunks, or set z = 0 (the prior mean) for the deterministic "most-likely" chunk; LeRobot's ACT defaults to the latter for stable rollouts. Decode with o to get â_{1:k}, execute the chunk, slide forward, repeat.

Rolling chunks. Overlap is what temporal ensembling averages over; the execution stride is n_action_steps.

How this maps to my build: ACT is the simplest expressive head and LeRobot's most battle-tested for cheap arms — it is my baseline. I'll keep chunk_size≈50, set n_action_steps to match, and default z = 0 for repeatable pick-place rollouts on the LeKiwi.

C · Diffusion Policy

Maps to ETH Lecture 6 · Generative Models — Janner/Du "Diffuser / Planning with Diffusion", Florence et al. "Implicit Behavioral Cloning", Wagenmaker et al. "Steering Diffusion Policies with Latent-Space RL"

Diffusion Policy treats the action chunk as something to be denoised into existence. Start from pure Gaussian noise and iteratively clean it, conditioned on the observation, until a coherent action sequence emerges. Because it samples from the learned distribution instead of regressing a mean, it represents multimodality natively.

C.1 · Forward process (fixed noising)

Take a clean action chunk a_0 from the dataset and gradually add Gaussian noise over diffusion steps t = 1…N. The closed form for any step uses the cumulative product ᾱ_t of the noise schedule:

q(a_t | a_0) = N( a_t; √ᾱ_t a_0, (1−ᾱ_t) I )

equivalently a_t = √ᾱ_t a_0 + √(1−ᾱ_t) ε,  ε ∼ N(0, I)

As t→N, ᾱ_t→0 and a_t becomes pure noise. This process is not learned; it only defines the corruption we will reverse.
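A minimal sketch of the forward formula, assuming a linear β schedule with α_t = 1 − β_t (one common choice; the schedule, its endpoints, and N = 1000 are assumptions here, picked so that ᾱ_N ends up close to zero). `alpha_bar_schedule` and `q_sample` are hypothetical helper names.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of alpha_t = 1 - beta_t for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta_t = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta_t
        alpha_bars.append(running)
    return alpha_bars   # alpha_bars[t - 1] holds alpha-bar for step t = 1..N

def q_sample(a0, alpha_bar_t, rng):
    """Draw a_t = sqrt(abar_t) * a_0 + sqrt(1 - abar_t) * eps; returns (a_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in a0]
    scale, noise = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    a_t = [scale * a + noise * e for a, e in zip(a0, eps)]
    return a_t, eps

alpha_bars = alpha_bar_schedule(1000)
# At the last step almost nothing of the clean chunk survives.
a_t, eps = q_sample([0.2, -0.1, 0.4], alpha_bars[-1], random.Random(0))
print(alpha_bars[0], alpha_bars[-1])
```

Returning ε alongside a_t is convenient because the training loss needs exactly that noise as its regression target.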

C.2 · Training objective (predict the noise)

Train a network ε_θ to predict the noise that was added, given the noisy chunk, the step index, and the observation conditioning o. This is the DDPM simple loss:

L_DDPM(θ) = 𝔼_{t, a_0, ε, o} ‖ ε − ε_θ(a_t, t, o) ‖²

a_t: the noised chunk built from a_0 and ε via the forward formula.
t: diffusion step (embedded); tells the net how much noise to expect.
o: observation conditioning (vision + state); the policy never denoises blind.
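The loss above can be sketched as a single Monte Carlo sample, with the network passed in as a callable. Everything named here is hypothetical: `ddpm_loss_sample`, the toy ᾱ values, and the stand-in predictor that always outputs zero noise. A real ε_θ would be a trained network conditioned on o.

```python
import math
import random

def ddpm_loss_sample(eps_theta, a0, o, alpha_bars, rng):
    """One Monte Carlo sample of || eps - eps_theta(a_t, t, o) ||^2."""
    t = rng.randrange(1, len(alpha_bars) + 1)     # t ~ Uniform{1..N}
    ab = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in a0]       # the noise to be predicted
    a_t = [math.sqrt(ab) * a + math.sqrt(1.0 - ab) * e for a, e in zip(a0, eps)]
    pred = eps_theta(a_t, t, o)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))

def zero_predictor(a_t, t, o):
    """Stand-in for the network: always predicts zero noise."""
    return [0.0] * len(a_t)

alpha_bars = [0.9, 0.7, 0.4, 0.1]                 # toy schedule, assumed values
loss = ddpm_loss_sample(zero_predictor, [0.2, -0.1], None, alpha_bars, random.Random(0))
print(loss)
```

Averaging this quantity over a batch of (a_0, o) pairs and taking a gradient step on θ is the whole training loop.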

C.3 · Reverse process (sampling)

To act, sample a_N ∼ N(0,I) and walk the chain backward, subtracting predicted noise each step (DDPM form):

a_{t−1} = (1/√α_t)( a_t − (1−α_t)/√(1−ᾱ_t) · ε_θ(a_t,t,o) ) + σ_t z,  z ∼ N(0, I)

After N steps you arrive at a clean action chunk a_0. The stochastic term σ_t z (fresh Gaussian noise at each step, unrelated to the CVAE latent of §B) is exactly what lets two runs from the same observation land on different valid modes.
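One reverse update can be written directly from the formula. The sketch below takes the predicted noise as an argument rather than calling a network, and checks the algebra on the simplest case: at t = 1 (where ᾱ_1 = α_1), feeding in the true noise with σ = 0 recovers the clean chunk. The helper name and all numeric values are assumptions.

```python
import math
import random

def ddpm_reverse_step(a_t, eps_pred, alpha_t, alpha_bar_t, sigma_t, rng):
    """a_{t-1} = (a_t - (1 - alpha_t) / sqrt(1 - abar_t) * eps_pred) / sqrt(alpha_t) + sigma_t * z."""
    coef = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_bar_t)
    return [
        (a - coef * e) / math.sqrt(alpha_t) + sigma_t * rng.gauss(0.0, 1.0)
        for a, e in zip(a_t, eps_pred)
    ]

# Sanity check at t = 1: noise a clean chunk once, then undo it with the true noise.
a0 = [0.3, -0.7]
true_eps = [0.5, -1.2]
alpha_1 = 0.9                                   # toy value; abar_1 == alpha_1
a1 = [math.sqrt(alpha_1) * a + math.sqrt(1.0 - alpha_1) * e
      for a, e in zip(a0, true_eps)]
recovered = ddpm_reverse_step(a1, true_eps, alpha_1, alpha_1, 0.0, random.Random(0))
print(recovered)
```

In a full sampler this step runs for t = N down to 1, with ε_θ(a_t, t, o) supplying `eps_pred` and a nonzero σ_t on all but the final step.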

Why diffusion beats MSE on multimodality

An MSE head minimizes 𝔼‖a−â‖², whose optimum is the conditional mean 𝔼[a|o] — the invalid average between two modes. Diffusion instead learns a score/noise field whose reverse process draws samples from the full p(a|o). Each rollout commits to one mode; it never blends "go left" and "go right" into "crash."
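A tiny numeric illustration of the mean-collapse argument, using a made-up one-dimensional steering action where demonstrations for the same observation go either left (−1) or right (+1):

```python
import statistics

# Demonstrated actions for the SAME observation: steer left (-1) or right (+1).
demos = [-1.0] * 50 + [1.0] * 50

def mse(prediction, targets):
    """Mean squared error of a single constant prediction."""
    return statistics.fmean((t - prediction) ** 2 for t in targets)

mse_optimum = statistics.fmean(demos)      # the MSE minimizer is the mean
candidates = [-1.0, 0.0, 1.0]
losses = {c: mse(c, demos) for c in candidates}
print(mse_optimum, losses)
```

The constant 0.0 has the lowest loss even though no demonstration ever took that action, while both valid modes score worse. A sampling-based head avoids this by returning −1 or +1 on any given rollout.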

Figure: each reverse step subtracts predicted noise, and the action curve sharpens from noise into a committed motion (one ε_θ evaluation per step). Panels: forward noising (fixed) · reverse denoising (learned).

How this maps to my build: Diffusion Policy is my strongest multimodal head and a great quality bar — but N≈50–100 network evals per chunk is heavy for the LeKiwi's async runner. I'll train it as a quality reference, then check whether flow matching gets me the same multimodality at a fraction of the inference cost.

D · Flow Matching (heart of the page)

This is what SmolVLA & π0 use — page 3

Flow matching keeps diffusion's expressivity but replaces the noisy, many-step stochastic chain with a deterministic transport: learn a velocity field that carries noise to data along a smooth path, then integrate it with a handful of ODE steps. Far fewer function evaluations means far faster inference — which is exactly what our async runner needs.

D.1 · Continuous normalizing flows & probability paths

Imagine a continuous time τ ∈ [0,1]. At τ=0 we have an easy distribution x_0 ∼ N(0,I); at τ=1 we want the data distribution x_1 (an action chunk). A probability path is a continuum of distributions connecting them, and a velocity field v(x,τ) is the flow that pushes samples along that path. If we know the right velocity field, we can transport any noise sample to a data sample by following it.

D.2 · Conditional Flow Matching with the OT (linear) path

The trick that makes this trainable: instead of matching the intractable marginal velocity, match a conditional one defined per data point. Pick a pair (noise x_0, data x_1) and connect them with the simplest possible path, a straight line (the optimal-transport interpolant):

x_τ = (1 − τ) x_0 + τ x_1,  τ ∈ [0,1]

Now derive the target velocity. Since x_τ is linear in τ, differentiate:

dx_τ / dτ = d/dτ [ (1−τ) x_0 + τ x_1 ] = −x_0 + x_1 = x_1 − x_0

The target velocity is constant along the path: just the straight-line displacement x_1 − x_0. So we train a network v_θ to regress it. This is the centerpiece:

L_CFM(θ) = 𝔼_{τ, x_0, x_1, o} ‖ v_θ(x_τ, τ, o) − (x_1 − x_0) ‖²

Every symbol:

τ ∼ U[0,1]: random time along the path.
x_0 ∼ N(0,I): a fresh noise sample (the source).
x_1: a real action chunk from the dataset (the target).
x_τ = (1−τ) x_0 + τ x_1: the interpolated point fed to the net.
o: observation/conditioning (vision + language + proprioceptive state).
v_θ: the learned velocity field; target is the constant x_1 − x_0.

Note the loss is a plain regression — no KL, no noise schedule, no variational posterior. That is why flow matching is famously stable to train.
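As a sketch of how little machinery the CFM objective needs, here is one Monte Carlo sample of the loss with the network passed in as a callable. The names are hypothetical. The toy "oracle" field is the exact velocity for a dataset containing a single chunk x_1, namely (x_1 − x)/(1 − τ), which equals x_1 − x_0 everywhere on the straight path and therefore gives zero loss.

```python
import random

def cfm_loss_sample(v_theta, x1, o, rng):
    """One Monte Carlo sample of || v_theta(x_tau, tau, o) - (x1 - x0) ||^2."""
    tau = rng.random()                                  # tau ~ U[0, 1)
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]              # noise sample (source)
    x_tau = [(1.0 - tau) * n + tau * d for n, d in zip(x0, x1)]
    target = [d - n for n, d in zip(x0, x1)]            # constant velocity x1 - x0
    pred = v_theta(x_tau, tau, o)
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def make_single_point_field(x1):
    """Exact velocity field when the dataset holds only the chunk x1."""
    def v(x_tau, tau, o):
        return [(d - x) / (1.0 - tau) for d, x in zip(x1, x_tau)]
    return v

x1 = [0.4, -0.2, 0.1]                                   # toy "action chunk"
oracle = make_single_point_field(x1)
loss = cfm_loss_sample(oracle, x1, None, random.Random(0))
print(loss)
```

With a real dataset the network has to average over many (x_0, x_1) pairs passing through the same x_τ, so the loss does not reach zero, but the training step is the same regression.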

D.3 · Sampling — integrate the ODE

At inference, draw noise x_0 ∼ N(0,I) and integrate the learned ODE from τ=0 to τ=1. A few Euler steps suffice because the OT paths are nearly straight:

dx/dτ = v_θ(x_τ, τ, o),  Euler: x_{τ+Δ} = x_τ + Δ·v_θ(x_τ, τ, o)

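# Flow-matching sampling: noise -> action chunk
import numpy as np

def sample_chunk(v_theta, o, chunk_shape, n=10, rng=None):
    """Euler-integrate dx/dτ = v_theta(x, τ, o) from τ = 0 to τ = 1."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(chunk_shape)   # x0 ~ N(0, I)
    dt = 1.0 / n                           # n integration steps, often 4..10
    for i in range(n):
        tau = i * dt
        v = v_theta(x, tau, o)             # one network eval per step
        x = x + dt * v                     # Euler step along the flow
    return x                               # x ~ p(action_chunk | o)

# o = encode_observation(images, state, lang) is computed once by the caller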

Diffusion typically needs N≈50–100 evals; flow matching needs ~4–10. For our LeKiwi async runner — where the policy must emit a fresh chunk faster than the arm exhausts the last one — that latency gap is decisive.
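To see why the evaluation count matters for an async runner, it helps to compare the time the arm spends executing one chunk against the time needed to produce the next. The numbers below (30 Hz control, 20 ms per network evaluation, the same per-eval cost for both heads) are purely hypothetical placeholders; substitute latencies measured on your own hardware.

```python
def chunk_budget_ms(n_action_steps, control_hz):
    """Time the arm takes to execute one chunk, in milliseconds."""
    return 1000.0 * n_action_steps / control_hz

def inference_ms(evals_per_chunk, ms_per_eval):
    """Total network time needed to produce one chunk."""
    return evals_per_chunk * ms_per_eval

# All values below are assumptions for illustration, not measurements.
budget = chunk_budget_ms(n_action_steps=50, control_hz=30)
flow = inference_ms(evals_per_chunk=10, ms_per_eval=20.0)
diffusion = inference_ms(evals_per_chunk=100, ms_per_eval=20.0)

for name, cost in [("flow", flow), ("diffusion", diffusion)]:
    status = "fits" if cost < budget else "misses"
    print(f"{name}: {cost:.0f} ms vs budget {budget:.0f} ms -> {status}")
```

Under these assumed numbers the 10-step sampler leaves ample headroom while the 100-step sampler overruns the chunk, which is the kind of check worth repeating with real timings.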

Figure: flow matching carries samples along straight lines from the noise blob to the action manifold in a few steps; diffusion's path is curved and needs many more steps.

D.4 · Discrete-token vs continuous-expert action heads

When we bolt an action head onto a VLM (page 3), there are two camps. Discrete-token heads (e.g. RT-2-style) tokenize actions into bins and let the language model autoregress them: simple to plug in, but binning loses precision and decoding is sequential. Continuous-expert heads (SmolVLA, π0) attach a small flow-matching "action expert" that produces the whole continuous chunk in parallel over a few integration steps, rather than one token at a time: precise, multimodal, and fast. Page 3 builds the continuous-expert path on top of exactly the CFM loss above.
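The precision cost of binning is easy to quantify. The sketch below assumes actions normalized to [−1, 1] and 256 uniform bins (a commonly used setting, taken here as an assumption); decoding returns the bin center, so the worst-case error is half a bin width.

```python
def quantize(value, low=-1.0, high=1.0, bins=256):
    """Map a continuous action to a bin index, then back to the bin center."""
    width = (high - low) / bins
    index = min(int((value - low) / width), bins - 1)   # clamp value == high
    return index, low + (index + 0.5) * width

action = 0.1234
index, decoded = quantize(action)
error = abs(action - decoded)
max_error = (1.0 - (-1.0)) / 256 / 2     # half a bin width
print(index, decoded, error, max_error)
```

A continuous head has no such floor on resolution, which matters most for fine, contact-rich motions.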

How this maps to my build: Flow matching is my shipping action head. Stable regression training, multimodal samples, and ~4–10 evals per chunk — that latency is what lets the LeKiwi's async runner keep the arm fed. On page 3 I attach it to SmolVLA as a continuous action expert.

E · Comparison matrix

Four heads, the same question each: can it represent more than one valid future, and what does it cost?

Action head	Multimodal?	Inference cost	Training stability	Used by	When to pick
MSE / Gaussian BC	No — collapses to the mean	1 eval (cheapest)	Very stable	Vanilla BC baselines	Only when demos are truly unimodal; never for forks.
ACT (CVAE)	Partial — latent z gives some modes	1 eval per chunk (fast)	Stable (β tuning)	ALOHA, LeRobot ACT	Best simple baseline on cheap arms; fast and reliable.
Diffusion Policy	Yes — strong, native	High — N≈50–100 evals	Stable; schedule-sensitive	Diffusion Policy, many RL/robotics works	When you need max multimodal quality and can afford latency.
Flow Matching	Yes — strong, native	Low — ~4–10 evals	Very stable (plain regression)	SmolVLA, π0	VLAs & fast async control — my default ship.

Recommendation

For the ANIMA-Kiwi: start with ACT (fastest path to a working pick-place), benchmark Diffusion Policy as a quality ceiling, and ship Flow Matching via SmolVLA. It gives diffusion-level multimodality at roughly a tenth of diffusion's evaluations per chunk (~4–10 versus 50–100), though still more than ACT's single pass, and that is the trade the async runner demands.

Lab · Action-head bake-off on the LeKiwi dataset

Lab: Train ACT vs Diffusion (and cross-link Flow/SmolVLA), then decide what to ship

Goal: train two expressive heads on your recorded pick-place dataset, compare success rate + smoothness, and verify the chunking config. The flow-matching head (SmolVLA) is trained in full on page 3 — here we set up the comparison and the decision criterion.

Verify before you run

LeRobot's CLI is version-sensitive. On 0.4.x, flags can differ between releases. Run lerobot-train --help FIRST and confirm the exact names for --policy.type, --dataset.repo_id, --steps, --batch_size, --output_dir, and --policy.device before copy-pasting anything below.

Confirm the CLI surface.

# verify flag names on YOUR installed version first
pip show lerobot | grep -i version
lerobot-train --help | less

Set your dataset handle (the pick-place set recorded on page 1).

export HF_USER="your-hf-username"
# dataset repo: $HF_USER/my_lekiwi_pickplace

Train ACT (CVAE baseline). Keep the default chunk size (~50).

lerobot-train \
  --policy.type=act \
  --dataset.repo_id=$HF_USER/my_lekiwi_pickplace \
  --batch_size=8 --steps=40000 \
  --output_dir=outputs/train/lekiwi_act \
  --policy.device=cuda

Train Diffusion Policy (quality reference; expect slower inference).

lerobot-train \
  --policy.type=diffusion \
  --dataset.repo_id=$HF_USER/my_lekiwi_pickplace \
  --batch_size=8 --steps=40000 \
  --output_dir=outputs/train/lekiwi_diffusion \
  --policy.device=cuda

Inspect the chunking config in each checkpoint. This is the gotcha from §B.

# n_action_steps MUST equal the chunk length (~50), not 1
python -c "import json; c=json.load(open('outputs/train/lekiwi_act/checkpoints/last/pretrained_model/config.json')); print('chunk_size', c.get('chunk_size'), '| n_action_steps', c.get('n_action_steps'))"

If n_action_steps is 1, expect jerky stuttering at deploy — fix it before evaluating.

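Roll out both on the LeKiwi (or in eval) and log two numbers per head: success rate (fraction of clean pick-places) and smoothness (a jerk proxy: mean |Δ²a|, the mean absolute second difference of the commanded actions over the trajectory; strictly jerk only when the actions are velocity commands). Lower = smoother.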
Cross-link Flow Matching. The flow-matching head is SmolVLA — trained end-to-end on page 3 · VLAs & Body. Note its per-chunk inference latency there and slot it into the same success/smoothness table.
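The smoothness number from the rollout step can be computed with a few lines of plain Python. This is a sketch of the mean |Δ²a| metric as defined in this lab, assuming the trajectory is logged as one action vector per control step; the function name and the toy trajectories are illustrative.

```python
def mean_abs_second_difference(trajectory):
    """Mean |a[t+1] - 2*a[t] + a[t-1]| over time and action dimensions.

    trajectory: list of action vectors (lists of floats), one per control step.
    """
    if len(trajectory) < 3:
        raise ValueError("need at least 3 steps for a second difference")
    total, count = 0.0, 0
    for prev, cur, nxt in zip(trajectory, trajectory[1:], trajectory[2:]):
        for p, c, n in zip(prev, cur, nxt):
            total += abs(n - 2.0 * c + p)
            count += 1
    return total / count

# Toy single-joint trajectories: a clean ramp and the same ramp with chatter.
smooth = [[0.1 * t] for t in range(10)]
jittery = [[0.1 * t + (0.05 if t % 2 else -0.05)] for t in range(10)]
print(mean_abs_second_difference(smooth), mean_abs_second_difference(jittery))
```

Compute it on the actions actually sent to the robot, at the same control rate for every head, so the numbers in the table are comparable.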

Decide. Fill the table and pick your ship head.

head        | success | smoothness(jerk) | evals/chunk | latency
ACT         |   __%   |       __         |     1       |   __ ms
Diffusion   |   __%   |       __         |   50-100    |   __ ms
Flow (pg.3) |   __%   |       __         |    4-10     |   __ ms

Done when

You ran lerobot-train --help and confirmed the flags on your installed version.
ACT and Diffusion checkpoints exist under outputs/train/....
You verified n_action_steps ≈ chunk_size (~50) in each config.json — not 1.
You logged success rate and smoothness (jerk) for each head.
You can state, in one sentence, which action head you'll ship and WHY — explicitly trading off multimodality against inference latency for the async runner.

ETH papers — the source material

Short reads behind this page, with how each one lands in the build.

Decision Transformer (Lecture 7 · Sequence Modeling)

Chen et al. Reframes RL as sequence modeling: a causal transformer is conditioned on a desired return-to-go plus past states and actions, and simply predicts the next action autoregressively — no value function, no bootstrapping. Offline trajectories become "(return, state, action)" tokens, and at test time you prompt with a target return to steer behavior. Its lasting lesson for us is architectural: treat control as predicting tokens over a context window, which is exactly the mindset that makes action chunking (predict many future actions at once) natural, and that later VLAs inherit when they autoregress or transport action tokens.

Maps to my build

Justifies the sequence/chunk view of control; context for ACT's transformer and for VLA action heads on page 3.

ALOHA / ACT (Lecture 7 · Sequence Modeling)

Zhao et al., "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware." Two contributions we use directly: cheap teleoperated bimanual hardware (ALOHA), and ACT — Action Chunking with Transformers. ACT predicts a chunk of ~50 future actions in one pass and is trained as a conditional VAE so it captures the multimodality of human demos; temporal ensembling averages overlapping chunks for smooth motion. The result: precise, contact-rich manipulation learned from a handful of demos on inexpensive arms — precisely the regime of the LeKiwi.

Maps to my build

My §B baseline head — --policy.type=act in the lab, chunk≈50, z=0 at deploy.

Diffuser / Planning with Diffusion (Lecture 6 · Generative Models)

Janner & Du et al. Treats planning as generation: a diffusion model is trained over whole trajectories and sampled to produce plans, with guidance (e.g. reward gradients) steering the denoising toward high-value, feasible behavior. The key idea inherited by Diffusion Policy is that denoising a full action/trajectory sequence yields coherent, multimodal plans rather than greedy single steps — and that conditioning/guidance during sampling lets you bias which mode you land in.

Maps to my build

Foundation for §C Diffusion Policy — the quality-ceiling head I benchmark.

Implicit Behavioral Cloning (Lecture 6 · Generative Models)

Florence et al. Argues that explicit (regression) policies are fundamentally limited on multimodal, discontinuous demos because they fit a single output. The fix is an implicit, energy-based policy: learn an energy E(o,a) and act by minimizing it, which can represent sharp, multi-valued action distributions. It is the cleanest articulation of why MSE BC fails at forks — and motivates the whole jump to expressive heads (CVAE, diffusion, flow) on this page.

Maps to my build

The "why MSE fails" argument that justifies this entire page.