A Vision-Language-Action (VLA) model is the brain behind the limbs: a pretrained VLM that sees and
reads instructions, bolted to an action expert that moves. We dissect SmolVLA
line by line — then run the make-or-break LeKiwi pipeline: install → calibrate → teleop → record → fine-tune → deploy → evaluate.
Part I · Theory
A · What a Vision-Language-Action model is
Maps to ETH Lecture 9 — Generalist Robot Policies. A VLA takes a pretrained
vision-language backbone (a VLM) and conditions it, at each control step, on three things: the current camera images, the language instruction, and the robot's proprioceptive state.
Fig. 1 — The VLA recipe: a pretrained VLM supplies perception & semantics; an action expert turns those features into motor commands.
The recipe is deliberately lopsided: borrow a heavyweight, internet-pretrained VLM for
perception and language grounding, then attach a comparatively small action expert that you train on
robot data. You inherit “red cube vs. blue cube,” “left vs. right,” and object permanence essentially for free.
Two action-head families (callback to Page 2)
① Discrete-token actions
Bin each action dimension, map bins to vocabulary tokens, and autoregressively decode
them like text — the RT-2 / OpenVLA style. Reuses the LLM head verbatim.
Pros: trivially reuses the language stack; multimodal by construction (sample different token sequences). Cons: quantization error caps precision; sequential decode is slow (one token at a time × dims × horizon).
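To make the quantization cap concrete, here is a minimal sketch of uniform binning. It assumes actions normalized to [-1, 1] and 256 bins per dimension; both numbers and the function names are illustrative assumptions, and RT-2 and OpenVLA each have their own tokenizer details.

```python
def encode(action, n_bins=256, low=-1.0, high=1.0):
    """Map each continuous action dimension to an integer bin index."""
    width = (high - low) / n_bins
    tokens = []
    for a in action:
        a = min(max(a, low), high)           # clip into the supported range
        idx = int((a - low) / width)         # which bin the value falls in
        tokens.append(min(idx, n_bins - 1))  # a == high belongs to the last bin
    return tokens


def decode(tokens, n_bins=256, low=-1.0, high=1.0):
    """Map bin indices back to bin centres: the best the decoder can do."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


action = [0.1234, -0.5678, 0.9999]           # hypothetical normalized action
tokens = encode(action)
recovered = decode(tokens)
errors = [abs(a - r) for a, r in zip(action, recovered)]
print(tokens, recovered, max(errors))
```

Under these assumptions the round-trip error is bounded by half a bin width (about 0.0039), no matter how well the model is trained. The decode cost also scales with dims × horizon: for example, a 6-dimensional action over a 50-step horizon would need 6 × 50 = 300 sequential tokens.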
② Continuous expert
A small flow-matching / diffusion head emits continuous action chunks in one (or few) refinement passes —
the π0 / SmolVLA style. This is the Track 1 path.
Pros: full continuous precision; whole chunk in parallel; multimodal via the stochastic ODE/diffusion. Cons: extra expert to train; needs the flow-matching machinery from Page 2.
Property | Discrete-token (RT-2/OpenVLA) | Continuous expert (π0/SmolVLA)
Precision | limited by bin width | full continuous
Inference speed | slow (sequential AR) | fast (parallel chunk)
Multimodality | sampling token paths | stochastic flow/diffusion
Extra training | none (reuse LM head) | train an action expert
How this maps to my build: ANIMA-Kiwi uses the continuous-expert path —
SmolVLA's flow-matching head — because grasping needs sub-bin precision and we want a full action chunk per pass
so async inference can keep the wheels and gripper moving smoothly.
Paper map · Generalist robot policies (Lecture 9)
Language-conditioned imitation / Play-LMP
Lynch et al.
Learn from unstructured “play” data, then condition a single policy on a free-form language goal.
A latent plan space (LMP) ties together teleoperated play with a handful of language-labeled episodes, so the
robot generalises to instructions it never saw verbatim. The seed of “one policy, many language goals.”
Maps to my build · language-prompted picking from one model
Gato — a generalist agent
Reed et al.
One transformer, one set of weights, tokenises everything — text, images, button presses,
joint torques — into a single sequence and predicts the next token. Proves a single sequence model can span
Atari, captioning, and a real arm. The discrete-token-action lineage in its purest form.
Maps to my build · contrasts our continuous head choice
π0 / π*0.6 — flow-matching VLA
Physical Intelligence
A VLM backbone plus a dedicated flow-matching action expert emitting continuous chunks at high rate;
π*0.6 adds RL-from-experience and on-robot self-improvement for long, dexterous tasks. The direct architectural
ancestor of SmolVLA's expert. (arXiv:2511.14759)
Maps to my build · SmolVLA = a small π0
Summaries are ≤200 words and intentionally operational — read the originals before quoting numbers.
B · SmolVLA, dissected
SmolVLA is a small (~450M-class) VLA built for consumer GPUs and the LeRobot stack. Every design choice trades a
little capacity for a lot of speed — exactly what a Raspberry-Pi-bodied creature needs. Here is the whole thing,
labeled, followed by why each box exists.
Legend: Track 1 / Body data path · VLM→expert feature read (intermediate layer) · skipped / unused layers
Fig. 2 — SmolVLA. Numbered boxes ①–⑥ are explained below.
Why each choice makes it small & fast
① Compact VLM backbone (SmolVLM2)
Starts from a small, efficient VLM rather than a multi-billion-parameter giant. You still inherit
real visual semantics, but at a footprint that fits a single consumer GPU for training and a modest host for inference.
① Layer-skip — read an intermediate layer
The action expert taps features from roughly the first half of the LLM layers (≈ layer N/2),
not the final layer. The upper layers are never even run for control. Cutting the LLM depth in half roughly halves
the per-step compute — the single biggest speed win.
② PixelShuffle visual-token reduction
Each frame is squeezed to ~64 visual tokens via a PixelShuffle (space-to-channel) op before
it hits the transformer. Fewer tokens → shorter sequences → quadratically cheaper attention, with little task-relevant
detail lost for tabletop manipulation.
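The token arithmetic behind this step can be sketched in plain Python. The 32×32 patch grid and the shuffle factor r = 4 below are assumptions chosen so the result matches the ~64 tokens quoted above; the function name is hypothetical, and real implementations operate on tensors rather than nested lists.

```python
def pixel_shuffle_tokens(grid, r):
    """Space-to-channel: merge each r x r block of patch features into one token.

    grid is an H x W nested list of feature vectors. The result is an
    (H / r) x (W / r) grid whose vectors are r * r times wider.
    """
    h, w = len(grid), len(grid[0])
    assert h % r == 0 and w % r == 0, "grid must divide evenly by r"
    out = []
    for i in range(0, h, r):
        row = []
        for j in range(0, w, r):
            merged = []
            for di in range(r):
                for dj in range(r):
                    merged.extend(grid[i + di][j + dj])  # concatenate neighbours
            row.append(merged)
        out.append(row)
    return out


# 32 x 32 grid of patch features, each a tiny 2-dim vector for illustration.
grid = [[[float(i), float(j)] for j in range(32)] for i in range(32)]
merged = pixel_shuffle_tokens(grid, r=4)

tokens_before = len(grid) * len(grid[0])
tokens_after = len(merged) * len(merged[0])
print(tokens_before, "->", tokens_after, "tokens;",
      len(grid[0][0]), "->", len(merged[0][0]), "channels")
```

No information is thrown away by the shuffle itself: the sequence gets r² = 16 times shorter while each token gets 16 times wider. Because attention cost grows with the square of sequence length, the saving on the visual tokens alone is roughly 16² = 256×.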
③ Layers actually skipped
Concretely: forward pass stops at N/2; the remaining layers are dropped from the graph. Memory and
latency both fall. (Tunable — but the default is “half.”)
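As a toy illustration of stopping the forward pass early, the sketch below runs only a prefix of a layer stack. The 16-layer stack, the `run_prefix` helper, and the `keep_fraction` parameter are hypothetical stand-ins, not the SmolVLA API.

```python
def run_prefix(layers, x, keep_fraction=0.5):
    """Run only the first keep_fraction of the layer stack and return its features."""
    n_keep = max(1, int(len(layers) * keep_fraction))
    for layer in layers[:n_keep]:   # layers beyond n_keep are never called
        x = layer(x)
    return x, n_keep


# Toy "layers": each one records that it ran and adds 1 to its input.
calls = []


def make_layer(i):
    def layer(x):
        calls.append(i)
        return x + 1
    return layer


layers = [make_layer(i) for i in range(16)]
features, n_keep = run_prefix(layers, 0)
print(n_keep, features, calls)
```

The `calls` list shows that only the first half of the stack executes, which is where the compute and memory savings come from.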
④ State-as-prefix
Proprioceptive state (joint angles + base velocity) is projected into a single prefix token
prepended to the sequence — cheap, and it lets every later token attend to “where the body currently is.”
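A minimal sketch of the state-as-prefix idea, assuming a single linear projection. The state layout (six joint angles plus three base velocities), the model width of 4, and the placeholder weights are all hypothetical; in a trained model the projection is learned.

```python
def project_state(state, weight, bias):
    """Linear projection: one state vector -> one token of width d_model."""
    return [sum(w * s for w, s in zip(row, state)) + b
            for row, b in zip(weight, bias)]


def prepend_state(state_token, tokens):
    """Put the state token first so later tokens can attend back to it."""
    return [state_token] + tokens


d_model = 4
state = [0.1, -0.2, 0.3, 0.0, 0.5, -0.5,   # six joint angles (hypothetical)
         0.2, 0.0, 0.1]                    # base velocity x, y, theta (hypothetical)
# Placeholder weights; in a real model these are learned.
weight = [[0.01 * (i + 1)] * len(state) for i in range(d_model)]
bias = [0.0] * d_model

vision_language_tokens = [[0.0] * d_model for _ in range(64)]
sequence = prepend_state(project_state(state, weight, bias), vision_language_tokens)
print(len(sequence), len(sequence[0]))
```

The whole body state costs one extra token in the sequence, which is why this choice is nearly free compared with the visual tokens.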
⑤ Action expert = small flow-matching transformer
A separate, smaller transformer trained with the conditional flow-matching loss from Page 2.
It outputs a whole action chunk (≈ H = 50 steps) per call, in one or few denoising passes — not one scalar at a time.
⑥ Interleaved attention
The expert alternates cross-attention (into the cached VLM features) with causal self-attention
(over the action chunk). Cross-attention grounds actions in perception; self-attention keeps the chunk internally coherent.
Flow-matching action head — the recap (from Page 2)
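The expert is trained to regress the velocity field of a probability flow that carries noise $x_0 \sim \mathcal{N}(0, \mathbf{I})$
to a clean action chunk $x_1 = a$, conditioned on the VLM features $c$:
$$x_\tau = (1-\tau)\,x_0 + \tau\,x_1, \qquad \tau \sim \mathcal{U}(0,1)$$
$$\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{\tau,\,x_0,\,a}\left[\,\left\lVert v_\theta(x_\tau, \tau, c) - (x_1 - x_0) \right\rVert^2\,\right]$$
At inference: integrate $\frac{dx}{d\tau} = v_\theta(x_\tau, \tau, c)$ from $\tau = 0$ to $\tau = 1$ in a few Euler steps to obtain the action chunk.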
Same loss you derived on Page 2 — here c is just the SmolVLM intermediate-layer features.
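The recap can be exercised end to end in a few lines of plain Python. This is a toy sketch under stated assumptions: the "expert" is replaced by an oracle velocity function, the condition c is simply the target chunk itself rather than VLM features, and all function names are hypothetical.

```python
import random


def interpolate(x0, x1, tau):
    """x_tau = (1 - tau) * x0 + tau * x1, applied per element."""
    return [(1 - tau) * a + tau * b for a, b in zip(x0, x1)]


def cfm_loss(v_fn, x0, x1, tau, c):
    """Squared error between the predicted velocity and the target x1 - x0."""
    x_tau = interpolate(x0, x1, tau)
    pred = v_fn(x_tau, tau, c)
    return sum((p - (b - a)) ** 2 for p, a, b in zip(pred, x0, x1))


def euler_sample(v_fn, x0, c, n_steps=10):
    """Integrate dx/dtau = v(x, tau, c) from tau = 0 to 1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / n_steps
    for k in range(n_steps):
        v = v_fn(x, k * dt, c)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Stand-in for a trained expert: when the condition pins down a single target
# chunk, the exact velocity field is (target - x) / (1 - tau).
def oracle_velocity(x, tau, c):
    return [(ci - xi) / (1.0 - tau) for ci, xi in zip(c, x)]


random.seed(0)
target = [0.3, -0.1, 0.8]                        # a flattened toy "action chunk"
noise = [random.gauss(0.0, 1.0) for _ in target]
loss = cfm_loss(oracle_velocity, noise, target, tau=0.4, c=target)
sample = euler_sample(oracle_velocity, noise, c=target)
print(loss, sample)
```

With the oracle, the loss is zero up to floating-point error and the Euler integration lands on the target chunk. A learned expert only approximates this field, so in practice the number of Euler steps trades latency against accuracy.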
⑥+ Async inference — the load-bearing trick
Predicting a chunk takes tens of milliseconds, yet the robot must emit a command every ~30–50 ms without stalling.
SmolVLA's answer is async inference: decouple predict from execute. While the body keeps
playing out the current chunk (the fast loop), a worker computes the next chunk (the slow loop) and swaps it
in before the old one runs out. This is the classic fast/slow decoupling pattern.
Legend: fast execute loop (on the body) · slow predict worker (GPU host)
Fig. 3 — Async inference. The robot never waits on the network or the GPU: it always has a chunk to play while the next is being computed.
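The effect of decoupling predict from execute can be seen in a small tick-based simulation. This is a sketch only, not LeRobot's async implementation: one tick stands for one control step, the prediction latency of 5 ticks and the refill threshold are made-up numbers, and the fresh chunk simply replaces whatever is left of the old one.

```python
from collections import deque


def simulate(total_ticks, chunk_len, predict_ticks, refill_at):
    """Count control ticks on which the body has no action to execute."""
    queue = deque(range(chunk_len))   # the first chunk is already available
    pending = None                    # ticks left until the in-flight chunk lands
    stalls = 0
    for _ in range(total_ticks):
        # Slow loop: request a new chunk once the queue runs low.
        if pending is None and len(queue) <= refill_at:
            pending = predict_ticks
        # Fast loop: execute one action per tick, or stall if none is left.
        if queue:
            queue.popleft()
        else:
            stalls += 1
        # The prediction makes progress in the background.
        if pending is not None:
            pending -= 1
            if pending == 0:
                queue = deque(range(chunk_len))   # swap in the fresh chunk
                pending = None
    return stalls


# refill_at=0 waits until the queue is empty (synchronous behaviour);
# refill_at=10 requests the next chunk while 10 actions are still queued.
stalls_sync = simulate(110, chunk_len=50, predict_ticks=5, refill_at=0)
stalls_async = simulate(110, chunk_len=50, predict_ticks=5, refill_at=10)
print(stalls_sync, stalls_async)
```

In this toy setup the synchronous variant stalls for the full prediction latency at the end of every chunk, while the early-refill variant never stalls as long as the threshold covers the latency. That threshold is the knob to tune against your measured round-trip time.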
How this maps to my build: Async inference is load-bearing for our demo. LeKiwi's policy
runs on a GPU host while the body lives on a Pi; the execute loop on the Pi plays the current chunk smoothly while the
GPU computes the next. This is the same fast/slow split the orchestrator's body_client relies on
(Track 1 → Page 6). If you skip it, you get jerky stop-start motion.
Theory → build bridge
SmolVLA = (small VLM with layer-skip + token reduction) + (flow-matching action expert) + (async serving).
Every one of those three is a knob you will actually touch in Part II.
Part II · Track 1 Hands-On
The LeKiwi record → train → deploy pipeline
Verify before you run
The LeRobot CLI changes between releases — you are on 0.4.x. Flag names, entrypoints, and example
paths drift. Always run lerobot-<cmd> --help and check the official docs
(lerobot/lekiwi,
lerobot/smolvla) before running anything below.
Prefer the docs over these snippets where they disagree.
Expected --help prints the current flag set; the version reads 0.4.x. If a flag below is missing, trust --help.
Lab · Calibrate Find ports, set motor IDs, calibrate both arms
Reuse the same id everywhere
Pick one --robot.id (we use my_awesome_kiwi) and use it for setup, calibrate, host, teleop, record, evaluate.
Calibration files are keyed by id — a mismatch silently loads the wrong offsets.
Find the serial port (run, unplug the device, run again, diff):
lerobot-find-port
Set motor IDs on the LeKiwi follower — arm servos 6→1, wheels 9, 8, 7:
lerobot-setup-motors \
--robot.type=lekiwi \
--robot.port=/dev/ttyACM0 # ← from find-port
On the laptop — edit remote_ip / port at the top of the example, then run:
python examples/lekiwi/teleoperate.py
Drive it. Arm follows the leader; base is on the keyboard:
W/A/S/D move · Z/X turn · R/F speed up / down
Expected The follower arm mirrors the leader in real time and the base responds to keys. Latency should feel sub-100 ms on a LAN.
Lab · Record Build the dataset — the make-or-break step
Invariant B — navigation lives in the action space
Every episode must capture driving up to the object and then grasping it as ONE continuous motion.
This is non-negotiable: it lets a single SmolVLA learn approach + grasp end-to-end, so we need
no separate motion planner. If you teleport the base then grasp, the model never learns the approach.
Authenticate to the Hub (write token):
hf auth login
Keep the Pi host running (from the Teleop lab), then on the laptop edit
remote_ip, repo_id, task (the natural-language instruction),
and NB_CYCLES_CLIENT_CONNECTION in record.py, and run:
python examples/lekiwi/record.py
Recording discipline — this is where success rate is won or lost:
≥ 50 episodes total; ~10 per object position.
Cameras fixed for the whole session (don't bump the mounts).
Object always visible at episode start.
Each episode = drive to object → grasp, one fluid take (Invariant B).
Vary object position & clutter; keep the task string consistent with how you'll prompt at deploy.
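The checklist above can be turned into a quick pre-training sanity check. The sketch below assumes you keep your own per-episode log with a position label and the task string; that log format and the `check_recording_plan` helper are hypothetical and not part of the LeRobot dataset format. It also treats "~10 per position" as a minimum.

```python
from collections import Counter


def check_recording_plan(episodes, min_total=50, min_per_position=10):
    """Return a list of human-readable problems; an empty list means the plan passes."""
    problems = []
    if len(episodes) < min_total:
        problems.append(f"only {len(episodes)} episodes, need >= {min_total}")
    per_position = Counter(ep["position"] for ep in episodes)
    for position, count in sorted(per_position.items()):
        if count < min_per_position:
            problems.append(
                f"position {position!r} has {count} episodes, want ~{min_per_position}")
    tasks = {ep["task"] for ep in episodes}
    if len(tasks) > 1:
        problems.append(f"{len(tasks)} different task strings; keep one consistent prompt")
    return problems


# Hypothetical log: 5 positions x 10 episodes, one consistent task string.
episodes = [{"position": f"P{p}", "task": "pick up the red cube"}
            for p in range(5) for _ in range(10)]
print(check_recording_plan(episodes))        # passes: no problems reported
print(check_recording_plan(episodes[:42]))   # short on total and on one position
```

A check like this catches coverage gaps, but it says nothing about episode quality; replaying on the real robot remains the actual validation step.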
Validate before training — replay the dataset on the real robot:
python examples/lekiwi/replay.py
Garbage in, garbage out
No amount of training fixes a bad dataset. If replay.py looks wrong (drift, missing frames, base not moving), re-record — do not train on it.
Expected A dataset pushed to $HF_USER/my_lekiwi_pickplace with ≥50 episodes; replay.py reproduces the recorded drive-and-grasp on the real robot.
Lab · Fine-tune Train SmolVLA on the GPU host
Ask before long runs
This is a multi-hour GPU job and downloads the smolvla_base checkpoint. Confirm before kicking it off, and check disk/VRAM headroom first.
Optional one-evening bake-off — train act or pi0 variants on the same dataset and compare success rates
(the discrete vs. continuous and ACT vs. flow-matching trade-offs from Page 2).
Expected Loss curve in W&B trending down; checkpoints under outputs/train/lekiwi_smolvla/ at each save_freq. Smoke-test the latest checkpoint before a full run.
Lab · Deploy & Evaluate Run the policy on the real Kiwi
Keep the Pi host running, then on the GPU host run the policy against the body:
# set checkpoint path + remote_ip inside the script / via its flags
python examples/lekiwi/evaluate.py
Build an eval harness: for each object position, run 10 trials, and log the outcome of each trial (success or failure, plus the failure mode when it fails).
Gotcha — jerky motion
If the robot stutters, open the checkpoint's config.json and confirm
n_action_steps equals the chunk length (~50), not 1. Executing one step per inference defeats
action chunking and async serving, and produces stop-start motion.
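This check is easy to script. The sketch below works on an already-parsed config dictionary; `n_action_steps` comes from the text above, while `chunk_size` as the key holding the chunk length is an assumption you should confirm against your own config.json.

```python
import json


def check_action_steps(config):
    """Return a warning string if n_action_steps looks wrong, else None."""
    n_action_steps = config.get("n_action_steps")
    chunk_size = config.get("chunk_size")   # assumed key name for the chunk length
    if n_action_steps is None or chunk_size is None:
        return "config is missing n_action_steps or chunk_size; inspect it by hand"
    if n_action_steps != chunk_size:
        return (f"n_action_steps={n_action_steps} but chunk_size={chunk_size}: "
                "expect stop-start motion")
    return None


# Inline stand-ins for the parsed file; with a real checkpoint you would load
# the JSON from your own config.json path instead.
good = json.loads('{"chunk_size": 50, "n_action_steps": 50}')
bad = json.loads('{"chunk_size": 50, "n_action_steps": 1}')
print(check_action_steps(good))
print(check_action_steps(bad))
```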
Expected The Kiwi drives up to the prompted object and grasps it; the harness prints a success-rate table per object position with the top failure mode.
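The reporting half of such a harness can be sketched as follows. The trial log format, the failure-mode labels, and the `summarize` helper are hypothetical; running the robot and judging success is left to you.

```python
from collections import Counter, defaultdict


def summarize(trials):
    """Group trials by object position; report success rate and top failure mode."""
    by_position = defaultdict(list)
    for t in trials:
        by_position[t["position"]].append(t)
    rows = []
    for position in sorted(by_position):
        group = by_position[position]
        successes = sum(1 for t in group if t["success"])
        failures = Counter(t["failure_mode"] for t in group if not t["success"])
        top_failure = failures.most_common(1)[0][0] if failures else "-"
        rows.append((position, len(group), successes / len(group), top_failure))
    return rows


# Hypothetical log for two positions, 10 trials each.
trials = (
    [{"position": "left", "success": True, "failure_mode": None}] * 7
    + [{"position": "left", "success": False, "failure_mode": "missed grasp"}] * 2
    + [{"position": "left", "success": False, "failure_mode": "stopped short"}] * 1
    + [{"position": "right", "success": True, "failure_mode": None}] * 10
)

for position, n, rate, top_failure in summarize(trials):
    print(f"{position:<8} n={n:<3} success={rate:.0%}  top failure: {top_failure}")
```

With only 10 trials per position, each trial moves the success rate by 10 percentage points, so treat small differences between positions or checkpoints as noise.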
Lab · Async server Wrap the policy for async inference
track1_body/policy_server.py wraps the trained checkpoint as an async policy server:
it runs the slow SmolVLA forward pass (the PREDICT worker from Fig. 3) and streams action chunks to the body, which
plays them on the fast EXECUTE loop. This is the same interface the orchestrator's body_client calls in
Page 6 — the body's “fast” half of the creature's fast/slow split.
How this maps to my build: the server is the seam between Track 1 (Body) and Track 3 (Creature):
train here, serve async, and Page 6's orchestrator just sends a language goal + reads back chunks.
Track 1 → Track 3
Track 1 · Done when
Three hosts installed; lerobot-train --help runs and version reads 0.4.x.
Follower + leader calibrated under one shared --robot.id.
Teleop works: arm mirrors the leader, base drives on W/A/S/D.
≥ 50 episodes recorded, each a single drive-then-grasp (Invariant B); replay.py validates them.
SmolVLA fine-tuned; checkpoint smoke-tested.
Deployed via the async policy server with n_action_steps = chunk length.
Language-prompted drive-and-pick succeeds ≥ 60% over 10 trials.