VLAs & The Body — ANIMA-Kiwi

Part I · Theory

A · What a Vision-Language-Action model is

Maps to ETH Lecture 9 — Generalist Robot Policies. A VLA takes a pretrained vision-language backbone (a VLM) and conditions it, at each control step, on three things:

Fig. 1 — The VLA recipe: a pretrained VLM supplies perception & semantics; an action expert turns those features into motor commands.

The recipe is deliberately lopsided: borrow a heavyweight, internet-pretrained VLM for perception and language grounding, then attach a comparatively small action expert that you train on robot data. You inherit “red cube vs. blue cube,” “left vs. right,” and object permanence essentially for free.

Two action-head families (callback to Page 2)

① Discrete-token actions

Bin each action dimension, map bins to vocabulary tokens, and autoregressively decode them like text — the RT-2 / OpenVLA style. Reuses the LLM head verbatim.

Pros: trivially reuses the language stack; multimodal by construction (sample different token sequences).
Cons: quantization error caps precision; sequential decode is slow (one token at a time × dims × horizon).
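
To make both cons concrete, here is a minimal sketch of uniform binning. It assumes actions normalised to [-1, 1] and 256 bins per dimension; both are illustrative choices, not values taken from RT-2 or OpenVLA, and the function names are hypothetical.

```python
N_BINS = 256            # assumed bins per action dimension (illustrative)
LOW, HIGH = -1.0, 1.0   # assumed normalised action range


def action_to_token(value: float) -> int:
    """Map a continuous action value to a bin index in [0, N_BINS - 1]."""
    clipped = min(max(value, LOW), HIGH)
    scaled = (clipped - LOW) / (HIGH - LOW)      # rescale to [0, 1]
    return min(int(scaled * N_BINS), N_BINS - 1)


def token_to_action(token: int) -> float:
    """Decode a bin index back to the centre of its bin."""
    width = (HIGH - LOW) / N_BINS
    return LOW + (token + 0.5) * width


def tokens_per_chunk(action_dims: int, horizon: int) -> int:
    """Sequential decode steps for one chunk: one token per dimension per step."""
    return action_dims * horizon


original = 0.1234
decoded = token_to_action(action_to_token(original))
error = abs(decoded - original)
max_error = (HIGH - LOW) / N_BINS / 2   # worst case is half a bin width

print(f"round-trip error {error:.5f}, bound {max_error:.5f}")
print("tokens for a 9-dim, 50-step chunk:", tokens_per_chunk(9, 50))
```

The round-trip error can never drop below what the bin width allows, and the token count grows with dims × horizon, which is the sequential-decode cost listed above. The 9-dim, 50-step chunk is an assumed example size.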

② Continuous expert

A small flow-matching / diffusion head emits continuous action chunks in one (or few) refinement passes — the π0 / SmolVLA style. This is the Track 1 path.

Pros: full continuous precision; whole chunk in parallel; multimodal, because each random noise sample that seeds the flow/diffusion process can land on a different action mode.
Cons: extra expert to train; needs the flow-matching machinery from Page 2.

| Property | Discrete-token (RT-2/OpenVLA) | Continuous expert (π0/SmolVLA) |
| --- | --- | --- |
| Precision | limited by bin width | full continuous |
| Inference speed | slow (sequential AR) | fast (parallel chunk) |
| Multimodality | sampling token paths | stochastic flow/diffusion |
| Extra training | none (reuse LM head) | train an action expert |

How this maps to my build: ANIMA-Kiwi uses the continuous-expert path — SmolVLA's flow-matching head — because grasping needs sub-bin precision and we want a full action chunk per pass so async inference can keep the wheels and gripper moving smoothly.

Paper map · Generalist robot policies (Lecture 9)

Language-conditioned imitation / Play-LMP

Lynch et al.

Learn from unstructured “play” data, then condition a single policy on a free-form language goal. A latent plan space (LMP) ties together teleoperated play with a handful of language-labeled episodes, so the robot generalises to instructions it never saw verbatim. The seed of “one policy, many language goals.”

Maps to my build · language-prompted picking from one model

Gato — a generalist agent

Reed et al.

One transformer with one set of weights tokenises everything (text, images, button presses, joint torques) into a single sequence and predicts the next token. It shows that a single sequence model can span Atari, captioning, and a real arm. The discrete-token-action lineage in its purest form.

Maps to my build · contrasts our continuous head choice

π0 / π*0.6 — flow-matching VLA

Physical Intelligence

A VLM backbone plus a dedicated flow-matching action expert that emits continuous chunks at a high rate; π*0.6 adds RL-from-experience and on-robot self-improvement for long, dexterous tasks. The direct architectural ancestor of SmolVLA's expert. (arXiv:2511.14759)

Maps to my build · SmolVLA = a small π0

Summaries are ≤200 words and intentionally operational — read the originals before quoting numbers.

B · SmolVLA, dissected

SmolVLA is a small (~450M-class) VLA built for consumer GPUs and the LeRobot stack. Every design choice trades a little capacity for a lot of speed — exactly what a Raspberry-Pi-bodied creature needs. Here is the whole thing, labeled, followed by why each box exists.

Fig. 2 legend: Track 1 / Body data path · VLM→expert feature read (intermediate layer) · skipped / unused layers

Fig. 2 — SmolVLA. Numbered boxes ①–⑥ are explained below.

Why each choice makes it small & fast

① Compact VLM backbone (SmolVLM2)

Starts from a small, efficient VLM rather than a multi-billion-parameter giant. You still inherit real visual semantics, but at a footprint that fits a single consumer GPU for training and a modest host for inference.

① Layer-skip — read an intermediate layer

The action expert taps features from roughly the first half of the LLM layers (≈ layer N/2), not the final layer. The upper layers are never even run for control. Cutting the LLM depth in half roughly halves the backbone's per-step compute, the single biggest speed win.

② PixelShuffle visual-token reduction

Each frame is squeezed to ~64 visual tokens via a PixelShuffle (space-to-channel) op before it hits the transformer. Fewer tokens → shorter sequences → quadratically cheaper attention, with little task-relevant detail lost for tabletop manipulation.
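
A minimal sketch of the space-to-channel idea, in plain Python so the reshaping is visible. The 32 × 32 patch grid, the 2-channel tokens, and the factor r = 4 are assumptions picked so the result lands on 64 tokens; the real op runs on tensors and its exact channel ordering may differ.

```python
def space_to_channel(grid, r):
    """Fold each r x r block of tokens into one token with r*r times the channels.

    grid: H x W nested list, each cell a list of C floats.
    Returns an (H/r) x (W/r) grid whose cells hold C*r*r floats.
    """
    height, width = len(grid), len(grid[0])
    if height % r or width % r:
        raise ValueError("grid size must be divisible by r")
    out = []
    for top in range(0, height, r):
        row = []
        for left in range(0, width, r):
            merged = []
            for dy in range(r):
                for dx in range(r):
                    merged.extend(grid[top + dy][left + dx])
            row.append(merged)
        out.append(row)
    return out


# Assumed sizes for illustration: 32 x 32 patch tokens, 2 channels each, r = 4.
patches = [[[float(y), float(x)] for x in range(32)] for y in range(32)]
shuffled = space_to_channel(patches, r=4)

tokens_before = len(patches) * len(patches[0])
tokens_after = len(shuffled) * len(shuffled[0])
# Pairwise attention cost scales with the square of the token count.
attention_cost_ratio = (tokens_before / tokens_after) ** 2

print(tokens_before, "->", tokens_after, "tokens;",
      f"visual-only attention cost shrinks about {attention_cost_ratio:.0f}x")
```

No information is discarded by the fold itself: the spatial detail moves into the channel dimension, and the following projection decides how much of it to keep. The cost ratio counts visual tokens only; language and state tokens add to the real sequence length.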

③ Layers actually skipped

Concretely: forward pass stops at N/2; the remaining layers are dropped from the graph. Memory and latency both fall. (Tunable — but the default is “half.”)
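
The truncation can be sketched with toy layers. This is an illustration of "run only the first half", not SmolVLA's code: `make_layer`, `truncated_forward`, and the 16-layer depth are hypothetical stand-ins.

```python
from typing import Callable, List

Layer = Callable[[List[float]], List[float]]


def make_layer(index: int, log: List[int]) -> Layer:
    """Toy stand-in for a transformer block: records that it ran, then shifts the features."""
    def layer(features: List[float]) -> List[float]:
        log.append(index)
        return [f + 1.0 for f in features]
    return layer


def truncated_forward(layers: List[Layer], features: List[float],
                      keep_fraction: float = 0.5) -> List[float]:
    """Run only the first keep_fraction of the stack and return those hidden features."""
    n_keep = max(1, int(len(layers) * keep_fraction))
    for layer in layers[:n_keep]:
        features = layer(features)
    return features


executed: List[int] = []
stack = [make_layer(i, executed) for i in range(16)]   # 16 is an arbitrary toy depth
hidden = truncated_forward(stack, [0.0, 0.0])

print("layers executed:", executed)
print("features handed to the action expert:", hidden)
```

Layers 8 to 15 never appear in the log, which is the point: they cost neither compute nor activations at control time.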

④ State-as-prefix

Proprioceptive state (joint angles + base velocity) is projected into a single prefix token prepended to the sequence — cheap, and it lets every later token attend to “where the body currently is.”

⑤ Action expert = small flow-matching transformer

A separate, smaller transformer trained with the conditional flow-matching loss from Page 2. It outputs a whole action chunk (≈ H = 50 steps) per call, in one or few denoising passes — not one scalar at a time.

⑥ Interleaved attention

The expert alternates cross-attention (into the cached VLM features) with causal self-attention (over the action chunk). Cross-attention grounds actions in perception; self-attention keeps the chunk internally coherent.

Flow-matching action head — the recap (from Page 2)

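The expert is trained to regress the velocity field of a probability flow that carries noise $x_0 \sim \mathcal{N}(0, \mathbf{I})$ to a clean action chunk $x_1 = a$, conditioned on the VLM features $c$:

$$x_\tau = (1-\tau)\,x_0 + \tau\,x_1, \qquad \tau \sim \mathcal{U}(0,1)$$

$$\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{\tau,\,x_0,\,a}\left[\,\lVert v_\theta(x_\tau, \tau, c) - (x_1 - x_0) \rVert^2\,\right]$$

At inference: integrate $\frac{dx}{d\tau} = v_\theta(x_\tau, \tau, c)$ from $\tau = 0$ to $\tau = 1$ in a few Euler steps to obtain the action chunk.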

Same loss you derived on Page 2 — here c is just the SmolVLM intermediate-layer features.
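
The recap can be exercised end to end in a few lines. This sketch assumes a single fixed target chunk and swaps the learned network for a closed-form "oracle" velocity, (a − x) / (1 − τ), which is the exact field when the data distribution is one point; a trained expert would replace `oracle_velocity`. All function names are hypothetical.

```python
import random


def interpolate(x0, x1, tau):
    """Point on the straight path from noise x0 to action chunk x1."""
    return [(1 - tau) * a + tau * b for a, b in zip(x0, x1)]


def cfm_target(x0, x1):
    """Regression target for the velocity: x1 - x0, constant along the path."""
    return [b - a for a, b in zip(x0, x1)]


def cfm_loss(v_pred, x0, x1):
    """Squared error between a predicted velocity and the CFM target."""
    return sum((p - t) ** 2 for p, t in zip(v_pred, cfm_target(x0, x1)))


def euler_sample(velocity, x0, n_steps=10):
    """Integrate dx/dtau = velocity(x, tau) from tau = 0 to tau = 1."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        tau = k * dt
        v = velocity(x, tau)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target_chunk = [0.2, -0.5, 0.9]   # a made-up 3-number "action chunk"


def oracle_velocity(x, tau):
    """Exact velocity field when every training action equals target_chunk."""
    return [(a - xi) / (1.0 - tau) for a, xi in zip(target_chunk, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target_chunk]
sample = euler_sample(oracle_velocity, noise, n_steps=10)
print("sampled chunk:", [round(s, 6) for s in sample])
```

With real data the field is an average over many targets, so a few Euler steps give an approximation rather than an exact hit; the number of steps is the latency/accuracy knob.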

⑥+ Async inference — the load-bearing trick

Predicting a chunk takes tens of milliseconds, but the robot must emit a command every ~30–50 ms without stalling. SmolVLA's answer is async inference: decouple predict from execute. While the body keeps playing out the current chunk (the fast loop), a worker computes the next chunk (the slow loop) and swaps it in before the old one runs out. This is the classic fast/slow decoupling pattern.

Fig. 3 lanes: fast execute loop (on the body) · slow predict worker (GPU host)

Fig. 3 — Async inference. The robot never waits on the network or the GPU: it always has a chunk to play while the next is being computed.
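
A tick-based simulation makes the benefit measurable without any hardware. It is a sketch under stated assumptions: one tick is one control period, a prediction takes 5 ticks, and a finished chunk replaces whatever is left of the old one. `count_stalls` and its parameters are hypothetical names, not part of LeRobot.

```python
def count_stalls(horizon, predict_ticks, refill_threshold, total_ticks):
    """Count control ticks on which the body has no action left to play.

    horizon          : actions per chunk
    predict_ticks    : ticks one prediction takes on the slow worker
    refill_threshold : request the next chunk once this few actions remain
    """
    remaining = horizon   # start with one full chunk already delivered
    pending = None        # ticks until the in-flight prediction lands
    stalls = 0
    for _ in range(total_ticks):
        # Slow loop: a finished prediction replaces what is left of the old chunk.
        if pending is not None:
            pending -= 1
            if pending == 0:
                remaining, pending = horizon, None
        # Ask for the next chunk when the current one is running low.
        if pending is None and remaining <= refill_threshold:
            pending = predict_ticks
        # Fast loop: play one action per tick, or stall if none is left.
        if remaining > 0:
            remaining -= 1
        else:
            stalls += 1
    return stalls


async_stalls = count_stalls(horizon=50, predict_ticks=5, refill_threshold=10, total_ticks=1000)
blocking_stalls = count_stalls(horizon=50, predict_ticks=5, refill_threshold=0, total_ticks=1000)
single_step_stalls = count_stalls(horizon=1, predict_ticks=5, refill_threshold=0, total_ticks=1000)

print("async, refill early   :", async_stalls)
print("blocking, refill at 0 :", blocking_stalls)
print("one action per call   :", single_step_stalls)
```

Requesting the next chunk while actions remain removes the stalls entirely in this toy setting; waiting until the chunk is empty stalls once per chunk, and executing a single action per prediction stalls most of the time.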

How this maps to my build: Async inference is load-bearing for our demo. LeKiwi's policy runs on a GPU host while the body lives on a Pi; the execute loop on the Pi plays the current chunk smoothly while the GPU computes the next. This is the same fast/slow split the orchestrator's body_client relies on (Track 1 → Page 6). If you skip it, you get jerky stop-start motion.

Theory → build bridge: SmolVLA = (small VLM with layer-skip + token reduction) + (flow-matching action expert) + (async serving). Every one of those three is a knob you will actually touch in Part II.

Part II · Track 1 Hands-On

The LeKiwi record → train → deploy pipeline

Verify before you run: The LeRobot CLI changes between releases, and you are on 0.4.x. Flag names, entrypoints, and example paths drift. Always run lerobot-<cmd> --help and check the official docs (lerobot/lekiwi, lerobot/smolvla) before running anything below. Prefer the docs over these snippets where they disagree.

Fig. 4 legend: Track 1 step · make-or-break step (data) · iterate / feedback

Fig. 4 — The Track 1 pipeline. Data quality at Record dominates everything downstream.

Three hosts in play throughout:

Laptop

Holds the SO-101 leader arm for teleop; sends commands to the Pi over the network.

Raspberry Pi

On the LeKiwi follower body — runs the lekiwi_host, drives wheels + arm, streams cameras.

GPU host

Fine-tunes SmolVLA and later serves the policy for async deployment.

Lab · Install: Set up LeRobot on three hosts

Create the env (every host):
```
conda create -n lerobot python=3.10 -y
conda activate lerobot
git clone https://github.com/huggingface/lerobot.git
cd lerobot
```

Laptop + Pi — robot extras (Feetech SDK + ZeroMQ for the LeKiwi link):
```
pip install -e ".[lekiwi]"
```
GPU host — SmolVLA training extras:
```
pip install -e ".[smolvla]"
```

Sanity-check that the CLI is on your PATH and the version is what you expect:
```
lerobot-train --help
python -c "import lerobot; print(lerobot.__version__)"
```

Expected: --help prints the current flag set; the version reads 0.4.x. If a flag below is missing, trust --help.

Lab · Calibrate: Find ports, set motor IDs, calibrate both arms

Reuse the same id everywhere: Pick one --robot.id (we use my_awesome_kiwi) and use it for setup, calibrate, host, teleop, record, evaluate. Calibration files are keyed by id, so a mismatch silently loads the wrong offsets.

Find the serial port (run, unplug the device, run again, diff):
```
lerobot-find-port
```

Set motor IDs on the LeKiwi follower (arm servos 6→1, wheels 9, 8, 7):
```
lerobot-setup-motors \
  --robot.type=lekiwi \
  --robot.port=/dev/ttyACM0   # ← from find-port
```

Calibrate the follower (on the Pi):
```
lerobot-calibrate \
  --robot.type=lekiwi \
  --robot.id=my_awesome_kiwi
```

Calibrate the leader (on the laptop, SO-101 leader):
```
lerobot-calibrate \
  --teleop.type=so101_leader \
  --teleop.port=/dev/ttyACM0 \
  --teleop.id=my_awesome_kiwi_leader
```

Expected: Each motor sweeps to its range limits; a calibration JSON is written under the LeRobot calibration dir, keyed by the id.

Lab · Teleop: Drive the body from the leader arm

On the Pi, start the host process (this is the body server):
```
python -m lerobot.robots.lekiwi.lekiwi_host \
  --robot.id=my_awesome_kiwi
```

On the laptop — edit remote_ip / port at the top of the example, then run:
```
python examples/lekiwi/teleoperate.py
```
Drive it. Arm follows the leader; base is on the keyboard:
WASD move · ZX turn · RF speed up / down

Expected: The follower arm mirrors the leader in real time and the base responds to keys. Latency should feel sub-100 ms on a LAN.

Lab · Record: Build the dataset, the make-or-break step

Invariant B: navigation lives in the action space. Every episode must capture driving up to the object and then grasping it as ONE continuous motion. This is non-negotiable: it lets a single SmolVLA learn approach + grasp end-to-end, so we need no separate motion planner. If you "teleport" the base into position and record only the grasp, the model never learns the approach.

Authenticate to the Hub (write token):
```
hf auth login
```
Keep the Pi host running (from the Teleop lab), then on the laptop edit remote_ip, repo_id, task (the natural-language instruction), and NB_CYCLES_CLIENT_CONNECTION in record.py, and run:
```
python examples/lekiwi/record.py
```
Recording discipline — this is where success rate is won or lost:
- ≥ 50 episodes total; ~10 per object position.
- Cameras fixed for the whole session (don't bump the mounts).
- Object always visible at episode start.
- Each episode = drive to object → grasp, one fluid take (Invariant B).
- Vary object position & clutter; keep the task string consistent with how you'll prompt at deploy.
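
The countable parts of this checklist can be verified before any training. The sketch below assumes you keep your own per-episode log with `position` and `task` fields; that log format and the `check_session` helper are hypothetical and not something record.py produces.

```python
from collections import Counter

MIN_EPISODES = 50       # from the checklist above
MIN_PER_POSITION = 10   # "~10 per object position", treated here as a floor


def check_session(episodes, task):
    """Return a list of human-readable problems; an empty list means the session passes.

    episodes: list of dicts with hypothetical keys "position" and "task".
    task:     the instruction string you plan to prompt with at deploy time.
    """
    problems = []
    if len(episodes) < MIN_EPISODES:
        problems.append(f"only {len(episodes)} episodes, need {MIN_EPISODES}")
    per_position = Counter(ep["position"] for ep in episodes)
    for position, count in sorted(per_position.items()):
        if count < MIN_PER_POSITION:
            problems.append(
                f"position {position!r} has {count} episodes, want {MIN_PER_POSITION}"
            )
    stray = {ep["task"] for ep in episodes if ep["task"] != task}
    if stray:
        problems.append(f"inconsistent task strings: {sorted(stray)}")
    return problems


TASK = "drive to the red cube and pick it up"   # example instruction, not from the source
log = [{"position": p, "task": TASK} for p in "ABCDE" for _ in range(10)]

print(check_session(log, TASK) or "session passes the countable checks")
```

Fixed cameras, object visibility, and the one-fluid-take rule still need a human eye; this only catches the bookkeeping mistakes.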
Validate before training — replay the dataset on the real robot:
```
python examples/lekiwi/replay.py
```

Garbage in, garbage out: No amount of training fixes a bad dataset. If replay.py looks wrong (drift, missing frames, base not moving), re-record and do not train on it.

Expected: A dataset pushed to $HF_USER/my_lekiwi_pickplace with ≥50 episodes; replay.py reproduces the recorded drive-and-grasp on the real robot.

Lab · Fine-tune: Train SmolVLA on the GPU host

Ask before long runs: This is a multi-hour GPU job and downloads the smolvla_base checkpoint. Confirm before kicking it off, and check disk/VRAM headroom first.

Launch fine-tuning from the pretrained base:
```
lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=$HF_USER/my_lekiwi_pickplace \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/lekiwi_smolvla \
  --job_name=lekiwi_smolvla \
  --policy.device=cuda \
  --save_freq=5000 \
  --wandb.enable=true
```

Low-VRAM fallback, LoRA via PEFT (adds adapters, raises LR):
```
lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=$HF_USER/my_lekiwi_pickplace \
  --batch_size=64 --steps=20000 \
  --output_dir=outputs/train/lekiwi_smolvla_lora \
  --job_name=lekiwi_smolvla_lora \
  --policy.device=cuda --save_freq=5000 \
  --peft.method_type=LORA --peft.r=64 --optimizer.lr=1e-3
```

Optional one-evening bake-off: train act or pi0 variants on the same dataset and compare success rates against SmolVLA. Both are continuous-action policies, so this probes the ACT vs. flow-matching trade-off from Page 2 rather than discrete vs. continuous.

Expected: Loss curve in W&B trending down; checkpoints under outputs/train/lekiwi_smolvla/ at each save_freq. Smoke-test the latest checkpoint before a full run.

Lab · Deploy & Evaluate: Run the policy on the real Kiwi

Keep the Pi host running, then on the GPU host run the policy against the body:
```
# Set the checkpoint path and remote_ip inside the script (or via its flags) first.
python examples/lekiwi/evaluate.py
```

Build an eval harness: for each object position, run 10 trials, and log per trial:
- success / fail (object grasped & lifted),
- the dominant failure mode (missed approach, slipped grasp, wrong object, stalled).
Report a success rate per position and overall.
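
One possible shape for the tallying half of that harness, assuming each trial is logged by hand as a `(position, success, failure_mode)` tuple. The `summarise` helper and the trial log are illustrative; the numbers are made up and are not results.

```python
from collections import Counter, defaultdict


def summarise(trials):
    """Aggregate trials into per-position and overall rows.

    trials: iterable of (position, success: bool, failure_mode or None).
    Returns {position: (success_rate, top_failure_mode or None)} plus an "overall" row.
    """
    by_position = defaultdict(list)
    for position, success, failure in trials:
        by_position[position].append((success, failure))

    def row(results):
        rate = sum(1 for ok, _ in results if ok) / len(results)
        failures = Counter(f for ok, f in results if not ok and f)
        top = failures.most_common(1)[0][0] if failures else None
        return rate, top

    table = {pos: row(res) for pos, res in by_position.items()}
    table["overall"] = row([r for res in by_position.values() for r in res])
    return table


# Made-up trial log for illustration only (10 trials per position).
trials = (
    [("left", True, None)] * 7 + [("left", False, "slipped grasp")] * 3
    + [("right", True, None)] * 5 + [("right", False, "missed approach")] * 4
    + [("right", False, "stalled")]
)

for position, (rate, top) in summarise(trials).items():
    print(f"{position:8s} {rate:5.0%}  top failure: {top}")
```

Keeping the failure mode as a fixed vocabulary (the four labels above) is what makes the "top failure" column comparable across checkpoints.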

Gotcha (jerky motion): If the robot stutters, open the checkpoint's config.json and confirm n_action_steps equals the chunk length (~50), not 1. Executing one step per inference defeats action chunking and async serving, and produces stop-start motion.
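
That check is easy to script. The sketch assumes the chunk length is stored under a `chunk_size` key next to `n_action_steps`; confirm both key names against your own config.json, since they are an assumption here, as are the helper names.

```python
import json
from pathlib import Path


def chunking_ok(config: dict) -> bool:
    """True when the policy executes a whole chunk per inference call."""
    n_action_steps = config.get("n_action_steps")
    chunk_size = config.get("chunk_size")   # assumed key name for the chunk length
    return n_action_steps is not None and n_action_steps == chunk_size


def load_config(checkpoint_dir: str) -> dict:
    """Read config.json from a checkpoint directory (the path layout is up to you)."""
    return json.loads((Path(checkpoint_dir) / "config.json").read_text())


# Inline examples so the sketch runs without a checkpoint on disk.
print(chunking_ok({"n_action_steps": 50, "chunk_size": 50}))   # whole chunk per call
print(chunking_ok({"n_action_steps": 1, "chunk_size": 50}))    # the stop-start case
```

In practice you would call `chunking_ok(load_config(...))` on the checkpoint you are about to deploy, before starting the Pi host.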

Expected: The Kiwi drives up to the prompted object and grasps it; the harness prints a success-rate table per object position with the top failure mode.

Lab · Async server: Wrap the policy for async inference

track1_body/policy_server.py wraps the trained checkpoint as an async policy server: it runs the slow SmolVLA forward pass (the PREDICT worker from Fig. 3) and streams action chunks to the body, which plays them on the fast EXECUTE loop. This is the same interface the orchestrator's body_client calls on Page 6, where the body is the “fast” half of the creature's fast/slow split.

How this maps to my build: the server is the seam between Track 1 (Body) and Track 3 (Creature): train here, serve async, and Page 6's orchestrator just sends a language goal + reads back chunks.

Track 1 · Done when

Three hosts installed; lerobot-train --help runs and version reads 0.4.x.
Follower calibrated under the shared --robot.id; leader calibrated under its own --teleop.id.
Teleop works: arm mirrors the leader, base drives on W/A/S/D.
≥ 50 episodes recorded, each a single drive-then-grasp (Invariant B); replay.py validates them.
SmolVLA fine-tuned; checkpoint smoke-tested.
Deployed via the async policy server with n_action_steps = chunk length.
Language-prompted drive-and-pick succeeds ≥ 60% over 10 trials.

References

Pipeline & models

Papers

ACT / ALOHA — arXiv:2304.13705
π*0.6 — arXiv:2511.14759
Lynch et al. — Language-conditioned imitation / Play-LMP
Reed et al. — Gato, a generalist agent