A · What a world model is, and why robots need one
A world model is a learned simulator of the environment's dynamics. Give it the current latent state and a candidate action, and it predicts the next state, the reward, and whether the episode continues — without touching the real world. Once you have that, the agent can generate millions of trajectories by imagination and train its policy on them.
For robotics this is the whole game, because real experience is the scarcest resource you have:
| Axis | Model-free (e.g. PPO on raw experience) | Model-based (world model) |
|---|---|---|
| What it learns | A policy/value directly from reward. | The dynamics; the policy is trained on imagined rollouts. |
| Sample efficiency | Low — needs huge real interaction. | High — reuses each real sample to fit the model, then dreams. |
| Failure mode | Burns real episodes to explore. | Compounding model error; the dream drifts from reality. |
| Robotics fit | Painful — hardware is the bottleneck. | Natural — minimise real interaction, maximise GPU. |
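To make the model-based column concrete, here is a minimal sketch of the idea on a toy problem. Everything in it is an assumption for illustration: the five-state chain, the helper names (`real_step`, `collect`, `fit_model`, `imagine`), and the use of a lookup table in place of a learned neural model. It only shows the shape of the loop: gather a little real experience, fit a model, then roll out without touching the environment.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, +1)    # step left or right

def real_step(state, action):
    """The 'real world': a deterministic chain that ends at the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reached_goal = next_state == N_STATES - 1
    reward = 1.0 if reached_goal else 0.0
    return next_state, reward, not reached_goal   # (s', r, continue)

def collect(num_steps, rng):
    """Gather real transitions with a random policy."""
    data, state = [], 0
    for _ in range(num_steps):
        action = rng.choice(ACTIONS)
        next_state, reward, cont = real_step(state, action)
        data.append((state, action, next_state, reward, cont))
        state = next_state if cont else 0          # reset after the goal
    return data

def fit_model(data):
    """'Learn' the dynamics: a lookup table from (s, a) to (s', r, continue)."""
    return {(s, a): (s2, r, c) for s, a, s2, r, c in data}

def imagine(model, state, policy, horizon):
    """Roll the learned model forward; real_step is never called here."""
    trajectory = []
    for _ in range(horizon):
        action = policy(state)
        if (state, action) not in model:           # unseen pair: cannot predict
            break
        next_state, reward, cont = model[(state, action)]
        trajectory.append((state, action, reward))
        state = next_state
        if not cont:
            break
    return trajectory

rng = random.Random(0)
model = fit_model(collect(500, rng))
dream = imagine(model, state=0, policy=lambda s: +1, horizon=10)
print(dream)
```

The real environment is queried only inside `collect`. Once the table is fitted, any number of imagined trajectories can be generated from it at no extra real-world cost, which is the trade the table above describes.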
B · The RSSM — Recurrent State-Space Model
The Dreamer family's core idea is to split the latent state into two paths so the model gets the best of both worlds:
- A deterministic recurrent path $h_t$: a GRU that carries a reliable memory of everything so far, $h_t = f(h_{t-1}, z_{t-1}, a_{t-1})$.
- A stochastic latent $z_t$: captures what the deterministic path cannot know for sure (sensor noise, partial observability), modelled as a distribution.
The combined model state is $s_t = (h_t, z_t)$. Everything is decoded from that pair. Four learned components do the work:
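The component list is not reproduced in this text, so the block below is a hedged reconstruction rather than the original figure. It groups the pieces the way the Dreamer papers and the lab description later in these notes do (a GRU, a posterior, a prior, and decoder heads). The symbols $x_t$ (observation), $r_t$ (reward), $c_t$ (continue flag), and the shared parameters $\phi$ are notation assumed here.

```latex
\begin{aligned}
\text{Recurrent model:} \quad & h_t = f_\phi(h_{t-1}, z_{t-1}, a_{t-1}) \\
\text{Posterior (encoder):} \quad & z_t \sim q_\phi(z_t \mid h_t, x_t) \\
\text{Prior (dynamics):} \quad & \hat{z}_t \sim p_\phi(\hat{z}_t \mid h_t) \\
\text{Decoder heads:} \quad & \hat{x}_t,\ \hat{r}_t,\ \hat{c}_t \sim p_\phi(\,\cdot \mid h_t, z_t)
\end{aligned}
```

The only difference between the posterior and the prior is whether $x_t$ appears on the conditioning side, which is what makes the prior usable when no observation is available.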
C · The learning objective — an ELBO that makes prior match posterior
The world model is a sequential latent-variable model trained by maximising a variational lower bound (ELBO) — equivalently, minimising a loss with two parts: reconstruct what you saw, and make the prior predict the posterior.
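Written out, a loss of this shape could look like the following. This is a sketch under assumed notation ($x_t$ observation, $r_t$ reward, $c_t$ continue flag, $\beta$ a KL weight, $\phi$ the model parameters), not a transcription of the original equation.

```latex
\mathcal{L}(\phi) =
\mathbb{E}_{q_\phi}\!\left[\,
\sum_{t=1}^{T}
\underbrace{-\ln p_\phi(x_t \mid h_t, z_t)
            -\ln p_\phi(r_t \mid h_t, z_t)
            -\ln p_\phi(c_t \mid h_t, z_t)}_{\text{reconstruction}}
\;+\;
\underbrace{\beta\,\mathrm{KL}\!\left[\, q_\phi(z_t \mid h_t, x_t) \,\big\|\, p_\phi(z_t \mid h_t) \,\right]}_{\text{prior matches posterior}}
\right]
```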
The first part is reconstruction: the model must explain the observation, reward, and continue flag. The second is the KL that ties the prior to the posterior.
In words: the reconstruction term says the model's decoded image, reward, and continue signal — read off the latent — must match what actually happened. That forces the latent z to carry real information about the world.
The KL term is the heart of it. The posterior q is allowed to look at the image and so produces a good z. The prior p must produce the same distribution without the image, using only the recurrent state $h_t$. Minimising $\mathrm{KL}(q \,\|\, p)$ trains the prior to predict the next latent blind. That is precisely the capability you exploit at imagination time: roll the prior forward in latent space, decoding rewards as you go, and you never need a pixel again.
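For a feel of what the KL term measures, the sketch below computes it in closed form. It assumes diagonal Gaussian latents (Dreamer variants differ on this; some use categorical latents), and the function names, example numbers, and the `free_bits` floor are illustrative choices rather than anything fixed by the text. Stop-gradient tricks such as KL balancing need an autograd framework and are not shown.

```python
import math

def gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL(q || p) for diagonal Gaussians, summed over latent dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p):
        total += math.log(sp / sq) + (sq**2 + (mq - mp) ** 2) / (2 * sp**2) - 0.5
    return total

def kl_loss(mu_q, std_q, mu_p, std_p, beta=1.0, free_bits=1.0):
    """Free bits: the KL is not pushed below `free_bits` nats."""
    kl = gaussian_kl(mu_q, std_q, mu_p, std_p)
    return beta * max(kl, free_bits)

# Posterior: has seen the image, so it is confident and off-centre.
post_mu, post_std = [0.5, -0.2], [0.5, 0.5]
# An untrained prior (standard normal) versus one that has learned to anticipate.
blind_mu, blind_std = [0.0, 0.0], [1.0, 1.0]
tuned_mu, tuned_std = [0.4, -0.1], [0.6, 0.6]

kl_blind = gaussian_kl(post_mu, post_std, blind_mu, blind_std)
kl_tuned = gaussian_kl(post_mu, post_std, tuned_mu, tuned_std)
print(f"untrained prior: {kl_blind:.3f} nats, better prior: {kl_tuned:.3f} nats")
```

The KL shrinks as the prior's guess approaches what the posterior inferred from the image, which is the training signal the paragraph above describes.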
D · Latent imagination — training the agent inside the dream
Once the world model is trained, we freeze it and train an actor–critic entirely on rollouts the model generates. Starting from real latent states (encoded from a replay buffer), we unroll the prior forward: the actor proposes $a_t$, the prior gives $z_{t+1}$, the reward head gives $\hat{r}_t$, and the critic estimates returns that back up the chain. No pixels. No real robot. Pure latent space.
This is "background reasoning" in the most literal sense: the agent improves its policy by thinking through consequences in a compressed internal model, the way you rehearse a tricky move in your head before attempting it. It is planning and learning decoupled from acting.
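One common way to turn an imagined rollout into critic targets is the λ-return, which blends the predicted rewards with the critic's own bootstrap estimates. The text above only says that returns "back up the chain", so treating λ-returns as the mechanism is an assumption, and the function name, argument layout, and default values below are illustrative.

```python
def lambda_returns(rewards, values, continues, gamma=0.99, lam=0.95):
    """Backward recursion over an imagined rollout of length H.

    rewards[t], continues[t]: predicted reward and continue flag at step t.
    values: critic estimates for states 0..H (length H + 1); values[H] bootstraps.
    """
    horizon = len(rewards)
    assert len(values) == horizon + 1 and len(continues) == horizon
    returns = [0.0] * horizon
    next_return = values[-1]
    for t in reversed(range(horizon)):
        # Mix the one-step bootstrap with the longer return computed so far.
        blended = (1 - lam) * values[t + 1] + lam * next_return
        returns[t] = rewards[t] + gamma * continues[t] * blended
        next_return = returns[t]
    return returns

# Three imagined steps, each with predicted reward 1, and a bootstrap value of 10.
targets = lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 10.0], [1.0, 1.0, 1.0],
                         gamma=1.0, lam=1.0)
print(targets)
```

With `lam=1.0` the targets are pure imagined returns plus the final bootstrap; with `lam=0.0` each target trusts the critic after a single step. Values in between trade model error against critic error.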
E · World models for manipulation & the frontier
Two research directions are converging on the same bet — predicting the future is the universal pretext task — and they bracket where VLAs and world models meet.
UniPi — text-guided video as a universal policy
pixel-space planning
Reframes decision-making as conditional video generation: given a language goal and the current frame, a diffusion model generates the future video of the task being done; an inverse-dynamics model then reads actions out of the predicted frames. Because it plans in pixels, one model transfers across embodiments and tasks, so the "world model" and the "policy" become the same generative object. The cost is heavy generation and the usual video-prediction artifacts.
Scalable world models (Dreamer line)
latent-space, recent
Pushes the RSSM recipe to scale: a more capable latent world model trained on diverse data, with agents learned inside it via imagination. This is the line that took Dreamer from toy control to broad, long-horizon tasks. The headline is generality and sample efficiency from a single learned model rather than bespoke simulators. See arxiv.org/abs/2509.24527.
DreamZero — world-model frontier
frontier
Represents the cutting edge of dreaming agents: pushing latent imagination toward stronger, more autonomous policy learning with minimal real interaction, blurring the remaining line between a "world model" and a "general agent." The takeaway for us is directional rather than reproducible: the field is consolidating on learned models as the training substrate, with VLAs and world models steadily merging into a single predict-the-future objective.
Card summaries are kept under ~200 words each and describe the idea; verify exact titles/results against the source before citing in writing.
So where could this take ANIMA-Kiwi beyond the sprint? Honestly: not far in 14 days, but the path is clear. A learned world model would let us do imagination-based data augmentation (dream extra grasp attempts instead of collecting them by hand) and cheap success prediction (roll the model forward a few steps to ask "will this grasp succeed?" before committing the arm). Those are exactly the upgrades you reach for once the creature works and you have more than two weeks.
F · How this maps to my build — the three takeaways
- The prior/posterior split is the predict-without-observation idea our async inference exploits: act on a predicted state, correct when perception lands.
- RSSM clarifies why SmolVLA conditions on both proprioceptive state and vision: the internal state must carry enough dynamics to act, not just classify a frame.
- World models are the "more than 14 days" tier: imagination-based augmentation and grasp-success prediction, built on the very recipe taught here.
Small, runnable, modest compute. No robot. The goal is mechanical understanding of imagination — not a trained agent.
- Pick a toy environment. Use something tiny: a small gridworld, the classic Pendulum-v1, or the Dreamer tutorial's default task. Either run a vetted reference DreamerV3 implementation, or build a stripped-down RSSM yourself (a GRU for $h_t$, two small MLPs for posterior q and prior p, and three decoder heads for obs/reward/continue). Collect a few thousand transitions into a replay buffer.

```python
# Sanity-check the env first.
import gymnasium as gym

env = gym.make("Pendulum-v1")
obs, _ = env.reset()

# Roll a random policy to fill a tiny replay buffer.
buffer = []  # a plain list stands in for a real replay buffer
for _ in range(2000):
    a = env.action_space.sample()
    obs, r, term, trunc, _ = env.step(a)
    buffer.append((obs, a, r, term or trunc))
    if term or trunc:  # Pendulum-v1 is time-limited, so reset between episodes
        obs, _ = env.reset()
```
- Train the RSSM on the transitions. Optimise the ELBO from section C (reconstruction + KL with KL-balancing + free bits). Then verify the model understands the world: feed real observations through the posterior and decode, then compare reconstructed vs ground-truth observations side by side. If reconstructions are crisp, the latent carries the world.

```python
# Sketch only: recon_x, recon_r, recon_c, beta and kl_balanced are placeholders
# for your own reconstruction losses, KL weight, and balanced-KL helper.
# loss = recon(obs) + recon(reward) + recon(continue) + beta * KL(q || p)
loss = recon_x + recon_r + recon_c + beta * kl_balanced(q, p)
loss.backward()
# Then: plot decode(posterior(real_obs)) vs real_obs.
```
- Roll the PRIOR in imagination. From a real start state, unroll the prior only for N steps (no env, no pixels), exactly the Figure C loop. Then run the true environment from the same start with the same actions. Overlay them and watch where the dream diverges: short horizons track well; error compounds the further you imagine. That divergence is the central limitation of model-based RL made visible.

```python
# Imagined rollout: prior only.
# Sketch only: posterior, actor, gru, prior, reward_head, h0 and N are placeholders.
z, h = posterior(real_obs0), h0   # the only step that sees an observation
for t in range(N):
    a = actor(h, z)
    h = gru(h, z, a)
    z = prior(h)                  # <-- no observation used
    r_hat = reward_head(h, z)
# Compare the trajectory to a true env rollout with the same actions.
```
- Reading-forward. Pair the lab with ETH Lecture 8 (World Models) for the derivation, and the scalable-world-models paper arxiv.org/abs/2509.24527 for where this goes at scale.
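The divergence described in the imagination step can be previewed before any RSSM is trained. The sketch below uses a hypothetical scalar system in place of both the environment and the learned model: the "real" dynamics and the "model" differ only by a small coefficient error, and both are driven by the same open-loop action sequence. The coefficients, horizon, and names are illustrative assumptions.

```python
def rollout(coeff, x0, actions):
    """Open-loop rollout of the scalar system x' = coeff * x + a."""
    xs, x = [x0], x0
    for a in actions:
        x = coeff * x + a
        xs.append(x)
    return xs

TRUE_COEFF = 0.90      # stands in for the real environment
MODEL_COEFF = 0.95     # a learned model that is slightly wrong
actions = [1.0] * 20   # the same action sequence is fed to both

real = rollout(TRUE_COEFF, 0.0, actions)
dream = rollout(MODEL_COEFF, 0.0, actions)
errors = [abs(d - r) for d, r in zip(dream, real)]

for t in (1, 5, 10, 20):
    print(f"step {t:2d}: |dream - real| = {errors[t]:.3f}")
```

The one-step prediction is exact here, yet the gap keeps widening with horizon because each imagined state is fed back into the slightly wrong model. The same overlay on a trained RSSM is what the lab step asks you to plot.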
gym → gymnasium API change (5-tuple step) breaks older tutorials. Pin versions from the repo's lockfile and verify the env API before training.
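One way to check the env API before training is to normalise whatever `step` returns into the 5-tuple form. The helper below is a minimal sketch, not part of either library: `step_compat`, `OldStyleEnv`, and `NewStyleEnv` are hypothetical names, and reading the `TimeLimit.truncated` key from `info` assumes the older time-limit wrapper's behaviour, which is worth confirming for the version you pin.

```python
def step_compat(env, action):
    """Return (obs, reward, terminated, truncated, info) for either step API."""
    result = env.step(action)
    if len(result) == 5:                     # gymnasium-style 5-tuple
        return tuple(result)
    if len(result) == 4:                     # classic gym: (obs, reward, done, info)
        obs, reward, done, info = result
        # Assumption: older time-limit wrappers flag truncation in `info`.
        truncated = bool(info.get("TimeLimit.truncated", False))
        return obs, reward, bool(done) and not truncated, truncated, info
    raise ValueError(f"unexpected step() result of length {len(result)}")

class OldStyleEnv:
    """Stand-in for a classic gym env that hit its time limit."""
    def step(self, action):
        return 0.0, 1.0, True, {"TimeLimit.truncated": True}

class NewStyleEnv:
    """Stand-in for a gymnasium env that hit its time limit."""
    def step(self, action):
        return 0.0, 1.0, False, True, {}

old_result = step_compat(OldStyleEnv(), action=0)
new_result = step_compat(NewStyleEnv(), action=0)
print(old_result[2:4], new_result[2:4])
```

Keeping terminated and truncated separate matters for the world model's continue head: a time-limit cutoff is not a true end of episode, so it should not be learned as one.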
Done when
- You can explain the prior-matches-posterior KL — why the prior must predict the latent without the image.
- You can describe, mechanically, what "imagination" is: unrolling the prior in latent space, decoding rewards, training the actor–critic on it.
- You can say plainly why we are NOT training a world model in this sprint — data, compute, and time — and what we use the understanding for instead.