Module 4 · Theory-deep · maps to ETH Lecture 8

World Models

RSSM, latent imagination, and learning to dream. A world model learns the environment's dynamics so an agent can plan and learn inside its own head — instead of paying for every lesson with slow, fragile, expensive real-robot time.

Honest scope
We are not training a world model from scratch in 14 days. This page is for understanding (plus one small runnable lab). The payoff is conceptual: it explains why our async VLA + fast/slow architecture is shaped the way it is, and it marks the upgrade path for "what if we had more than 14 days."

A · What a world model is, and why robots need one

A world model is a learned simulator of the environment's dynamics. Give it the current latent state and a candidate action, and it predicts the next state, the reward, and whether the episode continues — without touching the real world. Once you have that, the agent can generate millions of trajectories by imagination and train its policy on them.
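As a rough sketch of that contract, a world model can be treated as a function from (latent, action) to (next latent, predicted reward, predicted continue flag), and imagination is just calling it in a loop. Everything below is hypothetical: `toy_world_model`, `imagine`, and the 1-D dynamics are hand-written for illustration, whereas a real world model would be a learned network.

```python
from typing import Callable, List, Tuple

# Hypothetical toy "world model": a 1-D point that should reach the origin.
# Hand-written for illustration; a real one would be learned from data.
def toy_world_model(z: float, a: float) -> Tuple[float, float, bool]:
    z_next = z + a                  # predicted next latent state
    r_hat = -abs(z_next)            # predicted reward: closer to 0 is better
    c_hat = abs(z_next) > 0.05      # predicted "episode continues" flag
    return z_next, r_hat, c_hat

def imagine(model, policy: Callable[[float], float],
            z0: float, horizon: int) -> List[Tuple[float, float]]:
    """Roll the model forward without touching any real environment."""
    z, trajectory = z0, []
    for _ in range(horizon):
        a = policy(z)
        z, r_hat, c_hat = model(z, a)
        trajectory.append((z, r_hat))
        if not c_hat:               # the model predicts the episode has ended
            break
    return trajectory

# A simple proportional policy: move halfway toward the origin each step.
trajectory = imagine(toy_world_model, lambda z: -0.5 * z, z0=1.0, horizon=10)
print(len(trajectory), trajectory[-1])  # 5 (0.03125, -0.03125)
```

The policy never sees a real observation here; it only sees what the model predicts, which is the property the rest of this page builds on.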

For robotics this is the whole game, because real experience is the scarcest resource you have:

Slow. A real grasp takes seconds; a simulated/imagined one takes microseconds. You cannot collect millions of real episodes.
Fragile. Exploration breaks hardware. Dreaming a bad action costs nothing; doing it can snap a gripper.
Expensive. Every real rollout needs a human babysitting resets. Imagination needs only GPU.
[Figure A: two control loops side by side. Model-free / real loop (costly): the policy π(a|x) acts on pixels in the real environment (robot + world), receiving obs x and reward r and sending action a; slow, fragile, 1 sample per real step. Model-based / imagination loop (cheap): an actor-critic trains in latent space against the learned world model, whose prior p(z'|z,a) yields r̂ and ĉ; fast, safe, millions of samples per GPU-hour.]
Figure A: The real loop pays for every sample with hardware time. The imagination loop runs the same control loop, but the "environment" is a learned model rolling forward in latent space.
Axis | Model-free (e.g. PPO on raw experience) | Model-based (world model)
What it learns | A policy/value directly from reward. | The dynamics; the policy is trained on imagined rollouts.
Sample efficiency | Low: needs huge real interaction. | High: reuses each real sample to fit the model, then dreams.
Failure mode | Burns real episodes to explore. | Compounding model error; the dream drifts from reality.
Robotics fit | Painful: hardware is the bottleneck. | Natural: minimise real interaction, maximise GPU.
Maps to my build: The reason world models matter to ANIMA-Kiwi is the predict-without-observing trick. That same trick (running a fast loop on a predicted state while a slow loop catches up) is exactly what our async VLA exploits at inference time. Learn it here, reuse the intuition on the body.

B · The RSSM — Recurrent State-Space Model

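The Dreamer family's core idea is to split the latent state into two paths so the model gets the best of both worlds:

Deterministic path h_t. A recurrent (GRU) state that carries memory of the past reliably across many steps.
Stochastic path z_t. A sampled latent that captures what the past alone cannot pin down about the current step.

The combined model state is s_t = (h_t, z_t). Everything is decoded from that pair. Four learned components do the work: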

[Figure B: RSSM diagram. (1) Recurrent GRU carrying the deterministic memory h_t forward to h_{t+1}, fed (z_{t−1}, a_{t−1}). (2) Posterior q(z_t | h_t, x_t), which receives the observation x_t. (3) Prior p(z_t | h_t), which receives no observation; the KL term pulls the prior toward the posterior. (4) Decoders from (h_t, z_t): observation p(x̂_t | h_t, z_t), reward p(r̂_t | ·), continue p(ĉ_t | ·). Training uses the posterior (sees x); imagination uses the prior (does not).]
Figure B: The four RSSM components. The deterministic GRU carries memory; the posterior and prior both produce z_t, but only the posterior peeks at the observation. The KL term drags the prior toward the posterior so the prior can stand alone at dream time.
1 · Recurrent model
Deterministic GRU. h_t = f(h_{t−1}, z_{t−1}, a_{t−1}). A stable, differentiable memory of the past.
2 · Representation / posterior
q(z_t | h_t, x_t). Infers the stochastic latent using the current observation. The "encoder" half.
3 · Transition / prior
p(z_t | h_t). Predicts the latent without the observation. This is the part that lets us dream.
4 · Decoders
From (h_t, z_t), reconstruct the observation x̂_t, the reward r̂_t, and the continue/discount flag ĉ_t.
Why two ways to get z?
At training time you have the image, so the posterior gives you the most accurate z. At imagination time there is no image — so you must rely on the prior. The objective forces these two to agree, which is the entire trick. Keep this in mind for section C.
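To make the data flow of the four components concrete, here is a minimal NumPy sketch of one RSSM step. It is an illustration under stated assumptions, not the Dreamer implementation: the weights are random and untrained, the latent is a diagonal Gaussian (recent Dreamer versions use categorical latents), and the sizes and helper names (`linear`, `gaussian`, `decode`) are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
H, Z, A, X = 8, 4, 1, 3   # sizes of h, z, action, observation (arbitrary)

def linear(n_in, n_out):
    """Random, untrained weights: this sketch only shows the data flow."""
    return rng.normal(0, 0.1, (n_in, n_out)), np.zeros(n_out)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# (1) Recurrent model: a GRU cell over [h_{t-1}, z_{t-1}, a_{t-1}]
Wu, bu = linear(H + Z + A, H)   # update gate
Wr, br = linear(H + Z + A, H)   # reset gate
Wc, bc = linear(H + Z + A, H)   # candidate state

def gru(h, z, a):
    u = sigmoid(np.concatenate([h, z, a]) @ Wu + bu)
    r = sigmoid(np.concatenate([h, z, a]) @ Wr + br)
    cand = np.tanh(np.concatenate([r * h, z, a]) @ Wc + bc)
    return (1 - u) * h + u * cand

# (2) Posterior q(z_t | h_t, x_t) and (3) prior p(z_t | h_t)
Wq, bq = linear(H + X, 2 * Z)
Wp, bp = linear(H, 2 * Z)

def gaussian(params):
    mean, raw_std = np.split(params, 2)
    return mean, np.log1p(np.exp(raw_std)) + 0.1   # softplus keeps std > 0

def posterior(h, x):
    return gaussian(np.concatenate([h, x]) @ Wq + bq)

def prior(h):
    return gaussian(h @ Wp + bp)

def sample(mean, std):
    return mean + std * rng.normal(size=mean.shape)

# (4) Decoder heads read from the pair (h_t, z_t)
Wx, bx = linear(H + Z, X)
Wrew, brew = linear(H + Z, 1)
Wcont, bcont = linear(H + Z, 1)

def decode(h, z):
    s = np.concatenate([h, z])
    return s @ Wx + bx, (s @ Wrew + brew)[0], sigmoid(s @ Wcont + bcont)[0]

# One step: previous state (h, z), previous action a, new observation x
h_prev, z_prev = np.zeros(H), np.zeros(Z)
a_prev, x_t = np.array([0.5]), np.array([1.0, 0.0, 0.0])
h_t = gru(h_prev, z_prev, a_prev)
z_post = sample(*posterior(h_t, x_t))   # training path: sees x_t
z_prior = sample(*prior(h_t))           # imagination path: does not
x_hat, r_hat, c_hat = decode(h_t, z_post)
```

The posterior and prior share h_t and differ only in whether x_t is an input, which is what lets the prior stand in for the posterior at dream time.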

C · The learning objective — an ELBO that makes prior match posterior

The world model is a sequential latent-variable model trained by maximising a variational lower bound (ELBO) — equivalently, minimising a loss with two parts: reconstruct what you saw, and make the prior predict the posterior.

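$$\mathcal{L}(\theta) = \sum_t \Big( -\,\mathbb{E}_{q}\big[ \ln p(x_t \mid h_t, z_t) + \ln p(r_t \mid h_t, z_t) + \ln p(c_t \mid h_t, z_t) \big] \;+\; \beta \, \mathrm{KL}\big( q(z_t \mid h_t, x_t) \,\|\, p(z_t \mid h_t) \big) \Big)$$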

The first (expectation) term is the reconstruction loss: the model must explain the observation, reward, and continue flag. The second term, weighted by β, is the KL that ties the prior to the posterior.

In words: the reconstruction term says the model's decoded image, reward, and continue signal — read off the latent — must match what actually happened. That forces the latent z to carry real information about the world.

The KL term is the heart of it. The posterior q is allowed to look at the image and so produces a good z. The prior p must produce the same distribution without the image, using only the recurrent state h_t. Minimising KL(q ‖ p) trains the prior to predict the next latent blind. That is precisely the capability you exploit at imagination time: roll the prior forward in latent space, decoding rewards as you go, and you never need a pixel again.

KL balancing & free bits (high level)
Naively, the KL can be minimised the wrong way: by making the posterior lazy instead of the prior smart. Dreamer fixes this with KL balancing: weight the gradient so the prior moves toward the posterior faster than the posterior collapses toward the prior (e.g. 0.8 / 0.2). Free bits clamp the KL at a small floor (~1 nat): once the divergence falls below the floor it contributes no gradient, so the model does not waste capacity squeezing out an already-tiny divergence, which helps prevent posterior collapse.
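A small numeric sketch can make both tricks tangible. In an autodiff framework, KL balancing is commonly written with stop-gradients as α · KL(sg(q) ‖ p) + (1 − α) · KL(q ‖ sg(p)); the pure-Python version below imitates that by hand for two unit-variance Gaussians, using the analytic gradient of the KL with respect to each mean. The values of `alpha`, `lr`, and the means are arbitrary choices for illustration, and `kl_gauss` and `free_bits` are hypothetical helper names.

```python
import math

def kl_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) in nats."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)

def free_bits(kl, floor=1.0):
    """Clamp the KL at a floor; below it the loss is constant (no gradient)."""
    return max(kl, floor)

# KL balancing on the means only, with both standard deviations fixed at 1.
# Then KL = (mu_q - mu_p)^2 / 2, so
#   dKL/dmu_p = -(mu_q - mu_p)   and   dKL/dmu_q = +(mu_q - mu_p).
alpha, lr = 0.8, 0.1          # alpha is the prior's share of the gradient
mu_q, mu_p = 3.0, 0.0
diff = mu_q - mu_p
new_mu_p = mu_p + lr * alpha * diff         # prior takes the big step toward q
new_mu_q = mu_q - lr * (1 - alpha) * diff   # posterior takes the small step

print(kl_gauss(mu_q, 1.0, mu_p, 1.0))            # 4.5 nats before the update
print(new_mu_p, new_mu_q)                        # prior moved 4x as far
print(free_bits(kl_gauss(0.5, 1.0, 0.0, 1.0)))   # 0.125 is under the floor -> 1.0
```

With α = 0.8 the prior covers four times the distance the posterior does in a single step, which is the "prior smart, posterior not lazy" behaviour the balancing is meant to encourage.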
Maps to my build: "Predict the next latent without the observation" is the model-side mirror of our control-side fast/slow split: a fast loop runs on a predicted state, a slow loop corrects with real perception when it arrives. Same idea, two layers of the stack.

D · Latent imagination — training the agent inside the dream

Once the world model is trained, we freeze it and train an actor–critic entirely on rollouts the model generates. Starting from real latent states (encoded from a replay buffer), we unroll the prior forward: the actor proposes a_t, the prior gives z_{t+1}, the reward head gives r̂_t, and the critic estimates returns that are backed up along the chain. No pixels. No real robot. Pure latent space.

[Figure C: imagined latent trajectory z_t → z_{t+1} → z_{t+2} → z_{t+3} (prior-only rollout from a real start state), driven by actions a_t, a_{t+1}, a_{t+2}, with predicted rewards r̂_t, r̂_{t+1}, r̂_{t+2} and the critic value V backed up right to left as λ-returns over the dream.]
Figure C: Latent imagination. The prior unrolls forward, the actor chooses each action, the reward head scores each step, and the critic's value estimate is backed up across the imagined horizon.

This is "background reasoning" in the most literal sense: the agent improves its policy by thinking through consequences in a compressed internal model, the way you rehearse a tricky move in your head before attempting it. It is planning and learning decoupled from acting.
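The value backup over the dream (the λ-returns in Figure C) can be sketched in a few lines. This assumes one common convention, R_t = r_t + γ · c_t · ((1 − λ) · V_{t+1} + λ · R_{t+1}) with the last return bootstrapped from the critic; indexing conventions differ between implementations, and the function name and the numbers are made up for illustration.

```python
def lambda_returns(rewards, continues, values, gamma=0.99, lam=0.95):
    """
    rewards[t], continues[t]: predicted reward and continue flag at imagined step t.
    values: critic estimates with one extra bootstrap entry,
            so len(values) == len(rewards) + 1.
    """
    horizon = len(rewards)
    assert len(values) == horizon + 1 and len(continues) == horizon
    returns = [0.0] * horizon
    next_return = values[-1]          # bootstrap from the critic at the end
    for t in reversed(range(horizon)):
        # Blend the critic's one-step estimate with the longer return.
        blended = (1 - lam) * values[t + 1] + lam * next_return
        returns[t] = rewards[t] + gamma * continues[t] * blended
        next_return = returns[t]
    return returns

rewards = [1.0, 1.0, 1.0]
continues = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0, 10.0]
print(lambda_returns(rewards, continues, values, gamma=1.0, lam=1.0))  # [13.0, 12.0, 11.0]
print(lambda_returns(rewards, continues, values, gamma=1.0, lam=0.0))  # [1.0, 1.0, 11.0]
```

With λ = 1 the target trusts the whole imagined reward sequence plus the final bootstrap; with λ = 0 it trusts only the critic one step ahead. Values in between trade model error against critic error.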

Connecting to the course's fast/slow theme
Every module here returns to one idea: separate the loop that acts now from the loop that deliberates. World models give the cleanest version — deliberation happens entirely off the robot, in latent space, at GPU speed. The async VLA is the on-robot cousin of the same decoupling.

E · World models for manipulation & the frontier

Two research directions are converging on the same bet — predicting the future is the universal pretext task — and they bracket where VLAs and world models meet.

UniPi — text-guided video as a universal policy

pixel-space planning
Du et al., "Learning Universal Policies via Text-Guided Video Generation"

Reframes decision-making as conditional video generation: given a language goal and the current frame, a diffusion model generates the future video of the task being done; an inverse-dynamics model then reads actions out of the predicted frames. Because it plans in pixels, one model transfers across embodiments and tasks — the "world model" and the "policy" become the same generative object. The cost is heavy generation and the usual video-prediction artifacts.

Maps to my build: the same predict-then-act spirit as our VLA; shows that pixels can be the plan.

Scalable world models (Dreamer line)

latent-space, recent
Hafner et al., "Training Agents Inside of Scalable World Models" (DreamerV4-style)

Pushes the Dreamer recipe to scale: a more capable latent world model trained on diverse data, with agents learned inside it via imagination. This is the line that took Dreamer from toy control to broad, long-horizon tasks. The headline is generality and sample efficiency from a single learned model rather than bespoke simulators. See arxiv.org/abs/2509.24527.

Maps to my build: the exact upgrade path. Same learn-in-imagination ideas, just far more compute and data than a sprint allows.

DreamZero — world-model frontier

frontier
Ye et al., "DreamZero / DreamerZero"

Represents the cutting edge of dreaming agents — pushing latent imagination toward stronger, more autonomous policy learning with minimal real interaction, blurring the remaining line between a "world model" and a "general agent." The takeaway for us is directional rather than reproducible: the field is consolidating on learned models as the training substrate, with VLAs and world models steadily merging into a single predict-the-future objective.

Maps to my build: confirms the direction. Predict-the-future is becoming the shared pretext for body + brain.

Card summaries are kept under ~200 words each and describe the idea; verify exact titles/results against the source before citing in writing.

So where could this take ANIMA-Kiwi beyond the sprint? Honestly: not far in 14 days, but the path is clear. A learned world model would let us do imagination-based data augmentation (dream extra grasp attempts instead of collecting them by hand) and cheap success prediction (roll the model forward a few steps to ask "will this grasp succeed?" before committing the arm). Those are exactly the upgrades you reach for once the creature works and you have more than two weeks.

Maps to my build: VLAs and world models are converging on one objective, predicting the future. Understanding RSSM is why we accept SmolVLA conditioning on state + vision: it is a (much smaller) bet that the model's internal state should already encode enough dynamics to choose the next action.

F · How this maps to my build — the three takeaways

async VLA

The prior/posterior split is the predict-without-observation idea our async inference exploits: act on a predicted state, correct when perception lands.

why state+vision

RSSM clarifies why SmolVLA conditions on both proprioceptive state and vision — the internal state must carry enough dynamics to act, not just classify a frame.

upgrade path

World models are the "more than 14 days" tier: imagination-based augmentation and grasp-success prediction, built on the very recipe taught here.

Lab · Dream a tiny world

Small, runnable, modest compute. No robot. The goal is mechanical understanding of imagination — not a trained agent.

  1. Pick a toy environment. Use something tiny — a small gridworld, the classic Pendulum-v1, or the Dreamer tutorial's default task. Either run a vetted reference DreamerV3 implementation, or build a stripped-down RSSM yourself (a GRU for ht, two small MLPs for posterior q and prior p, and three decoder heads for obs/reward/continue). Collect a few thousand transitions into a replay buffer.
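    # sanity check the env first
    import gymnasium as gym

    env = gym.make("Pendulum-v1")
    obs, _ = env.reset(seed=0)
    buffer = []  # tiny replay buffer of (obs, action, reward, next_obs, done)
    # roll a random policy to fill the buffer
    for _ in range(2000):
        a = env.action_space.sample()
        next_obs, r, term, trunc, _ = env.step(a)
        buffer.append((obs, a, r, next_obs, term or trunc))
        # start a new episode once the current one ends
        obs = env.reset()[0] if (term or trunc) else next_obs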
  2. Train the RSSM on the transitions. Optimise the ELBO from section C (reconstruction + KL with KL-balancing + free bits). Then verify the model understands the world: feed real observations through the posterior and decode — compare reconstructed vs ground-truth observations side by side. If reconstructions are crisp, the latent carries the world.
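    # loss = recon(obs)+recon(reward)+recon(continue) + beta*KL(q||p)
    # recon_* are negative log-likelihoods from the three decoder heads;
    # kl_balanced is a placeholder for the KL with balancing + free bits
    optimizer.zero_grad()
    loss = recon_x + recon_r + recon_c + beta * kl_balanced(q, p)
    loss.backward()
    optimizer.step()
    # then: plot decode(posterior(real_obs)) vs real_obs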
  3. Roll the PRIOR in imagination. From a real start state, unroll the prior only for N steps (no env, no pixels) — exactly the Figure C loop. Then run the true environment from the same start with the same actions. Overlay them and watch where the dream diverges: short horizons track well; error compounds the further you imagine. That divergence is the central limitation of model-based RL made visible.
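    # imagined rollout: prior only (module names are placeholders from step 1)
    h = h0
    z = posterior(h, real_obs0)   # q(z | h, x): the only real observation used
    actions, dream = [], []
    for t in range(N):
        a = actor(h, z)
        h = gru(h, z, a)          # deterministic memory update
        z = prior(h)              # <-- no observation used
        r_hat = reward_head(h, z)
        actions.append(a)         # keep actions to replay in the true env
        dream.append((h, z, r_hat))
    # compare the dream to a true env rollout driven by the same actions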
  4. Reading-forward. Pair the lab with ETH Lecture 8 (World Models) for the derivation, and the scalable-world-models paper arxiv.org/abs/2509.24527 for where this goes at scale.
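Before running step 3 on a learned model, the compounding effect can be previewed on a toy system. The sketch below is not an RSSM: it uses a hand-written 1-D linear system as the "real" environment and the same system with a slightly wrong coefficient as the "dream", with the 0.02 offset standing in for a small one-step model error. All names and numbers are made up for illustration.

```python
# True dynamics and a slightly wrong "learned" model of the same system.
def true_step(x, a):
    return 0.9 * x + a

def model_step(x, a):
    return (0.9 + 0.02) * x + a     # small one-step error in the coefficient

def rollout(step, x0, actions):
    """Apply the same open-loop action sequence and record every state."""
    xs, x = [], x0
    for a in actions:
        x = step(x, a)
        xs.append(x)
    return xs

actions = [0.1] * 30                # identical actions for both rollouts
real = rollout(true_step, 1.0, actions)
dream = rollout(model_step, 1.0, actions)
errors = [abs(d - r) for d, r in zip(dream, real)]

for n in (1, 5, 15, 30):
    print(f"horizon {n:2d}: |dream - real| = {errors[n - 1]:.3f}")
```

The one-step error is only 0.02, yet by step 30 the gap is more than ten times larger, because each imagined step starts from an already-wrong state. That is the same pattern to look for when overlaying the prior-only rollout on the true environment.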
Version-sensitive
DreamerV3 reference repos pin specific JAX/PyTorch + Gymnasium versions; the old gym → gymnasium API change (5-tuple step) breaks older tutorials. Pin versions from the repo's lockfile and verify the env API before training.

Done when