Module 6 · Integration

The Creature

Where all three tracks fuse. A small asyncio hub turns a spoken command into an expressive act: Track 1 · Body drives and grasps, Track 2 · Voice listens and speaks, and the orchestrator — Track 3 · Creature — gives it a phase, an emotion, and a will.

Track 3 · Creature

A. The integration challenge

Lead invariants and the fast/slow seam, made concrete as glue code.

We have spent five modules building three subsystems in isolation. Integration is not a new capability — it is the discipline of wiring what we have without violating the two invariants that have governed every page so far.

Invariant A — tethered compute
The robot's brain does not live on the robot. The Pi (host) streams observations and receives actions; the SmolVLA policy runs on a GPU policy_server (module 3). Every integration decision must survive a network hop and its latency.
Invariant B — navigation lives in the action space
The LeKiwi base velocities are part of the policy's action vector. The model drives as naturally as it moves the arm. We do not bolt a separate navigation stack on by default — locomotion is a learned action (module 3).
Fast / slow decoupling — the glue principle
The slow loop is embodied control: perceive → act at the policy's cadence, in seconds-long episodes, with blocking I/O and GPU-bound inference. The fast loop is presence: the face animates and the voice responds at interactive latency, on the order of tens to a few hundred milliseconds. If the two share a thread, a slow grasp freezes the creature's face and it dies as a character. The orchestrator's entire job is to let the slow loop report into the fast loop without ever blocking it.
Concretely: the body client runs the slow VLA loop and emits small {phase} events; the affect layer maps each phase to an {emotion, say} pair that the fast loop consumes.
[Figure: hub-and-spoke diagram. orchestrator.py (asyncio websocket hub, state machine + broadcast) sits in the center, connected to three clients: /voice (Moshi + intent watcher), /body (SmolVLA eval loop) and /face (expressive renderer). Arrows: voice → hub: intent json; hub → body: task string; body → hub: phase events; hub → face: emotion / say; hub → voice: say (optional). Color key: Track 1 · Body (clay), Track 2 · Voice (teal), Track 3 · Creature (amber).]
The seam. Intent flows from Voice into the hub; the hub dispatches a task to Body, Body streams phases back, and the hub broadcasts emotion+speech to Face (and optionally Voice).
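The decoupling principle can be tried without any robot. The sketch below is a minimal illustration, not the project's code: `slow_grasp_step` is a hypothetical stand-in for one blocking policy step, and an `asyncio.Queue` stands in for the websocket bus. The slow loop pushes its blocking work onto a worker thread and only reports small phase events; the fast loop keeps rendering a frame every 10 ms regardless.

```python
import asyncio
import time

def slow_grasp_step() -> str:
    """Hypothetical stand-in for one blocking, GPU-bound policy step."""
    time.sleep(0.05)                      # blocking on purpose
    return "grasping"

async def slow_loop(events: asyncio.Queue) -> None:
    """Slow loop: run blocking work off the event loop, report phase events."""
    for _ in range(3):
        phase = await asyncio.to_thread(slow_grasp_step)
        await events.put({"phase": phase})
    await events.put({"phase": "reporting"})

async def fast_loop(events: asyncio.Queue, frames: list) -> None:
    """Fast loop: render every 10 ms, picking up phase events when present."""
    phase = "idle"
    while phase != "reporting":
        while not events.empty():         # never wait on the slow loop
            phase = events.get_nowait()["phase"]
        frames.append(phase)              # "render" one frame
        await asyncio.sleep(0.01)

async def demo() -> list:
    events: asyncio.Queue = asyncio.Queue()
    frames: list = []
    await asyncio.gather(slow_loop(events), fast_loop(events, frames))
    return frames

frames = asyncio.run(demo())
print(len(frames), "frames rendered; last phase:", frames[-1])
```

Replacing `asyncio.to_thread(slow_grasp_step)` with a direct `slow_grasp_step()` call is a quick way to see the failure mode: the frame count collapses because the renderer cannot run while the grasp blocks the loop.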

B. The orchestrator (track3_creature/orchestrator.py)

An asyncio + websockets hub whose centerpiece is a small, explicit state machine.

[Figure: state machine diagram. States: idle, listening, planning, searching, found, navigating, grasping, reporting, not_found, failed. Terminal transitions: reporting / not_found / failed → idle.]
State machine. Green = success terminal; amber border = warning terminals; every transition broadcasts an emotion to /face and (optionally) a line to /voice.

The hub is deliberately tiny: it accepts websocket clients on /face, /body and /voice, holds a single current state, and on every transition broadcasts. The state machine is the creature's spine — it is the only place that knows "what we are doing now."

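# track3_creature/orchestrator.py: asyncio + websockets hub
import asyncio
import json

import websockets

from affect import AFFECT       # phase -> {emotion, say}
from intent import parse_tool_call

clients = {"face": set(), "body": set(), "voice": set()}
state = "idle"
TERMINAL = {"reporting", "not_found", "failed"}   # every terminal returns to idle
LINGER_S = 2.0   # assumption: hold the terminal expression briefly before idling

async def broadcast(channel, msg):
    """Send msg to every client on a channel and drop clients whose send fails."""
    dead = []
    for ws in list(clients[channel]):   # copy: the set may change while we await
        try:
            await ws.send(json.dumps(msg))
        except Exception:
            dead.append(ws)
    for ws in dead:
        clients[channel].discard(ws)

async def enter(new_state, **extra):
    """Switch state and broadcast its affect; unknown states fall back to idle's affect."""
    global state
    state = new_state
    a = AFFECT.get(new_state, AFFECT["idle"])
    # every transition: emotion -> /face, optional line -> /voice
    await broadcast("face", {"emotion": a["emotion"], "say": a.get("say"), **extra})
    if a.get("say"):
        await broadcast("voice", {"say": a["say"]})

async def run_task(intent):
    try:
        task = parse_tool_call(intent)          # -> LeRobot task string
    except (KeyError, ValueError):              # malformed tool call: do not crash the hub
        await enter("idle")
        return
    await enter("planning")
    await broadcast("body", {"task": task})     # dispatch SLOW loop
    # phases stream back from /body and drive enter(); see handler

async def handler(ws, path=None):
    # The websockets handler signature differs by version: older releases pass the
    # path as a second argument, newer ones expose it on the connection object.
    # Confirm against the version you pin.
    path = path or getattr(ws, "path", None) or ws.request.path
    ch = path.strip("/") or "face"
    if ch not in clients:                       # unknown channel: refuse, do not KeyError
        await ws.close()
        return
    clients[ch].add(ws)
    try:
        async for raw in ws:
            m = json.loads(raw)
            if ch == "voice" and "intent" in m:   # tool call arrived
                await enter("listening")
                await run_task(m["intent"])
            elif ch == "body" and "phase" in m:   # slow loop reporting
                await enter(m["phase"])           # phase IS the new state
                if m["phase"] in TERMINAL:        # reporting / not_found / failed -> idle
                    await asyncio.sleep(LINGER_S)
                    await enter("idle")
    finally:
        clients[ch].discard(ws)

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await enter("idle")
        await asyncio.Future()                    # run forever

if __name__ == "__main__":
    asyncio.run(main())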
Why a state machine, not a planner
The phases are the plan. The body client reports them, and the orchestrator simply maps each phase to an emotion. There is no central reasoning model deciding what to feel: feeling is a pure function of phase. This is Brooks' subsumption logic at the system level (section G).
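Because the hub trusts whatever phase the body reports, an explicit transition table is a cheap guard against out-of-order events. The sketch below is one possible guard, not part of the listing above: the `ALLOWED` edges are an assumption read off the state list and the body client's reporting order, and `next_state` and `replay` are hypothetical helper names.

```python
# Assumed edges: the happy path runs left to right, any active state may fail,
# and every terminal state returns to idle.
ALLOWED = {
    "idle":       {"listening"},
    "listening":  {"planning"},
    "planning":   {"searching", "failed"},
    "searching":  {"found", "not_found", "failed"},
    "found":      {"navigating", "failed"},
    "navigating": {"grasping", "failed"},
    "grasping":   {"reporting", "failed"},
    "reporting":  {"idle"},
    "not_found":  {"idle"},
    "failed":     {"idle"},
}

def next_state(current: str, requested: str) -> str:
    """Return the requested state if the edge exists, otherwise stay put."""
    return requested if requested in ALLOWED.get(current, set()) else current

def replay(phases, start: str = "idle") -> list:
    """Feed a sequence of requested states through the guard and record the trace."""
    state = start
    trace = [state]
    for requested in phases:
        state = next_state(state, requested)
        trace.append(state)
    return trace

happy = ["listening", "planning", "searching", "found",
         "navigating", "grasping", "reporting", "idle"]
print(replay(happy))
print(next_state("idle", "grasping"))   # rejected: stays "idle"
```

Rejecting an illegal transition silently is only one policy; logging it or forcing "failed" are equally reasonable choices for a demo.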

C. Intent (track3_creature/intent.py)

One tool. The voice side decides what; the body side decides how.

From module 5, Moshi's Inner-Monologue text stream feeds an intent watcher. When the user says something actionable, the model emits exactly one tool call. We keep the surface minimal — a single tool with three actions — because the LeRobot policy, not the LLM, is responsible for motor competence.

# the ONE tool schema offered to the voice model
DO_PHYSICAL_TASK = {
  "name": "do_physical_task",
  "description": "Move the creature's body to act on the world.",
  "parameters": {
    "type": "object",
    "properties": {
      "action":   {"type": "string", "enum": ["fetch", "explore", "stop"]},
      "object":   {"type": "string", "description": "e.g. 'red cup'"},
      "location": {"type": "string", "description": "optional hint, e.g. 'on the table'"}
    },
    "required": ["action"]
  }
}
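# track3_creature/intent.py: tool call -> LeRobot task string
def parse_tool_call(call: dict) -> str:
    """Turn a do_physical_task call into the language instruction for SmolVLA."""
    a   = call["action"]
    obj = (call.get("object") or "").strip()     # tolerate missing or null fields
    loc = (call.get("location") or "").strip()
    if a == "stop":
        return "stop and stay still"
    if a == "explore":
        return "explore the room and look around"
    if a != "fetch":
        raise ValueError(f"unknown action: {a!r}")
    if not obj:
        # without this guard the result would be "drive to the  and pick it up"
        raise ValueError("fetch needs an object")  # the caller should catch this
    # a == "fetch": build the SmolVLA language instruction
    where = f" {loc}" if loc else ""
    return f"drive to the {obj}{where} and pick it up"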
Connects to module 5
The voice model never speaks the task string. Its Inner-Monologue intent watcher emits the structured do_physical_task call; parse_tool_call turns it into the natural-language instruction SmolVLA was fine-tuned on. The LLM is a router, not a controller.
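Since the voice model is only a router, it is worth checking its output before it reaches the body. The following is a minimal sketch of such a check in plain Python, written against the do_physical_task schema shown earlier; `validate_tool_call` is a hypothetical helper, not part of intent.py, and the rule that fetch needs an object is an assumption the schema itself does not enforce.

```python
ACTIONS = ("fetch", "explore", "stop")
FIELDS = ("action", "object", "location")

def validate_tool_call(call: dict) -> list:
    """Return a list of problems; an empty list means the call is usable."""
    problems = []
    action = call.get("action")
    if action not in ACTIONS:
        problems.append(f"action must be one of {ACTIONS}, got {action!r}")
    for key in ("object", "location"):
        value = call.get(key)
        if value is not None and not isinstance(value, str):
            problems.append(f"{key} must be a string")
    obj = call.get("object")
    # Assumption: a fetch without an object cannot become a task string.
    if action == "fetch" and not (isinstance(obj, str) and obj.strip()):
        problems.append("fetch needs a non-empty object")
    extra = sorted(set(call) - set(FIELDS))
    if extra:
        problems.append(f"unknown fields: {extra}")
    return problems

print(validate_tool_call({"action": "fetch", "object": "red cup"}))
print(validate_tool_call({"action": "fetch"}))
```

A call that fails validation is a good moment for the voice side to ask a clarifying question instead of dispatching anything to the body.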

D. Body client (track3_creature/body_client.py)

The seam where all three tracks meet — implemented fully, not stubbed.

This is the most load-bearing file in the project. It consumes a {task}, sets it as SmolVLA's language instruction, runs the policy over the LeKiwi eval loop (base + arm together — Invariant B), detects grasp success, and streams {phase} events back. It runs the slow loop and reports into the fast loop — never the other way around.

# track3_creature/body_client.py: wraps the LeKiwi eval/inference loop
import asyncio
import json

import websockets
from lerobot.robots.lekiwi import make_lekiwi_client   # illustrative name, see note below
from lerobot.policies import make_policy_client        # talks to policy_server (mod 3)

HUB = "ws://localhost:8765/body"

# Placeholder values: tune these for your robot and task.
MAX_STEPS   = 600     # episode budget, in policy steps
GRIP_CLOSED = 0.2     # gripper position that counts as closed
LIFT_Z      = 0.10    # wrist height that counts as lifted off the table

def object_in_view(obs) -> bool:
    # project-specific perception hook: plug in your own detector here
    raise NotImplementedError

def closing_on_object(obs) -> bool:
    # project-specific hook: true once the arm is in its final approach
    raise NotImplementedError

def grasp_succeeded(obs) -> bool:
    # gripper closed past threshold AND arm lifted above table height
    return obs["gripper_pos"] < GRIP_CLOSED and obs["wrist_z"] > LIFT_Z

async def run_episode(ws, task: str):
    robot  = make_lekiwi_client()                  # Pi host (Invariant A)
    policy = make_policy_client(task=task)         # GPU server; task = instruction
    sent = set()

    async def emit(phase):
        # send each phase once so the hub is not flooded with duplicates
        if phase not in sent:
            sent.add(phase)
            await ws.send(json.dumps({"phase": phase}))

    await emit("searching")
    for step in range(MAX_STEPS):
        obs    = robot.get_observation()           # cameras + proprio
        action = await policy.select_action(obs)   # base + arm in ONE vector
        robot.send_action(action)                  # the policy drives the base
        if object_in_view(obs):
            await emit("found")
            await emit("navigating")
        if closing_on_object(obs):
            await emit("grasping")
        if grasp_succeeded(obs):
            await emit("reporting")
            return
        await asyncio.sleep(0)                     # yield to the event loop each step
    # fell out of the loop without success
    await emit("failed" if "found" in sent else "not_found")

async def main():
    async with websockets.connect(HUB) as ws:
        async for raw in ws:
            m = json.loads(raw)
            if "task" in m:
                try:
                    await run_episode(ws, m["task"])
                except Exception:
                    await ws.send(json.dumps({"phase": "failed"}))

if __name__ == "__main__":
    asyncio.run(main())
Verify before you run
LeRobot's client/policy-server APIs move fast. The factory names above (make_lekiwi_client, make_policy_client) are illustrative — pin your LeRobot commit and confirm the exact eval/inference entrypoints for that commit before wiring. The contract that matters is stable: set task → step obs/action → emit phase.
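The stable contract (set task → step obs/action → emit phase) can be exercised before any LeRobot entrypoint is confirmed. The sketch below is a hardware-free harness under stated assumptions: `FakeRobot`, `FakePolicy`, `run_contract` and `phases_for` are hypothetical names, and the observation keys `in_view`, `closing` and `grasped` are invented stand-ins for the real perception checks.

```python
import asyncio

class FakeRobot:
    """Replays scripted observations; the last one repeats. No hardware involved."""
    def __init__(self, script):
        self.script = list(script)
        self.i = 0
        self.actions = []

    def get_observation(self):
        obs = self.script[min(self.i, len(self.script) - 1)]
        self.i += 1
        return obs

    def send_action(self, action):
        self.actions.append(action)

class FakePolicy:
    """Returns a dummy action; stands in for the policy server."""
    def __init__(self, task):
        self.task = task

    async def select_action(self, obs):
        return {"task": self.task}

async def run_contract(robot, policy, emit, max_steps=10):
    sent = []

    async def phase(name):
        if name not in sent:              # each phase is reported once
            sent.append(name)
            await emit({"phase": name})

    await phase("searching")
    for _ in range(max_steps):
        obs = robot.get_observation()
        robot.send_action(await policy.select_action(obs))
        if obs.get("in_view"):
            await phase("found")
            await phase("navigating")
        if obs.get("closing"):
            await phase("grasping")
        if obs.get("grasped"):
            await phase("reporting")
            return
    await phase("failed" if "found" in sent else "not_found")

def phases_for(script, max_steps=10):
    """Run one scripted episode and return the phases it emitted, in order."""
    log = []

    async def emit(msg):
        log.append(msg["phase"])

    task = "drive to the red cup and pick it up"
    asyncio.run(run_contract(FakeRobot(script), FakePolicy(task), emit, max_steps))
    return log

SUCCESS = [{}, {"in_view": True}, {"in_view": True, "closing": True},
           {"in_view": True, "closing": True, "grasped": True}]
print(phases_for(SUCCESS))
```

Scripting the two failure paths as well (an object that never appears, and one that is seen but never grasped) checks that the episode always ends in a terminal phase.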

How this maps to my build  The body client is the one place where Track 1's policy, Track 2's task, and Track 3's phase stream all touch. It runs the slow loop and emits a handful of small JSON events (five on a successful run); that is the entire decoupling, made real.

E. Navigation (track3_creature/nav.py)

Three tiers. Default to A — because of Invariant B.

[Figure: navigation decision tree. Root question: is the object in frame? (read the wrist / base camera). Yes → Tier A · VLA-driven: the policy drives the base (default, Invariant B), no extra stack. No → Tier B · rotate-and-scan: spin in place + open-vocab detector for objects that may start off-frame, then hand off to A. Whole-room search → Tier C · LiDAR + Nav2: whole-room mapping on the ROS2 navigation stack, stretch / stub only.]
Decision tree. Default is Tier A; Tier B exists only to put the object in frame and then hand back to A; Tier C is a stretch stub.

Invariant B is why Tier A is the default and not an afterthought: the base velocities are in the action vector, so the policy approaches the object as part of grasping. Tier B is a thin shim — rotate in place, run an open-vocabulary detector, and once the object is in frame, hand control straight back to the VLA. Tier C is a stretch: a full mapping stack, used only for whole-room search.
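The Tier B shim described above can be sketched in a few lines. This is an illustration under assumptions, not the project's nav code: `rotate_and_scan` is a hypothetical helper, and the camera, detector and base are passed in as plain callables so that no particular detector or robot API is implied.

```python
from typing import Any, Callable, Optional

def rotate_and_scan(
    get_frame: Callable[[], Any],        # returns the current camera frame
    detect: Callable[[Any], bool],       # open-vocabulary detector: object in frame?
    rotate: Callable[[float], None],     # spin the base in place by N degrees
    step_deg: float = 30.0,
    max_deg: float = 360.0,
) -> Optional[float]:
    """Spin in place until the object is in frame.

    Returns the total angle turned when the object is found (time to hand
    control back to the VLA, Tier A), or None after one full turn.
    """
    turned = 0.0
    while turned < max_deg:
        if detect(get_frame()):
            return turned
        rotate(step_deg)
        turned += step_deg
    return None

# Tiny simulation: the "frame" is just the heading, the object sits at 90 degrees.
heading = {"deg": 0.0}

def fake_frame() -> float:
    return heading["deg"]

def fake_rotate(deg: float) -> None:
    heading["deg"] = (heading["deg"] + deg) % 360

found_at = rotate_and_scan(fake_frame, lambda frame: frame == 90.0, fake_rotate)
print("object in frame after turning", found_at, "degrees")
```

A None result maps naturally onto the not_found phase; anything else means the object is in view and the policy can take over.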

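# track3_creature/nav.py: pick the tier, default A
def choose_tier(obs, action: str) -> str:
    # action is the tool call's "action" field ("fetch", "explore" or "stop"),
    # not the task string; object_in_view is the same perception hook the
    # body client uses
    if object_in_view(obs):
        return "A"                # let the VLA drive (default)
    if action == "fetch":
        return "B"                # rotate-scan to put it in frame, then -> A
    return "C"                    # explore / whole-room (stretch)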
References (stretch)
LiDAR + Nav2 LeKiwi build: foxglove.dev/blog/upgrading-the-lekiwi-into-a-lidar-equipped-explorer. Autonomy scaffold: github.com/alfredang/lerobot. Treat both as Tier-C reading, not day-one scope.

F. Affect (track3_creature/affect.py)

The phase → {emotion, say} map that is the personality.

Personality is not a model — it is a lookup table. Each state the orchestrator enters maps to a face emotion and an optional spoken line. Swap this table and you change the creature's character without touching a single line of control code.

Phase | Emotion (face) | Spoken line (voice, optional)
idle | calm | (none)
listening | attentive | "mm-hm"
planning | thinking | "on it"
searching | curious | "let me look"
found | alert | "got it"
navigating | focused | (none)
grasping | effortful | (none)
reporting (success) | happy | "here you go"
not_found | sad | "I couldn't find it"
failed | apologetic | "sorry, I dropped it"
# track3_creature/affect.py — the table that is the character
AFFECT = {
  "idle":       {"emotion": "calm"},
  "listening":  {"emotion": "attentive", "say": "mm-hm"},
  "planning":   {"emotion": "thinking",   "say": "on it"},
  "searching":  {"emotion": "curious",    "say": "let me look"},
  "found":      {"emotion": "alert",      "say": "got it"},
  "navigating": {"emotion": "focused"},
  "grasping":   {"emotion": "effortful"},
  "reporting":  {"emotion": "happy",      "say": "here you go"},
  "not_found":  {"emotion": "sad",        "say": "I couldn't find it"},
  "failed":     {"emotion": "apologetic", "say": "sorry, I dropped it"},
}
[Figure: affect flow. {phase} → AFFECT lookup → {emotion, say}, which fans out to /face (emotion, plus mouth) and /voice (say, optional).]
Affect flow. One phase fans out into an emotion for the face and an optional line for the voice.
Upgrade path (callback to module 5)
Discrete emotions are the day-one design. The upgrade is a continuous valence–arousal space: instead of snapping to "happy" or "sad," map each phase to a point (v, a) and interpolate between targets so expressions blend smoothly — exactly the affect representation introduced on the Interaction page.
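The valence–arousal upgrade amounts to a second lookup table plus an interpolation step. The sketch below shows one way to do it; the coordinates in `VA_TARGETS` are made-up placeholders rather than values from the Interaction page, and `step_toward` and `blend` are hypothetical helper names.

```python
# Hypothetical (valence, arousal) targets per phase, each roughly in [-1, 1].
VA_TARGETS = {
    "idle":       (0.1, 0.1),
    "listening":  (0.2, 0.4),
    "planning":   (0.1, 0.5),
    "searching":  (0.3, 0.6),
    "found":      (0.6, 0.8),
    "navigating": (0.3, 0.6),
    "grasping":   (0.1, 0.8),
    "reporting":  (0.9, 0.6),
    "not_found":  (-0.6, 0.2),
    "failed":     (-0.5, 0.3),
}

def step_toward(current, target, alpha=0.2):
    """Move a fraction alpha of the way from current to target (exponential smoothing)."""
    cv, ca = current
    tv, ta = target
    return (cv + alpha * (tv - cv), ca + alpha * (ta - ca))

def blend(phase, current, frames, alpha=0.2):
    """Return the (v, a) point for each of the next `frames` render ticks."""
    target = VA_TARGETS.get(phase, VA_TARGETS["idle"])
    path = []
    for _ in range(frames):
        current = step_toward(current, target, alpha)
        path.append(current)
    return path

# The creature was sad (not_found) and a later attempt succeeds: ease toward happy.
path = blend("reporting", VA_TARGETS["not_found"], frames=30)
print(path[0], path[-1])
```

The face renderer then draws from the current (v, a) point on every frame instead of snapping between discrete labels; the phase events on the bus do not need to change.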
Legibility (HRI principle)
The face is not decoration. An expression that reliably reveals internal state lets a human read the robot and act sooner — "thinking" tells you to wait; "sad" tells you the search failed without parsing a log. Legible motion and affect reduce the human's uncertainty about what the machine will do next. Design every emotion to disambiguate a state, not merely to charm.

G. Embodied-integration theory

Why this architecture — symbol grounding, affordances, fast/slow, legibility — with the ETH cards.

Symbol grounding

How does the token "red cup" connect to pixels and motor torques? Classically this is the symbol grounding problem: symbols in a logic engine have no intrinsic link to the world. Our answer is to never separate them. SmolVLA ingests the language instruction and the camera frames in one network and emits actions — the grounding is implicit and learned, not a lookup from word to coordinate.

$$a_t = \pi_\theta(o_t, \ell), \quad \text{where } \ell = \text{"drive to the red cup and pick it up"}$$

The instruction $\ell$ and observation $o_t$ jointly condition the action $a_t$: the cup is grounded in the weights, not in a database.

Affordances

An affordance is what an object offers an agent — a cup affords grasping at its rim. We do not enumerate affordances; the policy infers them from demonstration data. The intent layer names the goal ("fetch the cup"), and the affordance — how to grip this shape — lives in the VLA.

Fast / slow decoupling — now fully concrete

Sections A–F made the principle real: the slow loop (body_client) runs GPU-bound episodes and emits a handful of small phase events (five on a successful run); the fast loop (face + voice) consumes affect at interactive latency. The websocket bus is the membrane between them. Neither loop ever waits on the other.

Legibility & HRI

The system is designed so a bystander can infer the creature's internal state from its outward behavior alone. This is the integration payoff: the same phase that controls the body also paints the face, so the expression can never lie about what the body is doing.

Maps to ETH Lectures 10 & 11

ETH L11 · Frontier
Brooks — Intelligence without Representation

Intelligent behavior emerges from many tight perception–action loops layered by subsumption — no central world model, no symbolic planner. Competence is decomposed by behavior, and the world is "its own best model."

Maps to my build
Our orchestrator is a behavior switchboard, not a planner: feeling is a pure function of phase, and the VLA closes the perception–action loop directly. No central model decides what to feel.
ETH L11 · Frontier
LeCun — Autonomous Machine Intelligence

A blueprint for autonomy: a learned world model plus a configurator and a two-speed actor — a fast reactive policy and a slower deliberative/planning module — coordinated around perception and a cost (intrinsic motivation).

Maps to my build
Motivates our fast/slow split by analogy: the body client is the slow, compute-heavy actor; the affect+face loop is the fast reactive presence. The orchestrator plays the configurator's role, routing intent to action.
ETH L11 · Frontier
Sutton — The Bitter Lesson

General methods that scale with compute and data beat hand-engineered, knowledge-laden systems over time. Search and learning win; clever priors plateau.

Maps to my build
We push competence into the learned VLA (Invariant B: even navigation is learned), not a hand-coded nav stack. The orchestrator stays thin so the model does the heavy lifting.
ETH L10 · Embodied Reasoning
Fu et al. — In-Context Imitation Learning

Robots can imitate from a handful of in-context demonstrations at test time, adapting behavior without weight updates — generalizing from prompts rather than retraining.

Maps to my build
Frames why a single fine-tuned SmolVLA can stretch across cup variants: the task string is an in-context conditioner, so new objects need fewer dedicated demos.
ETH L10 · Embodied Reasoning
Wang et al. — Voyager

An LLM-driven embodied agent that explores open-endedly, writes and reuses a growing library of skills, and improves via self-driven curriculum — reasoning at the task level while skills handle execution.

Maps to my build
Our intent layer is a one-skill Voyager: the LLM emits a high-level tool call and the VLA executes. The upgrade path is a skill library the orchestrator can compose.
ETH L10 · Test-time Scaling
Efficient Embodied Reasoning

Test-time deliberation helps, but embodied agents are latency-bound; the art is spending reasoning compute only where it changes the action, keeping the control loop fast.

Maps to my build
Exactly our fast/slow budget: deliberate at intent time (cheap, rare), react at control time (fast, continuous). We never put an LLM inside the motor loop.

How this maps to my build  Brooks tells us the orchestrator can be dumb (behaviors, not a model); LeCun tells us the slow/fast split is the right shape; Sutton tells us to keep the learned VLA central. The architecture in sections A–F is the direct consequence of these three.

Lab Close the loop — spoken command → expressive act, end to end

Bring up every track and run one sentence through the whole creature.

  1. Start the LeKiwi host on the Pi. Bring up the robot server so it streams cameras + proprio and accepts actions (module 3). Confirm the base and arm respond.
  2. Start the policy_server on the GPU. Load the fine-tuned SmolVLA checkpoint (Invariant A — brain off the robot). Smoke-test that it returns an action for a dummy obs.
  3. Start Moshi + the intent watcher (module 5). Confirm a spoken actionable sentence emits a do_physical_task tool call on its websocket.
  4. Start the face client (module 5) and connect it to ws://HUB:8765/face. It should render the idle emotion on connect.
  5. Start the orchestrator (orchestrator.py) on the hub host. Then start body_client.py, which connects to /body. Watch all three channels register.
  6. Speak: "grab the red cup." Trace the cascade: intent json → task string → searching → found → navigating → grasping → reporting, with the face changing emotion at each step and the voice speaking whenever the phase has a line.
  7. Watch both surfaces move together. The body drives + grasps via the VLA while the face/voice track the same phases. If the face freezes during a grasp, your loops are coupled: look for a blocking call running on an event loop, starting with a missing await asyncio.sleep(0) yield in the body client.
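For step 7, a frozen face can be diagnosed numerically instead of by eye. The sketch below measures event-loop lag, meaning how late a short `asyncio.sleep` wakes up, once next to a coroutine that blocks the loop and once next to one that moves the same work onto a thread. All names here are hypothetical, and `time.sleep` stands in for any blocking robot or network call.

```python
import asyncio
import time

async def measure_loop_lag(interval: float = 0.01, samples: int = 20) -> float:
    """Return the worst observed wake-up delay of the event loop, in seconds."""
    loop = asyncio.get_running_loop()
    worst = 0.0
    for _ in range(samples):
        start = loop.time()
        await asyncio.sleep(interval)
        worst = max(worst, loop.time() - start - interval)
    return worst

async def blocking_offender() -> None:
    await asyncio.sleep(0.02)
    time.sleep(0.1)                            # blocks the whole loop: the bug

async def polite_worker() -> None:
    await asyncio.sleep(0.02)
    await asyncio.to_thread(time.sleep, 0.1)   # same work, off the loop

async def demo():
    bad, _ = await asyncio.gather(measure_loop_lag(), blocking_offender())
    good, _ = await asyncio.gather(measure_loop_lag(), polite_worker())
    return bad, good

bad_lag, good_lag = asyncio.run(demo())
print(f"coupled: {bad_lag * 1000:.0f} ms lag, decoupled: {good_lag * 1000:.0f} ms lag")
```

Running `measure_loop_lag` as a background task in whichever process drives the face gives an early warning: a jump in lag during a grasp means something on that loop is blocking.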

Message-bus contract

Direction | Channel | Shape
voice → hub | /voice | {"intent": {"action": "fetch", "object": "red cup", "location": "on the table"}}
hub → body | /body | {"task": "drive to the red cup on the table and pick it up"}
body → hub | /body | {"phase": "navigating"}
hub → face | /face | {"emotion": "focused", "say": null, "mouth": 0.0}
hub → voice | /voice | {"say": "here you go"}

Keep messages tiny and one-shot. The bus carries intent, task, phase, and affect — nothing else crosses it.
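One way to keep the bus honest is to check every message against the contract at the edge. The sketch below is a minimal example: the key sets are read off the table above, while `check_message`, the direction labels and the 512-byte cap are assumptions introduced for illustration.

```python
import json

# (channel, direction) -> (required keys, optional keys), from the contract table.
CONTRACT = {
    ("voice", "to_hub"):   ({"intent"}, set()),
    ("body",  "from_hub"): ({"task"}, set()),
    ("body",  "to_hub"):   ({"phase"}, set()),
    ("face",  "from_hub"): ({"emotion"}, {"say", "mouth"}),
    ("voice", "from_hub"): ({"say"}, set()),
}
MAX_BYTES = 512   # hypothetical cap that enforces "tiny"

def check_message(channel: str, direction: str, raw: str) -> dict:
    """Parse one bus message and raise ValueError if it breaks the contract."""
    if len(raw.encode("utf-8")) > MAX_BYTES:
        raise ValueError("message too large for the bus")
    msg = json.loads(raw)
    if not isinstance(msg, dict):
        raise ValueError("bus messages must be JSON objects")
    required, optional = CONTRACT[(channel, direction)]
    keys = set(msg)
    if not required <= keys:
        raise ValueError(f"missing keys: {sorted(required - keys)}")
    if keys - required - optional:
        raise ValueError(f"unexpected keys: {sorted(keys - required - optional)}")
    return msg

print(check_message("body", "to_hub", '{"phase": "navigating"}'))
print(check_message("face", "from_hub",
                    '{"emotion": "focused", "say": null, "mouth": 0.0}'))
```

Rejecting unexpected keys is what stops camera frames or raw observations from creeping onto the bus over time.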

Track 3 done when

  • End-to-end run recorded: spoken command → expressive ack → drive → grasp via the fine-tuned VLA → expressive report.
  • The face and voice track the body's phases live, never freezing during a grasp (fast/slow stays decoupled).
  • not_found and failed paths also produce a legible, expressive response — not a silent hang.
  • Reliability around 60–70% end-to-end. That is a strong portfolio demo — honest scope, not research-grade.
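A reliability figure only means something alongside the number of trials behind it. The sketch below computes a Wilson score interval for a success rate; `wilson_interval` is a hypothetical helper and the trial counts are invented for illustration, not measured results.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

# Illustrative numbers only: 13 successful end-to-end runs out of 20 attempts.
low, high = wilson_interval(13, 20)
print(f"observed 65%, plausible range {low:.0%} to {high:.0%}")
```

With 20 runs, an observed 65% is compatible with anything from roughly 43% to 82%, so it helps to record the number of attempts next to the headline figure.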