A. The integration challenge
Lead invariants and the fast/slow seam, made concrete as glue code.
We have spent five modules building three subsystems in isolation. Integration is not a new capability — it is the discipline of wiring what we have without violating the two invariants that have governed every page so far.
policy_server (module 3).
Every integration decision must survive a network hop and its latency.
{phase} events;
the affect layer maps each phase to an {emotion, say} the fast loop consumes.
B. The orchestrator track3_creature/orchestrator.py
An asyncio + websockets hub whose centerpiece is a small, explicit state machine.
The hub is deliberately tiny: it accepts websocket clients on /face, /body
and /voice, holds a single current state, and on every transition broadcasts.
The state machine is the creature's spine — it is the only place that knows "what we are doing now."
```python
# track3_creature/orchestrator.py: asyncio + websockets hub
import asyncio
import json

import websockets

from affect import AFFECT            # phase -> {emotion, say}
from intent import parse_tool_call

clients = {"face": set(), "body": set(), "voice": set()}
state = "idle"


async def broadcast(channel, msg):
    dead = []
    # Iterate over a snapshot: clients can connect or drop while we await.
    for ws in list(clients[channel]):
        try:
            await ws.send(json.dumps(msg))
        except Exception:
            dead.append(ws)
    for ws in dead:
        clients[channel].discard(ws)


async def enter(new_state, **extra):
    global state
    state = new_state
    a = AFFECT.get(new_state, AFFECT["idle"])
    # every transition: emotion -> /face, optional line -> /voice
    await broadcast("face", {"emotion": a["emotion"], "say": a.get("say"), **extra})
    if a.get("say"):
        await broadcast("voice", {"say": a["say"]})


async def run_task(intent):
    try:
        task = parse_tool_call(intent)        # -> LeRobot task string
    except (KeyError, ValueError):            # malformed tool call: drop it
        await enter("idle")
        return
    await enter("planning")
    await broadcast("body", {"task": task})   # dispatch SLOW loop
    # phases stream back from /body and drive enter(); see handler


async def handler(ws, path=None):
    # Older websockets releases pass `path` as a second argument; newer ones
    # expose it on the connection. Check this against your installed version.
    if path is None:
        path = getattr(ws, "path", None) or ws.request.path
    ch = path.strip("/") or "face"
    if ch not in clients:                     # unknown channel: refuse
        await ws.close()
        return
    clients[ch].add(ws)
    try:
        async for raw in ws:
            m = json.loads(raw)
            if ch == "voice" and "intent" in m:      # tool call arrived
                await enter("listening")
                await run_task(m["intent"])
            elif ch == "body" and "phase" in m:      # slow loop reporting
                await enter(m["phase"])              # phase IS the new state
    finally:
        clients[ch].discard(ws)


async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await enter("idle")
        await asyncio.Future()                # run until cancelled


asyncio.run(main())
```
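The hub accepts any phase string as its new state. If you want the state machine to be explicit about which moves are legal, one option is a small transition table consulted before enter(). The sketch below is an assumption, not part of orchestrator.py: ALLOWED, can_transition and first_illegal are hypothetical names, and the edges are read off the phase order used in this module. A real build would probably also allow listening from any state.

```python
# Hypothetical transition guard. The edge set is an assumption based on the
# phase order described in this module; the hub itself does not enforce it.
ALLOWED = {
    "idle": {"listening"},
    "listening": {"planning", "idle"},
    "planning": {"searching", "failed"},
    "searching": {"found", "not_found", "failed"},
    "found": {"navigating", "failed"},
    "navigating": {"grasping", "failed"},
    "grasping": {"reporting", "failed"},
    "reporting": {"idle", "listening"},
    "not_found": {"idle", "listening"},
    "failed": {"idle", "listening"},
}


def can_transition(current: str, new: str) -> bool:
    """Return True if the move current -> new is in the table."""
    return new in ALLOWED.get(current, set())


def first_illegal(phases, start="idle"):
    """Walk a phase sequence; return the first illegal (from, to) pair, or None."""
    current = start
    for nxt in phases:
        if not can_transition(current, nxt):
            return (current, nxt)
        current = nxt
    return None
```

Running first_illegal over a recorded phase log is a cheap way to spot a body client that skips or reorders phases.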
C. Intent track3_creature/intent.py
One tool. The voice side decides what; the body side decides how.
From module 5, Moshi's Inner-Monologue text stream feeds an intent watcher. When the user says something actionable, the model emits exactly one tool call. We keep the surface minimal — a single tool with three actions — because the LeRobot policy, not the LLM, is responsible for motor competence.
```python
# the ONE tool schema offered to the voice model
DO_PHYSICAL_TASK = {
    "name": "do_physical_task",
    "description": "Move the creature's body to act on the world.",
    "parameters": {
        "type": "object",
        "properties": {
            "action": {"type": "string", "enum": ["fetch", "explore", "stop"]},
            "object": {"type": "string", "description": "e.g. 'red cup'"},
            "location": {"type": "string",
                         "description": "optional hint, e.g. 'on the table'"},
        },
        "required": ["action"],
    },
}
```
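```python
# track3_creature/intent.py: tool call -> LeRobot task string
def parse_tool_call(call: dict) -> str:
    a = call["action"]
    # `or ""` also covers an explicit JSON null for the optional fields
    obj = (call.get("object") or "").strip()
    loc = (call.get("location") or "").strip()
    if a == "stop":
        return "stop and stay still"
    if a == "explore":
        return "explore the room and look around"
    if a != "fetch":
        raise ValueError(f"unknown action: {a!r}")
    if not obj:
        # without an object the instruction would read "drive to the  and pick it up"
        raise ValueError("fetch needs an object")
    # a == "fetch": build the SmolVLA language instruction
    where = f" {loc}" if loc else ""
    return f"drive to the {obj}{where} and pick it up"
```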
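Because the voice model's output is untrusted text, it can help to check a call against the schema before parse_tool_call sees it. A minimal sketch, assuming a hand-rolled check rather than a JSON Schema library: validate_call and SCHEMA_PARAMS are hypothetical names, and the check covers only the required and enum keywords the schema uses.

```python
# Hypothetical pre-check for a do_physical_task call. It covers only the
# "required" and "enum" keywords, which is all the tool schema uses.
SCHEMA_PARAMS = {
    "properties": {
        "action": {"type": "string", "enum": ["fetch", "explore", "stop"]},
        "object": {"type": "string"},
        "location": {"type": "string"},
    },
    "required": ["action"],
}


def validate_call(call: dict, params: dict = SCHEMA_PARAMS) -> list:
    """Return a list of problems; an empty list means the call looks usable."""
    problems = []
    for key in params["required"]:
        if key not in call:
            problems.append(f"missing required field: {key}")
    for key, value in call.items():
        spec = params["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected field: {key}")
            continue
        if not isinstance(value, str):
            problems.append(f"{key} must be a string")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{key} must be one of {spec['enum']}")
    # Not expressed in the schema, but a fetch without an object cannot be
    # turned into a sensible instruction.
    if call.get("action") == "fetch" and not str(call.get("object") or "").strip():
        problems.append("fetch needs an object")
    return problems
```

Returning a list of problems instead of raising lets the caller decide whether to ignore the call, log it, or ask the user to repeat.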
The voice model emits one do_physical_task call; parse_tool_call turns it into the natural-language
instruction SmolVLA was fine-tuned on. The LLM is a router, not a controller.
D. Body client track3_creature/body_client.py
The seam where all three tracks meet — implemented fully, not stubbed.
This is the most load-bearing file in the project. It consumes a {task}, sets it as
SmolVLA's language instruction, runs the policy over the LeKiwi eval loop (base + arm together —
Invariant B), detects grasp success, and streams {phase} events back. It runs the
slow loop and reports into the fast loop — never the other way around.
```python
# track3_creature/body_client.py: wraps the LeKiwi eval/inference loop
import asyncio
import json

import websockets

# Illustrative imports: confirm the real entrypoints for your pinned LeRobot commit.
from lerobot.robots.lekiwi import make_lekiwi_client
from lerobot.policies import make_policy_client   # talks to policy_server (mod 3)

HUB = "ws://localhost:8765/body"

# Placeholders, not measured values: calibrate all three on your own robot.
MAX_STEPS = 600
GRIP_CLOSED = 0.1
LIFT_Z = 0.15
# object_in_view(obs) and closing_on_object(obs) are perception helpers assumed
# to be defined elsewhere in the project; they are not shown in this file.


def grasp_succeeded(obs) -> bool:
    # gripper closed past threshold AND arm lifted above table height
    return obs["gripper_pos"] < GRIP_CLOSED and obs["wrist_z"] > LIFT_Z


async def run_episode(ws, task: str):
    robot = make_lekiwi_client()              # Pi host (Invariant A)
    policy = make_policy_client(task=task)    # GPU server; task = instruction
    await ws.send(json.dumps({"phase": "searching"}))
    seen = grasping = False
    for step in range(MAX_STEPS):
        obs = robot.get_observation()             # cameras + proprio
        action = await policy.select_action(obs)  # base + arm in ONE vector
        robot.send_action(action)                 # the policy drives the base
        if not seen and object_in_view(obs):
            seen = True
            await ws.send(json.dumps({"phase": "found"}))
            await ws.send(json.dumps({"phase": "navigating"}))
        if not grasping and closing_on_object(obs):
            grasping = True                       # report the phase once, not every step
            await ws.send(json.dumps({"phase": "grasping"}))
        if grasp_succeeded(obs):
            await ws.send(json.dumps({"phase": "reporting"}))
            return
        await asyncio.sleep(0)                    # yield: never block the loop
    # fell out of the loop without success
    await ws.send(json.dumps({"phase": "not_found" if not seen else "failed"}))


async def main():
    async with websockets.connect(HUB) as ws:
        async for raw in ws:
            m = json.loads(raw)
            if "task" in m:
                try:
                    await run_episode(ws, m["task"])
                except Exception:
                    await ws.send(json.dumps({"phase": "failed"}))


asyncio.run(main())
```
The LeRobot helper names shown (make_lekiwi_client, make_policy_client) are illustrative: pin your LeRobot
commit and confirm the exact eval/inference entrypoints for that commit before wiring.
The contract that matters is stable: set task → step obs/action → emit phase.
How this maps to my build
The body client is the one place where Track 1's policy, Track 2's task, and Track 3's phase stream all touch. It runs the slow loop and emits a handful of small JSON phase events (five on the success path, plus not_found or failed). That is the entire decoupling, made real.
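The set task → step → emit phase contract can be exercised without a robot or a GPU. The sketch below is a hypothetical stand-in for run_episode: it takes the step at which the object is first seen, the step at which the gripper closes in, and the step at which the grasp succeeds, and returns the phase list an episode with those timings would send, reporting each phase at most once. None of these parameter names come from LeRobot.

```python
from typing import Optional


def simulate_phases(max_steps: int,
                    seen_at: Optional[int] = None,
                    closing_at: Optional[int] = None,
                    success_at: Optional[int] = None) -> list:
    """Replay the body client's reporting rules with scripted step indices."""
    phases = ["searching"]
    seen = grasping = False
    for step in range(max_steps):
        if not seen and seen_at is not None and step >= seen_at:
            seen = True
            phases += ["found", "navigating"]
        if not grasping and closing_at is not None and step >= closing_at:
            grasping = True
            phases.append("grasping")
        if success_at is not None and step >= success_at:
            phases.append("reporting")
            return phases
    # Same fall-through rule as the body client: never seen -> not_found.
    phases.append("failed" if seen else "not_found")
    return phases
```

Feeding these lists to the hub from a fake /body client is one way to rehearse the face and voice reactions before the policy is trained.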
E. Navigation track3_creature/nav.py
Three tiers. Default to A — because of Invariant B.
Invariant B is why Tier A is the default and not an afterthought: the base velocities are in the action vector, so the policy approaches the object as part of grasping. Tier B is a thin shim — rotate in place, run an open-vocabulary detector, and once the object is in frame, hand control straight back to the VLA. Tier C is a stretch: a full mapping stack, used only for whole-room search.
```python
# track3_creature/nav.py: pick the tier, default A
def choose_tier(obs, task):
    if object_in_view(obs):
        return "A"    # let the VLA drive (default)
    if task.action == "fetch":
        return "B"    # rotate-scan to put it in frame, then -> A
    return "C"        # explore / whole-room (stretch)
```
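Tier B is small enough to sketch in full. The version below assumes three injected callables that are not part of any library named here: get_frame returns a camera image, detect reports whether the target is in that frame, and rotate turns the base by a fixed angle. It scans at most one full turn, then either hands back to Tier A or escalates to Tier C. The 30-degree step is an arbitrary default.

```python
from typing import Any, Callable


def rotate_scan(get_frame: Callable[[], Any],
                detect: Callable[[Any], bool],
                rotate: Callable[[float], None],
                step_deg: float = 30.0) -> str:
    """Rotate in place until detect() fires; return the tier to run next."""
    turns = int(360 // step_deg)
    for _ in range(turns):
        if detect(get_frame()):
            return "A"      # object in frame: hand control back to the VLA
        rotate(step_deg)
    return "C"              # nothing found in a full turn: whole-room search
```

Injecting the three callables keeps the shim testable with fakes and leaves the choice of detector and base driver open.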
foxglove.dev/blog/upgrading-the-lekiwi-into-a-lidar-equipped-explorer.
Autonomy scaffold: github.com/alfredang/lerobot. Treat both as Tier-C reading,
not day-one scope.
F. Affect track3_creature/affect.py
The phase → {emotion, say} map that is the personality.
Personality is not a model — it is a lookup table. Each state the orchestrator enters maps to a face emotion and an optional spoken line. Swap this table and you change the creature's character without touching a single line of control code.
| Phase | Emotion (face) | Spoken line (voice, optional) |
|---|---|---|
| idle | calm | (none) |
| listening | attentive | "mm-hm" |
| planning | thinking | "on it" |
| searching | curious | "let me look" |
| found | alert | "got it" |
| navigating | focused | (none) |
| grasping | effortful | (none) |
| reporting (success) | happy | "here you go" |
| not_found | sad | "I couldn't find it" |
| failed | apologetic | "sorry, I dropped it" |
```python
# track3_creature/affect.py: the table that is the character
AFFECT = {
    "idle":       {"emotion": "calm"},
    "listening":  {"emotion": "attentive",  "say": "mm-hm"},
    "planning":   {"emotion": "thinking",   "say": "on it"},
    "searching":  {"emotion": "curious",    "say": "let me look"},
    "found":      {"emotion": "alert",      "say": "got it"},
    "navigating": {"emotion": "focused"},
    "grasping":   {"emotion": "effortful"},
    "reporting":  {"emotion": "happy",      "say": "here you go"},
    "not_found":  {"emotion": "sad",        "say": "I couldn't find it"},
    "failed":     {"emotion": "apologetic", "say": "sorry, I dropped it"},
}
```
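To make the swap-the-table claim concrete, here is a sketch with a second, invented personality. The GRUMPY table and the react helper are hypothetical; only the lookup with an idle fallback mirrors what the orchestrator's enter() does.

```python
# Two tables, one lookup. CHEERFUL copies a few rows from the affect table;
# GRUMPY is an invented alternative used only to illustrate the swap.
CHEERFUL = {
    "idle": {"emotion": "calm"},
    "searching": {"emotion": "curious", "say": "let me look"},
    "not_found": {"emotion": "sad", "say": "I couldn't find it"},
}
GRUMPY = {
    "idle": {"emotion": "bored"},
    "searching": {"emotion": "annoyed", "say": "fine, I'll look"},
    "not_found": {"emotion": "smug", "say": "told you it wasn't there"},
}


def react(table: dict, phase: str) -> tuple:
    """Return (emotion, say) for a phase, falling back to idle like enter() does."""
    entry = table.get(phase, table["idle"])
    return entry["emotion"], entry.get("say")
```

The control flow never changes between the two characters; only the dictionary passed to react does.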
G. Embodied-integration theory
Why this architecture — symbol grounding, affordances, fast/slow, legibility — with the ETH cards.
Symbol grounding
How does the token "red cup" connect to pixels and motor torques? Classically this is the symbol grounding problem: symbols in a logic engine have no intrinsic link to the world. Our answer is to never separate them. SmolVLA ingests the language instruction and the camera frames in one network and emits actions — the grounding is implicit and learned, not a lookup from word to coordinate.
The instruction ℓ and observation o_t jointly condition the action: the cup is grounded in the weights, not a database.
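One common way to write this conditioning, assuming a stochastic policy with learned parameters θ (the symbols π and θ are notation introduced here for illustration, not taken from the SmolVLA documentation):

```latex
a_t \sim \pi_\theta\left(\,\cdot \mid o_t,\ \ell\,\right)
```

Here a_t is the action vector at step t, which for this build carries base and arm commands together.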
Affordances
An affordance is what an object offers an agent — a cup affords grasping at its rim. We do not enumerate affordances; the policy infers them from demonstration data. The intent layer names the goal ("fetch the cup"), and the affordance — how to grip this shape — lives in the VLA.
Fast / slow decoupling — now fully concrete
Sections A–F made the principle real: the slow loop (body_client) runs GPU-bound episodes and emits a few small phase events (five on the success path, plus not_found or failed); the fast loop (face + voice) consumes affect at interactive latency. The websocket bus is the membrane between them. Neither loop ever waits on the other.
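The cooperative yield behind this decoupling can be seen in miniature inside a single event loop. The sketch below is an analogy, not the real system (where the loops live in separate processes joined by the websocket bus): slow_loop and fast_loop are hypothetical stand-ins, and the only mechanism shown is that awaiting asyncio.sleep(0) lets the other coroutine run.

```python
import asyncio


async def slow_loop(steps: int, log: list) -> None:
    """Stand-in for an episode: one unit of work per step, then a yield."""
    for i in range(steps):
        log.append(("slow", i))
        await asyncio.sleep(0)      # the yield that keeps the fast loop alive


async def fast_loop(ticks: int, log: list) -> None:
    """Stand-in for face/voice updates."""
    for i in range(ticks):
        log.append(("fast", i))
        await asyncio.sleep(0)


async def demo() -> list:
    log: list = []
    await asyncio.gather(slow_loop(3, log), fast_loop(3, log))
    return log


log = asyncio.run(demo())
```

If the await in slow_loop is replaced by blocking work, the fast entries only appear after the slow loop finishes, which is the single-process version of a face that freezes during a grasp.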
Legibility & HRI
The system is designed so a bystander can infer the creature's internal state from its outward behavior alone. This is the integration payoff: the same phase that controls the body also paints the face, so the expression can never lie about what the body is doing.
Maps to ETH Lectures 10 & 11
Brooks: Intelligent behavior emerges from many tight perception–action loops layered by subsumption, with no central world model and no symbolic planner. Competence is decomposed by behavior, and the world is "its own best model."
LeCun: A blueprint for autonomy: a learned world model plus a configurator and a two-speed actor (a fast reactive policy and a slower deliberative/planning module), coordinated around perception and a cost (intrinsic motivation).
Sutton: General methods that scale with compute and data beat hand-engineered, knowledge-laden systems over time. Search and learning win; clever priors plateau.
Robots can imitate from a handful of in-context demonstrations at test time, adapting behavior without weight updates — generalizing from prompts rather than retraining.
An LLM-driven embodied agent that explores open-endedly, writes and reuses a growing library of skills, and improves via self-driven curriculum — reasoning at the task level while skills handle execution.
Test-time deliberation helps, but embodied agents are latency-bound; the art is spending reasoning compute only where it changes the action, keeping the control loop fast.
How this maps to my build
Brooks tells us the orchestrator can be dumb (behaviors, not a model); LeCun tells us the slow/fast split is the right shape; Sutton tells us to keep the learned VLA central. The architecture in sections A–F is the direct consequence of these three.
Bring up every track and run one sentence through the whole creature.
- Start the LeKiwi host on the Pi. Bring up the robot server so it streams cameras + proprio and accepts actions (module 3). Confirm the base and arm respond.
- Start the policy_server on the GPU. Load the fine-tuned SmolVLA checkpoint (Invariant A — brain off the robot). Smoke-test that it returns an action for a dummy obs.
- Start Moshi + the intent watcher (module 5). Confirm a spoken actionable sentence emits a do_physical_task tool call on its websocket.
- Start the face client (module 5) and connect it to ws://HUB:8765/face. It should render the idle emotion on connect.
- Start the orchestrator (orchestrator.py) on the hub host. Then start body_client.py, which connects to /body. Watch all three channels register.
- Speak: "grab the red cup." Trace the cascade: intent json → task string → searching → found → navigating → grasping → reporting, with the face changing emotion and the voice saying its line at each step.
- Watch both surfaces move together. The body drives + grasps via the VLA while the face/voice track the same phases. If the face freezes during a grasp, your loops are coupled: fix the missing await asyncio.sleep(0) yield.
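When debugging the bring-up it can help to replace the spoken sentence with a scripted one. The sketch below sends a single intent to the hub's /voice channel and collects whatever the hub sends back. The URL, the timeout, and the intent_message and send_intent helpers are assumptions; the message shape follows the bus contract used throughout this module.

```python
import asyncio
import json


def intent_message(action: str, obj: str = "", location: str = "") -> str:
    """Build the JSON the voice side would send for one tool call."""
    intent = {"action": action}
    if obj:
        intent["object"] = obj
    if location:
        intent["location"] = location
    return json.dumps({"intent": intent})


async def send_intent(url: str, message: str, listen_s: float = 5.0) -> list:
    """Send one intent, then collect hub replies until listen_s of silence."""
    import websockets  # imported here so intent_message works without it

    replies = []
    async with websockets.connect(url) as ws:
        await ws.send(message)
        try:
            while True:
                raw = await asyncio.wait_for(ws.recv(), listen_s)
                replies.append(json.loads(raw))
        except asyncio.TimeoutError:
            pass                    # quiet for listen_s seconds: stop listening
    return replies


# Example (needs a running hub):
# asyncio.run(send_intent("ws://localhost:8765/voice",
#                         intent_message("fetch", "red cup", "on the table")))
```

Because the hub echoes spoken lines to every /voice client, the replies list should contain the say messages for each phase that has one.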
Message-bus contract
| Direction | Channel | Shape |
|---|---|---|
| voice → hub | /voice | {"intent": {"action": "fetch", "object": "red cup", "location": "on the table"}} |
| hub → body | /body | {"task": "drive to the red cup on the table and pick it up"} |
| body → hub | /body | {"phase": "navigating"} |
| hub → face | /face | {"emotion": "focused", "say": null, "mouth": 0.0} |
| hub → voice | /voice | {"say": "here you go"} |
Keep messages tiny and one-shot. The bus carries intent, task, phase, and affect — nothing else crosses it.
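One way to hold the line on "nothing else crosses it" is to classify every message by its identifying key and reject the rest. A minimal sketch; classify and KINDS are hypothetical names, and the key-to-kind mapping is read off the contract table.

```python
import json

# Identifying key -> message kind, in the order they are checked. "emotion"
# comes before "say" because face messages carry both keys.
KINDS = {
    "intent": "intent",
    "task": "task",
    "phase": "phase",
    "emotion": "affect",
    "say": "say",
}


def classify(raw: str) -> str:
    """Return the message kind, or raise ValueError if it is off-contract."""
    msg = json.loads(raw)
    if not isinstance(msg, dict):
        raise ValueError("bus messages must be JSON objects")
    for key, kind in KINDS.items():
        if key in msg:
            return kind
    raise ValueError(f"unknown message keys: {sorted(msg)}")
```

Calling classify at the top of each client's receive loop turns a silent protocol drift into an immediate, visible error.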
Track 3 done when
- End-to-end run recorded: spoken command → expressive ack → drive → grasp via the fine-tuned VLA → expressive report.
- The face and voice track the body's phases live, never freezing during a grasp (fast/slow stays decoupled).
- not_found and failed paths also produce a legible, expressive response, not a silent hang.
- Reliability around 60–70% end-to-end. That is a strong portfolio demo: honest scope, not research-grade.