Research Note
Hardware
AI Agents

Embodied Agents: Hardware for AI

The intelligence is here. The body isn't. That's the bottleneck nobody is building for.

May 2026 · Draft v0.2

The State of Play

AI agents crossed an inflection point in 2025. Reasoning models can write production code, manage complex workflows, and hold multi-turn conversations indistinguishable from a human colleague. The intelligence layer is no longer the constraint. But look at where these agents actually live -- and what they can actually perceive -- and the picture changes completely.

$23B: Embodied AI market by 2030 (39% CAGR · MarketsandMarkets[18])
156M: Smart speakers shipped in 2025 ($21.4B global market · ElectroIQ[2])
$48B: Agentic AI market by 2030 (6x growth from 2025 · BCC Research)

Two numbers tell the story. The agentic AI market is projected to hit $48 billion by 2030 -- a 6x increase from 2025. Meanwhile, 156 million smart speakers shipped last year. There's enormous demand for AI in the physical world. But the devices people actually own are running decade-old voice interaction models that can barely set a timer reliably.
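For a sense of scale, the 6x figure can be turned into an implied starting point and an annual growth rate. The sketch below uses only the two numbers quoted above ($48 billion, 6x) and assumes the growth runs over the five years from 2025 to 2030; the function names are illustrative.

```python
def implied_base(target: float, multiple: float) -> float:
    """Starting value implied by an end value and a total growth multiple."""
    return target / multiple


def cagr(multiple: float, years: int) -> float:
    """Compound annual growth rate for a total growth multiple over `years`."""
    return multiple ** (1 / years) - 1


# Inputs from the text: $48B by 2030, 6x growth from 2025.
base_2025 = implied_base(48.0, 6.0)      # in $ billions
rate = cagr(6.0, 2030 - 2025)            # assumes a five-year horizon

print(f"Implied 2025 base: ${base_2025:.0f}B")
print(f"Implied CAGR 2025-2030: {rate:.1%}")
```

Under those assumptions the projection implies a 2025 base of about $8 billion and growth of roughly 43% a year.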

The agents are brilliant. The bodies are braindead.

The Problem: Agents Are Blind, Deaf, and Numb

Claude can reason through a 50-page legal contract. GPT can generate production React components. OpenClaw can orchestrate multi-step workflows across dozens of APIs. But ask any of them what's happening in the room you're sitting in and they have absolutely no idea.

This isn't a minor limitation. It's the fundamental bottleneck. Every AI agent in existence is sensory-deprived.

They can't perceive without being told

Every piece of physical-world context must be manually narrated by the user. "It's raining." "I'm in the kitchen." "There's a package at the door." You are the sensor -- and you're an unreliable one. You forget to mention things, you don't think to share context, and you definitely don't narrate your environment in real-time. The agent is only as aware as you make it.

They can't act proactively on physical events

An agent can monitor your email or watch a stock price. It can poll any API. But it cannot notice your toddler walking toward the garage, smoke coming from the stove, someone at your door, or that you've been motionless at your desk for five hours. Zero physical-world triggers can reach it.

They can't maintain continuous context

Every interaction is session-based. You open a chat, describe your situation, get help, close the chat. The agent has no ambient understanding of your life. It doesn't know you just had a tense phone call, that your house is freezing, that three friends came over, or that you've been practicing guitar for two hours. Every conversation starts from zero physical context.

They can't understand what words can't describe

"What's wrong with this circuit board?" "Is this rash something to worry about?" "What bird is making that sound?" Some questions are inherently visual, auditory, or spatial. Today you photograph it, upload it, type a prompt, wait. The friction kills the use case -- most people just don't bother.

"The biggest limitation of AI agents today isn't intelligence -- it's input bandwidth. They're incredibly smart but nearly blind and deaf."

What Embodiment Actually Unlocks

The shift from disembodied to embodied isn't incremental. It changes the category of problem an agent can solve.

1. From Reactive to Proactive

Today: you ask a question, the agent answers. With embodiment: the agent notices things and acts before you ask. It sees the FedEx truck pull up. It hears the baby crying in the next room. It detects CO2 levels spiking because you've had the windows closed all day. The agent becomes the one who initiates.

2. From Session-Based to Ambient

Today: context lasts as long as the conversation. With embodiment: the agent has a persistent model of your physical environment. It knows who's home, what room you're in, what the temperature is. When you talk to it, it already has context. "Turn down the heat" works because it already knows which room you're in.
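A minimal sketch of how "turn down the heat" could resolve without the user naming a room, assuming a hypothetical in-memory `HomeState` that presence sensors keep up to date. None of these names come from a real SDK; they only illustrate the idea of persistent ambient context.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RoomState:
    occupied: bool = False
    temperature_c: Optional[float] = None


@dataclass
class HomeState:
    """Hypothetical persistent model of the home, updated by sensor events."""
    rooms: dict[str, RoomState] = field(default_factory=dict)
    user_room: Optional[str] = None  # last room where presence was detected

    def update_presence(self, room: str, occupied: bool) -> None:
        self.rooms.setdefault(room, RoomState()).occupied = occupied
        if occupied:
            self.user_room = room
        elif self.user_room == room:
            self.user_room = None


def resolve_target_room(state: HomeState, spoken_room: Optional[str]) -> Optional[str]:
    """Prefer a room the user named; otherwise fall back to where they are."""
    return spoken_room or state.user_room


state = HomeState()
state.update_presence("bedroom", True)
print(resolve_target_room(state, None))       # room inferred from presence
print(resolve_target_room(state, "kitchen"))  # explicit room wins
```

The point of the design is that the command parser never needs to ask "which room?" as long as the ambient state is fresh.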

3. From Text-In/Text-Out to Multimodal

Today: you type or dictate, agent responds with text. With embodiment: you hold up a component and ask "what is this?" You play a chord and ask "what am I playing?" You point at a weed and ask "should I pull that?" Interaction bandwidth goes from ~40 words per minute to the full richness of the physical world.

4. From Isolated to Spatially Aware

Today: agents have zero concept of physical space. With embodiment and multiple sensor nodes: the agent knows your home's layout, which rooms are occupied, where sounds originate. It coordinates actions across spaces. "Someone's at the front door" isn't a notification you configured -- it's something the agent just knows.

Everything That's Been Tried -- and Why It Failed

The market has attempted embodied AI multiple times. Every attempt has been vertically integrated: one company's hardware running one company's AI. The results range from disappointing to catastrophic.

The Smart Speaker Era: $10 Billion and Counting

Amazon has reportedly lost over $10 billion on Alexa since its launch.[1] Despite 40% of US households owning a smart speaker,[2] usage patterns are stubbornly shallow: 75% use them for music, and the most common voice command is still "set a timer."[3] Alexa+ -- Amazon's 2025 generative AI upgrade ($19.99/month or free with Prime) -- promised agentic capabilities and natural conversation. User reception has been mixed at best, with XDA Developers bluntly noting that "Alexa+ didn't revolutionize anything in 2025."[4]

The problem isn't the form factor. It's that these devices are voice-in/voice-out pipes to a locked cloud. They have microphones and speakers. That's it. No camera, no environmental sensing, no presence detection, no expandable sensor stack. And the AI is permanently welded to the hardware: you get Amazon's agent, running Amazon's models, in Amazon's cloud. When the agent is mediocre, the hardware is worthless.

40%: US households with smart speakers[2]
75%: Users who mainly play music[3]
$10B+: Amazon's cumulative Alexa losses[1]

The AI Gadget Graveyard: Humane Pin & Rabbit R1

2024 produced two high-profile attempts to give AI agents physical form. Both were disasters.

Humane AI Pin ($699 + $24/month) projected its interface onto the user's palm and required a monthly subscription. Returns were reported to be outpacing sales: more than $1 million in returns against $9 million in total sales, with 7,000 units sitting in unsellable inventory.[5] Fire risk concerns meant returns couldn't even be refurbished.

Rabbit R1 ($199) was a dedicated hardware device for an AI agent that could have been an app. Tom's Guide called it "one of the worst gadgets I've ever reviewed."[6] It suffered from unreliable voice commands, inaccurate answers, 4-hour battery life, and a security flaw that exposed user data through hardcoded API keys.[7] One tally of AI product failures that includes Humane and Rabbit R1 puts combined losses at over $5 billion within 12 months.[8]

Why they failed: Both were single-purpose devices running single-purpose agents with no sensor stack beyond a basic camera and mic. They competed with the phone in your pocket and lost -- because a phone has more sensors, more connectivity, and better AI through cloud APIs. The lesson: a dedicated AI device only wins if it can do something a phone fundamentally cannot.

Meta Ray-Ban Smart Glasses: Right Direction, Wrong Model

Meta's partnership with EssilorLuxottica has produced the most commercially successful AI hardware product of this generation. Ray-Ban Meta glasses sold 2 million units with sales tripling year-over-year.[9] Production is scaling toward 10 million units annually by late 2026,[10] and the glasses outsell traditional Ray-Bans in 60% of EMEA retail stores.[11] Global smart glasses shipments surged 110% in H1 2025, with Meta capturing 73% market share.[12]

This validates consumer appetite for AI-in-the-physical-world. People want it. But the glasses are locked to Meta AI, processed in Meta's cloud, with no developer SDK and no ability to run a different agent. When Meta's AI is good, the glasses are great. When it's not, you're stuck. And the form factor -- great for mobile first-person context -- can't do what a stationary hub can: persistent room monitoring, environmental sensing, multi-room coordination, always-on spatial awareness.

Home Assistant: Right Philosophy, Missing Hardware

Home Assistant -- the open-source, self-hosted smart home platform run by the Open Home Foundation -- is the closest thing to a working model for agent-agnostic physical AI.[13] It integrates with over 1,000 devices, runs locally, supports OpenAI/Claude/Ollama for LLM-powered automation, and its "Year of the Voice" initiative produced open-source voice hardware. The 2025.8 release added "AI Tasks" for structured agent delegation.[14]

But Home Assistant stitches together third-party hardware that was never designed for AI agent use. A Zigbee motion sensor from 2019, a Ring camera talking to Ring's cloud, an ESP32 mic module with a 3-meter pickup range. There is no coherent, optimized sensor stack purpose-built for giving AI agents rich physical-world perception. The software intelligence is there. The hardware body is a Frankenstein.

The gap, restated: No product gives AI agents a rich, multi-sensor physical body with an open SDK. Smart speakers can hear but cannot see or sense. AI gadgets died because they brought nothing a phone couldn't do. Smart glasses are locked to one vendor. Home Assistant has the right software architecture but jury-rigged hardware. The opportunity is building the missing body.

Competitive Landscape: Head-to-Head

Every player in this space has made the same mistake: welding the intelligence to the hardware. The table below compares sensor coverage, agent flexibility, privacy posture, and developer access.

Amazon Echo / Alexa+ ($19.99/mo or Prime)
  • Agent flexibility: × Alexa only. Skills are sandboxed wrappers, not real agent access.
  • Sensor stack: Mic array + speaker. Some models add a screen. No camera on most. No environmental sensors.
  • Privacy model: Cloud only. Poor track record. Recordings sent to Amazon.
  • Dev SDK: Skills Kit (limited)
  • Verdict: Mediocre

Google Nest / Home (Free with Google account)
  • Agent flexibility: × Google Assistant only. Actions deprecated; Gemini pivot underway.
  • Sensor stack: Mic + speaker. Nest Hub adds camera + temperature sensor. Limited sensing.
  • Privacy model: Cloud only. Data feeds Google's ad model.
  • Dev SDK: Actions on Google (being sunset)
  • Verdict: Mediocre

Apple HomePod ($299)
  • Agent flexibility: × Siri only. SiriKit is extremely restricted.
  • Sensor stack: Mic + speaker + temperature + humidity. No camera, no presence.
  • Privacy model: Best-in-class. On-device processing. Apple's privacy commitment.
  • Dev SDK: SiriKit (very limited)
  • Verdict: Mediocre

Meta Ray-Ban Glasses ($299-$799)
  • Agent flexibility: × Meta AI only. No third-party agents. No SDK.
  • Sensor stack: Camera + mic + speaker. No environmental sensors. Mobile only.
  • Privacy model: Cloud only. All processing in Meta's cloud.
  • Dev SDK: None
  • Verdict: Good hardware, locked ecosystem

Rabbit R1 / Humane Pin ($199 / $699+$24/mo)
  • Agent flexibility: × Custom locked agent. R1 = LAM. Pin = Cosmos.
  • Sensor stack: Camera + mic + speaker. No environmental sensors. No expansion.
  • Privacy model: Cloud only. R1 had hardcoded API key leak.
  • Dev SDK: None
  • Verdict: Failed products

Home Assistant (Free / $65 for the Green hub)
  • Agent flexibility: Any LLM. OpenAI, Claude, Ollama. Full automation engine.
  • Sensor stack: Depends on third-party hardware. No unified sensor stack. Fragmented.
  • Privacy model: Local-first. Self-hosted. No cloud dependency.
  • Dev SDK: Full API + add-ons
  • Verdict: Right software, no hardware

Embodied (this) ($149-199 target)
  • Agent flexibility: Any agent. Claude, GPT, OpenClaw, custom. Full SDK.
  • Sensor stack: Camera + 4-mic array + speaker + mmWave radar + temp/humidity/AQ + LED ring. Expandable modules.
  • Privacy model: Hardware-enforced. Kill switches. Local-first. Edge compute.
  • Dev SDK: Full open SDK
  • Verdict: The missing piece

The pattern is obvious: every product either has good hardware with a locked agent or agent flexibility with no hardware. Nobody has built an open, sensor-rich hardware platform that any AI agent can plug into.

The Product

A family of hardware devices -- starting with a single hub -- that serve as the physical body for any AI agent. The hardware is open. The SDK is open. The agent is whatever you want it to be.

Core Device: The Puck

A compact, desk/shelf-mountable hub (~palm-sized) containing the essential sensory stack for physical-world AI perception.

Concept render: The Puck -- a palm-sized hub with camera, mic array, speaker, and ambient sensors.

Perception

  • Wide-angle camera -- 160° FOV, 1080p, low-light capable
  • Microphone array -- 4-mic beamforming, directional audio, far-field pickup (5m+), noise cancellation
  • Ambient sensors -- temperature, humidity, ambient light, air quality (VOC/CO2)
  • Presence detection -- mmWave radar for occupancy without camera (privacy-preserving)

Output & Compute

  • Speaker -- full-range driver for voice, alerts, ambient audio
  • LED ring -- programmable RGB for status, attention, ambient indicators
  • Edge SoC -- Qualcomm QCS8550 class. Local ASR, basic vision, wake-word. 64GB+ storage for on-device model weights
  • Connectivity -- Wi-Fi 6E, BLE 5.3, Thread/Matter, USB-C

Expansion: Sensor Modules

Wireless add-on modules that extend the Puck's sensory range. Each module pairs over BLE and auto-registers capabilities with the SDK.
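One way the "auto-registers capabilities" step could look on the SDK side, as a rough sketch. The BLE pairing itself is out of scope here; `Capability`, `ModuleRegistry`, and the module IDs are hypothetical names invented for illustration, not a published API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Capability:
    name: str   # e.g. "temperature"
    unit: str   # e.g. "celsius"


@dataclass
class ModuleRegistry:
    """Hypothetical registry an SDK might keep of paired sensor modules."""
    modules: dict[str, tuple[Capability, ...]] = field(default_factory=dict)

    def register(self, module_id: str, capabilities: list[Capability]) -> None:
        # Called once a module has paired and announced what it can sense.
        self.modules[module_id] = tuple(capabilities)

    def unregister(self, module_id: str) -> None:
        self.modules.pop(module_id, None)

    def providers(self, capability_name: str) -> list[str]:
        """IDs of all modules that offer the named capability."""
        return sorted(
            module_id
            for module_id, caps in self.modules.items()
            if any(c.name == capability_name for c in caps)
        )


registry = ModuleRegistry()
registry.register("door-contact-01", [Capability("contact", "open/closed")])
registry.register(
    "climate-02",
    [Capability("temperature", "celsius"), Capability("humidity", "percent")],
)
print(registry.providers("temperature"))
```

An agent would then ask the registry which modules can answer a question, rather than hard-coding device names.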

Wearable Form Factor (V2)

Glasses or clip-on form factor giving agents first-person visual perspective + always-on audio. Think Meta Ray-Bans but agent-agnostic. V1 focuses on the stationary hub -- the wearable is a harder problem (battery, thermals, optics) and benefits from proving the SDK first.

Use Cases

Each of these is impossible for a disembodied agent. That's the bar -- if an agent could do it through a chat window, it doesn't belong here.

Use Case 1

The Kitchen Agent

Camera and environmental sensors in the kitchen. Agent watches while you cook.

What the agent gains from embodiment
  • Sees you pull out chicken, garlic, and lemons. Proactively suggests a recipe without being asked -- because it saw the ingredients.
  • Hears the timer go off in the other room when you've stepped away. Calls out "your oven timer just went off" through the speaker.
  • Detects air quality dropping (smoke/VOC spike). Says "you might want to open a window" before the smoke alarm triggers.
  • You hold up a spice jar you can't read and say "what is this?" Agent reads it visually.

Without embodiment: The agent would need you to type "I'm cooking chicken with garlic and lemons, what should I make?" The proactive, ambient nature disappears entirely. You'd never think to tell the agent the air quality is dropping -- you don't know it yourself.
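The air-quality bullet above can be sketched as a rolling-baseline check: compare each VOC reading against the recent average and flag a sharp jump. The window size and the 1.5x ratio are placeholder assumptions, not tuned values, and `SpikeDetector` is a hypothetical name.

```python
from collections import deque
from statistics import mean


class SpikeDetector:
    """Flags readings that exceed a rolling baseline by a fixed ratio."""

    def __init__(self, window: int = 30, ratio: float = 1.5):
        self.history = deque(maxlen=window)
        self.ratio = ratio

    def update(self, reading: float) -> bool:
        """Return True when `reading` is a spike relative to recent history."""
        baseline_ready = len(self.history) == self.history.maxlen
        spike = baseline_ready and reading > self.ratio * mean(self.history)
        if not spike:
            # Keep spikes out of the baseline so a sustained event stays flagged.
            self.history.append(reading)
        return spike


detector = SpikeDetector(window=3, ratio=1.5)
for voc in [100, 100, 100, 120, 200]:
    if detector.update(voc):
        print(f"VOC spike at {voc}: you might want to open a window")
```

A production version would likely also debounce repeated alerts and account for sensor warm-up, both of which are omitted here.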

Use Case 2

The Developer Desk

Camera pointed at your workspace. Mic always listening (wake-word activated). Presence detection tracks focus time.

What the agent gains from embodiment
  • After a meeting ends, it heard the discussion and says: "You committed to sending the API spec to Jake by Thursday. Want me to draft it?"
  • You've been at your desk for 4 hours without standing (presence sensor, no camera needed). Gentle audio nudge: "You've been going for a while."
  • You're debugging and say "what the hell is this error" while looking at your screen. The agent sees the error via camera and tells you the fix through the speaker. No copy-paste, no screenshots, no context-switching.
  • Detects you're on a call (mic pattern) and silently holds all non-urgent notifications.

Without embodiment: The agent doesn't know you're in a meeting, can't see your screen, doesn't know you've been sitting for hours. You'd have to switch tabs, paste the error, describe the context. The friction means you just Google it instead.
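The sitting-too-long nudge needs nothing more than presence samples from the radar. A minimal sketch, assuming the sensor reports a boolean "someone is at the desk" at regular intervals; the class name and the two-minute break threshold are assumptions made for illustration.

```python
class SedentaryTimer:
    """Tracks continuous desk presence and fires one nudge per sitting."""

    def __init__(self, limit_s: float = 4 * 3600, break_s: float = 120):
        self.limit_s = limit_s    # how long before nudging (4 hours in the text)
        self.break_s = break_s    # absence long enough to count as a real break
        self.seated_since = None
        self.absent_since = None
        self.nudged = False

    def update(self, now: float, present: bool) -> bool:
        """Feed one presence sample; returns True when a nudge is due."""
        if present:
            self.absent_since = None
            if self.seated_since is None:
                self.seated_since = now
            if not self.nudged and now - self.seated_since >= self.limit_s:
                self.nudged = True
                return True
            return False
        if self.absent_since is None:
            self.absent_since = now
        if now - self.absent_since >= self.break_s:
            # The user took a real break: start counting from scratch.
            self.seated_since = None
            self.nudged = False
        return False
```

Short absences (grabbing something off a shelf) do not reset the timer; only an absence of at least `break_s` seconds does.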

The Workshop Companion: when your hands are covered in sawdust, a chat window isn't an option.

Use Case 3

The Workshop Companion

Ruggedized camera module in a garage, studio, or workshop. Mic for hands-free interaction.

What the agent gains from embodiment
  • You're assembling furniture from a manual. Agent watches via camera, tracks which step you're on, warns you before you skip a step or install something backwards.
  • "What gauge is this wire?" -- you hold it up, the agent measures it visually against known references.
  • Soldering a circuit board: "Is this joint good?" Agent examines it through the camera, identifies a cold solder joint.
  • For musicians: listens to your practice, notices you keep stumbling on the same passage. "Bars 12-16 seem to be giving you trouble -- want to try it at half speed?"

Without embodiment: Hands are occupied. You're covered in sawdust, solder, or paint. Picking up a phone isn't just inconvenient -- it breaks your flow and sometimes isn't physically possible. Embodiment is highest-value when the user's hands are busy.

Use Case 4

The Home Awareness Layer

Multiple sensor nodes across a home. Presence detection, environmental sensors, door/window contacts. Camera optional per room.

What the agent gains from embodiment
  • Knows which rooms are occupied. Routes audio to the right room -- say "play that podcast" from the bedroom and it plays there, not the kitchen.
  • Detects you left a window open and it's about to rain (weather sensor + window contact). Alerts you.
  • Notices the elderly parent who usually makes coffee at 7am hasn't triggered any presence sensors by 9am. Sends a check-in alert.
  • You're leaving and say "am I good?" Agent checks: doors locked, windows closed, stove off, thermostat set back, lights off. "All good."

Without embodiment: The agent has zero knowledge of your home's physical state. Can't know which doors are open, who's home, whether the stove is on. You check everything yourself -- which is exactly what you do today.
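The "am I good?" check reduces to comparing observed home state against an expected departure state. The sketch below assumes sensor nodes have already been aggregated into a simple name-to-state dictionary; the keys and state strings are hypothetical.

```python
# Hypothetical expected state when leaving, mirroring the checklist in the text.
EXPECTED_ON_DEPARTURE = {
    "front_door": "locked",
    "windows": "closed",
    "stove": "off",
    "thermostat": "set back",
    "lights": "off",
}


def departure_check(state: dict[str, str],
                    expected: dict[str, str] = EXPECTED_ON_DEPARTURE) -> str:
    """Return a spoken-style summary of anything that is not as expected."""
    problems = [
        f"{name} is {state.get(name, 'unknown')}"
        for name, wanted in expected.items()
        if state.get(name) != wanted
    ]
    if not problems:
        return "All good."
    return "Not yet: " + "; ".join(problems) + "."


home = dict(EXPECTED_ON_DEPARTURE, stove="on")
print(departure_check(home))
```

Treating a missing reading as "unknown" rather than as fine is deliberate: a sensor that has gone silent should not produce an "All good."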

Use Case 5

The Accessibility Agent

Full sensory stack optimized for users with visual impairments, mobility limitations, or cognitive disabilities. This is where embodiment goes from nice-to-have to the entire value proposition.

What the agent gains from embodiment
  • Visually impaired user holds up mail: "What is this?" Agent reads it aloud, identifies it as a medical bill, offers to pay it.
  • Camera-based spatial navigation: "The chair is about 3 feet to your left."
  • Medication reminders that confirm via presence detection -- not just "time for your meds" on a schedule, but verifying you actually got up and went to the medication. If no movement in 10 minutes, reminds again.
  • Fall detection via mmWave radar (no camera -- privacy preserving). Auto-alert to emergency contacts if no movement detected for 60 seconds after a fall pattern.

Without embodiment: These users often can't easily use phones, tablets, or keyboards. The voice + vision interface IS the product. There's no "just use the chat app" fallback.
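The fall-detection bullet describes a small state machine: a fall pattern arms a timer, any movement disarms it, and 60 seconds of stillness triggers the alert. A sketch under the assumption that the radar pipeline already emits "fall pattern" and "movement" events; `FallMonitor` and its method names are hypothetical.

```python
class FallMonitor:
    """Escalates only if a detected fall is followed by sustained stillness."""

    def __init__(self, grace_s: float = 60.0):
        self.grace_s = grace_s   # 60 seconds, per the text
        self.fall_at = None      # timestamp of the unresolved fall, if any

    def on_fall_pattern(self, now: float) -> None:
        # Keep the earliest unresolved fall so repeated detections don't reset the clock.
        if self.fall_at is None:
            self.fall_at = now

    def on_movement(self, now: float) -> None:
        # Movement after a fall suggests the person is responsive.
        self.fall_at = None

    def should_alert(self, now: float) -> bool:
        return self.fall_at is not None and now - self.fall_at >= self.grace_s


monitor = FallMonitor()
monitor.on_fall_pattern(now=0.0)
print(monitor.should_alert(now=30.0))  # still inside the grace period
print(monitor.should_alert(now=60.0))  # no movement for 60 s: alert contacts
```

Whether any movement should cancel the alert, or only movement consistent with getting up, is a safety decision this sketch does not settle.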

Use Case 6

Security and Property

Multi-node deployment across a property. Each unit runs a security-focused agent with camera and presence sensing. All processing local.

What the agent gains from embodiment
  • Person at front door. Agent identifies via on-device face recognition. Known: unlocks the door. Unknown: asks "Can I help you?" and streams to your phone.
  • Multi-node coordination: "Person detected at front door, now moving toward the side gate." Tracks across camera zones.
  • Distinguishes person from animal from falling branch. No false alarms at 3am because a raccoon walked past.
  • You say "I'm leaving for a week." Agent switches to away mode: tighter thresholds, automated light cycling for simulated occupancy, daily summaries to your phone.

Without embodiment: A disembodied agent can read your Ring camera's notification -- but it can't decide whether to unlock the door, track someone across angles, or reason about a scene in real time. The intelligence layer is missing from existing security hardware.
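The front-door behavior can be expressed as a small policy function. This is a sketch only: the confidence score, the 0.98 unlock threshold, and the action names are all assumptions, and a real deployment would want a far more conservative policy before unlocking anything automatically.

```python
from typing import Optional


def door_actions(match_confidence: Optional[float],
                 unlock_threshold: float = 0.98) -> list[str]:
    """Decide what to do when a person is detected at the front door.

    `match_confidence` is a hypothetical on-device face-match score in [0, 1],
    or None when no enrolled face matched at all.
    """
    if match_confidence is not None and match_confidence >= unlock_threshold:
        return ["unlock_door"]
    # Unknown or low-confidence visitor: greet and hand off to the owner.
    return ["say:Can I help you?", "stream_to_phone"]


print(door_actions(0.99))
print(door_actions(0.80))
print(door_actions(None))
```

Low-confidence matches are treated exactly like strangers, so a borderline recognition never opens the door.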

Why This Wins

Agent-Agnostic

Every competitor locks hardware to one AI. We don't care which agent you run. Claude, GPT, OpenClaw, Ollama, your custom stack -- the hardware works with all of them through the SDK. When a better model ships next quarter, you swap the brain, not the body.
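"Swap the brain, not the body" suggests a narrow interface between hardware and agent. A minimal sketch using a Python `Protocol`; the `Agent` interface, the observation dictionary, and the two toy agents are hypothetical stand-ins, not the actual SDK.

```python
from typing import Protocol


class Agent(Protocol):
    """Anything that can turn an observation into a spoken response."""
    def handle(self, observation: dict) -> str: ...


class DoorAgent:
    def handle(self, observation: dict) -> str:
        return "Someone is at the door." if observation.get("presence") else ""


class VerboseAgent:
    def handle(self, observation: dict) -> str:
        return f"I observed: {sorted(observation)}"


class Puck:
    """The body: owns the sensors, delegates all reasoning to the agent."""

    def __init__(self, agent: Agent):
        self.agent = agent

    def swap_agent(self, agent: Agent) -> None:
        self.agent = agent

    def observe(self, observation: dict) -> str:
        return self.agent.handle(observation)


puck = Puck(DoorAgent())
event = {"presence": True, "zone": "front_door"}
print(puck.observe(event))
puck.swap_agent(VerboseAgent())   # same hardware, different brain
print(puck.observe(event))
```

Because `Agent` is a structural protocol, a wrapper around any hosted or local model qualifies as long as it exposes `handle`.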

Developer-First

The SDK is the product, not the device. Every agent developer in the world becomes a potential customer when their agent can gain a physical body through a standard API. We build for the people building agents, not just the people using them.

Privacy as Architecture

Hardware kill switches for camera and mic. On-device processing for wake-word, basic ASR, and presence. No cloud dependency for core sensing. In a world where 65% of consumers report discomfort with always-on devices, local-first isn't just ethical -- it's a competitive moat.

Open Ecosystem = Platform Flywheel

Open SDK means third-party sensor modules, community-built agent templates, shared perception models. More agents built for the platform → more hardware sold. More hardware deployed → more agents built. Home Assistant proved this flywheel works in software. We bring it to hardware.

The Smartphone Analogy

Before the iPhone, phones were vertically integrated: Nokia's hardware ran Nokia's software ran Nokia's apps. The iPhone (and later Android) separated the hardware platform from the application layer. Suddenly any developer could build for the hardware. The app ecosystem -- not the hardware itself -- became the value.

The shift: from talking at a speaker to ambient spatial intelligence.

AI agents are in the "Nokia era" right now. Alexa's hardware runs Alexa's agent. Meta's glasses run Meta's agent. Google's hub runs Google's agent. Each vendor's intelligence is welded to their body.

But the intelligence is commoditizing. Open-source models close the gap with frontier models every quarter. Claude, GPT, Gemini, Llama, Mistral -- the brain is becoming interchangeable. What's not interchangeable is the body. A camera in your kitchen. A mic array in your workshop. A presence sensor in every room. The physical layer is the scarce resource.

We're building the platform that separates the body from the brain -- so any brain can have a body.

Go-to-Market

Phase 0: Dev Kit
Reference board + SDK alpha. Raspberry Pi CM4/5 carrier with mic array, camera module, speaker. Open-source hardware design files. Target: 500+ developers building embodied agents. Iterate APIs on real usage.
Phase 1: Puck v1
Consumer-grade hardware. Industrial design, production SoC, optimized sensor stack. $149-199 price point. First 10K units. Establish retail presence and reviews.
Phase 2: Sensor Ecosystem
Expansion modules + open third-party hardware spec. Revenue from first-party modules + certification for third-party. Platform lock-in through ecosystem breadth.
Phase 3: Wearable
Clip-on or glasses form factor. First-person camera + always-on mic + bone conduction speaker. $299-399 premium SKU. Agents that move with you.

Technical Risks & Open Questions

Edge vs. Cloud Latency

On-device ASR (Whisper tiny/small) is fast but lower quality. On-device vision is limited. How much must go to the cloud? How do we preserve privacy guarantees for cloud-processed requests?

Power Budget

Always-on sensing drains power. Stationary Puck is wall-powered (solved). Wearable is the hard problem -- probably 4-6 hours active with current battery tech.

Multi-Agent UX

When three agents run on your Puck, how does the user know which is talking? Which is listening? LED ring helps but the interaction model needs careful design.
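One possible starting point, sketched under assumptions: only one agent holds the speaker at a time, and the LED ring shows that agent's assigned color. `SpeakerArbiter`, the agent IDs, and the color mapping are hypothetical; this does not resolve the open design question, it only makes it concrete.

```python
from typing import Optional


class SpeakerArbiter:
    """Grants the speaker to one agent at a time and exposes an LED color."""

    def __init__(self, colors: dict[str, str]):
        self.colors = colors              # agent_id -> LED color
        self.active: Optional[str] = None

    def request(self, agent_id: str) -> bool:
        """Return True if `agent_id` may speak now."""
        if self.active is None:
            self.active = agent_id
        return self.active == agent_id

    def release(self, agent_id: str) -> None:
        if self.active == agent_id:
            self.active = None

    def led_color(self) -> str:
        if self.active is None:
            return "off"
        return self.colors.get(self.active, "white")


arbiter = SpeakerArbiter({"kitchen-agent": "amber", "security-agent": "red"})
print(arbiter.request("kitchen-agent"), arbiter.led_color())
print(arbiter.request("security-agent"))   # denied while the kitchen agent speaks
arbiter.release("kitchen-agent")
print(arbiter.request("security-agent"), arbiter.led_color())
```

A first-come lock is the simplest policy; priority preemption (a smoke alert interrupting a recipe) would need to be layered on top.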

Agent Quality Variance

Open platform means some agents will be bad. A poorly built agent that spams the speaker hurts the hardware brand. Need permissions + review without becoming an app store gatekeeper.
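On the permissions side, a per-agent rate limit on speaker output is one guard that needs no human review. The sketch below is a standard token bucket; the capacity and refill rate are placeholder values and `TokenBucket` is not part of any existing SDK.

```python
class TokenBucket:
    """Allows short bursts of speech but caps the sustained rate per agent."""

    def __init__(self, capacity: int, refill_per_s: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now: float) -> bool:
        """Return True and spend a token if the agent may speak at `now`."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Assumed policy: bursts of 2 utterances, then one every 10 seconds.
bucket = TokenBucket(capacity=2, refill_per_s=0.1)
for t in [0, 1, 2, 12]:
    print(t, bucket.allow(now=t))
```

Each agent would get its own bucket, so a noisy agent throttles itself without silencing the others.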

Agentic Loop Latency

Perceive (camera) → process (cloud LLM) → act (speaker). Round trip: 2-5 seconds. Too slow for conversation. Must identify which loops stay on-device vs. which tolerate latency.
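Deciding which loops stay on-device can start as a comparison between each task's latency budget and the worst-case cloud round trip. Only the 2-5 second round trip comes from the text; the task names and per-task budgets below are illustrative assumptions.

```python
CLOUD_ROUND_TRIP_S = (2.0, 5.0)   # best and worst case, from the text

# Hypothetical latency budgets per task, in seconds.
LATENCY_BUDGET_S = {
    "wake_word": 0.2,
    "conversation_turn": 0.5,
    "scene_question": 5.0,
    "daily_summary": 60.0,
}


def route(task: str,
          budgets: dict[str, float] = LATENCY_BUDGET_S,
          cloud_worst_case_s: float = CLOUD_ROUND_TRIP_S[1]) -> str:
    """Send a task to the cloud only if it tolerates the worst-case round trip."""
    return "cloud" if budgets[task] >= cloud_worst_case_s else "on-device"


for task in LATENCY_BUDGET_S:
    print(f"{task}: {route(task)}")
```

Under these assumed budgets, wake-word and conversational turns stay local while scene questions and summaries can go to a cloud model.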

BOM at $149

Camera + 4-mic array + speaker + radar + environmental sensors + Wi-Fi 6E + BLE + SoC + 64GB. Tight but feasible at scale. Dev kit uses off-the-shelf at higher unit cost; production Puck benefits from custom PCB and volume.

What We're Really Building

"We build the physical body. You build the mind."

The embedded AI market is projected to reach $48.9 billion by 2034, growing at 17.5% CAGR.[15] Edge AI hardware alone is projected to hit $36.4 billion by 2033.[16] The consumer AI autonomous agent market -- devices that orchestrate cross-device interactions -- is projected to grow from $458M in 2026 to $833M by 2032.[17]

But none of these projections account for what happens when the body and the brain are decoupled -- when any developer building an AI agent can give it eyes, ears, and spatial awareness through a $149 device with a standard SDK. That's not an incremental market. It's a platform shift.

The intelligence is commoditizing. The physical layer is the moat. We own the body.

Citations

[1] Emarketer. "Google Assistant leads U.S. voice assistant adoption... Amazon's $10 billion loss from Alexa's decline." emarketer.com

[2] ElectroIQ. "35% of U.S. adults own smart speakers... global market reached $21.4B in 2025, 156M units shipped." electroiq.com; US Smart Speaker Market Research (MRFR) reports 40% household ownership.

[3] Digital Trends. "75% of smart speaker users primarily use them for music... 43% have made purchases." digitaltrends.com

[4] XDA Developers. "Alexa+ and Google Home's AI didn't revolutionize anything in 2025, but Home Assistant did." xda-developers.com

[5] MacRumors. "Returns of Humane AI Pin outpacing sales: $9 million in sales, $1 million in returns, 7,000 units unsold." macrumors.com

[6] Tom's Guide. "The Rabbit R1 is one of the worst gadgets I've ever reviewed." tomsguide.com

[7] SafeWise. "Rabbit R1 security flaw: hardcoded API keys allowing unauthorized access to user data." safewise.com

[8] Digital Applied. "AI Product Failures 2026: Sora, Humane & Rabbit R1 -- combined $5 billion+ in losses within 12 months." digitalapplied.com

[9] UploadVR / Entrepreneur. "Ray-Ban Meta glasses sales tripled year-over-year; 2 million units sold as of 2025." uploadvr.com

[10] Reuters. "Meta and EssilorLuxottica considering doubling production to 20M units annually; 10M-pair target by late 2026." reuters.com

[11] TechCrunch. "Meta's smart glasses outsell traditional Ray-Bans in 60% of European, Middle Eastern, and African stores." techcrunch.com

[12] Counterpoint Research. "Global smart glasses shipments surged 110% YoY in H1 2025, Meta capturing 73% market share." counterpointresearch.com

[13] Home Assistant. "Open-source home automation platform, 1,000+ device integrations, local-first." home-assistant.io

[14] Home Assistant. "2025.8: The Summer of AI -- AI Tasks for structured task delegation." home-assistant.io

[15] Fortune Business Insights. "Global embedded AI market projected to grow from $13.49B in 2026 to $48.90B by 2034 at 17.5% CAGR." fortunebusinessinsights.com

[16] LinkedIn / Market Research. "Edge AI Hardware Market projected to grow from $12.5B in 2024 to $36.4B by 2033 at 13.4% CAGR." linkedin.com

[17] Research and Markets. "Consumer Electronics AI Autonomous Agent Market -- $458.46M in 2026 to $833.21M by 2032." researchandmarkets.com

[18] MarketsandMarkets. "Global embodied AI market projected to grow from $4.44B in 2025 to $23.06B by 2030 at 39.0% CAGR." marketsandmarkets.com
