Voice AI Platform · Meta Ray-Ban

Audio Notes

3-Month Summary — April to June 2026
123 recordings · 13 recording days · 198 action items · Apr 16 – Jun 10 · Captured on Ray-Ban Meta

From Prototype to Beta: The Voice AI Sprint

Over three months, the Voice AI Platform team drove Meta Ray-Ban smart glasses from early Live Notes prototyping through dogfooding milestones and into beta launch readiness. The arc began in mid-April with foundational product decisions—how to navigate wiretapping laws with a manual-upload flow, how to make "Hey Meta, start live notes" feel seamless—and ended in June with a major strategic decision: launching the beta on schedule despite recommendation relevance sitting at 72%, thirteen points below the 85% target.

The defining moment came during the first week of June: an intense three-day sprint (June 2–4) that produced 83 of the 123 total recordings. The team debated the beta launch question from every angle, weighing Saurabh's quality concerns against Pooja's launch-window argument, and arrived at a pragmatic compromise: geofence the beta to major cities for the first two weeks while silently deploying improved model weights. This decision preserved market timing while protecting user trust by limiting exposure to locations where the model performs well.

In parallel, the team built the speaker identification pipeline from scratch—collecting voice embeddings, debugging diarization in live meetings, designing per-user voice libraries with Manifold persistence—while simultaneously investigating a Live Notes outage, pushing TE reliability from 75% toward 86%, and planning the M1/M2 milestone roadmap that will carry the product into 2027.


3 Months at a Glance

  • 123 recordings
  • 13 active days
  • 198 action items
  • 83 notes in Jun 2–4
  • 10+ key speakers
  • 86% TE reliability

Phase by Phase

Phase 1 · Product Foundation
April 16–17 · 21 sessions

The team shaped the core Live Notes experience through two days of intensive testing and product discussions. The central challenge: wiretapping laws prevent live-streaming audio to a server, so the team designed a manual-upload mitigation where audio is stored on-device and uploaded by the user post-session.

  • Defined the product flow: "Hey Meta, start live notes" → on-device recording → manual upload → AI summary + action items
  • Tested iOS and Android builds — iOS working, Android hitting issues after recent changes
  • Prepared demo logistics for the "Better Together" room with Lawrence from MG
  • Identified intermittent ASR slowness on iOS (suspected client-side)
  • Voice command testing for "stop live notes" revealed intermittent reliability and sequential-stop handling edge cases
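The flow in the first bullet can be read as a small state machine in which audio leaves the device only after an explicit user action. The following is a minimal Python sketch under that reading; the state names, event names, and helper functions are hypothetical and do not come from the actual codebase.

```python
from enum import Enum, auto


class NoteState(Enum):
    IDLE = auto()
    RECORDING = auto()    # audio is written to on-device storage only
    STORED = auto()       # session ended; audio waits for the user
    UPLOADED = auto()     # the user explicitly chose to upload
    SUMMARIZED = auto()   # summary and action items are available


# Allowed transitions keyed by (current state, event). Event names are hypothetical.
TRANSITIONS = {
    (NoteState.IDLE, "start_live_notes"): NoteState.RECORDING,
    (NoteState.RECORDING, "stop_live_notes"): NoteState.STORED,
    (NoteState.STORED, "user_upload"): NoteState.UPLOADED,
    (NoteState.UPLOADED, "summary_ready"): NoteState.SUMMARIZED,
}


def step(state: NoteState, event: str) -> NoteState:
    """Return the next state, or raise ValueError if the event is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state.name}") from None


def run(events: list[str]) -> NoteState:
    """Apply a sequence of events starting from IDLE and return the final state."""
    state = NoteState.IDLE
    for event in events:
        state = step(state, event)
    return state
```

In this sketch, a second "stop live notes" after the session has already stopped raises an error. That is one possible way to surface the sequential-stop edge case noted above; silently ignoring the repeat would be an equally valid choice.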
Tags: Live Notes · Wiretapping Mitigation · Manual Upload · iOS/Android Testing · Demo Prep

Phase 2 · Quality & Data Pipeline
April 23 · 3 sessions · 17 action items

Focus shifted to measurement and data quality. The team confronted poor speaker distribution metrics beyond three speakers and decided to build a golden dataset for annotation. Model performance comparisons between Gemini and Avocado showed similar results, but reliability metrics were falling short.

  • Created golden dataset strategy for speaker distribution quality improvement
  • Addressed FLM routing issues and training data gaps
  • Discussed audio upload UX: batch processing, no auto-deletion, title/summary generation
  • Identified timestamp issues and summarization failures needing fixes
  • Started ramping up TE debugging strategy
Tags: Speaker Distribution · Golden Dataset · Model Evaluation · Audio Upload UX

Phase 3 · DF7 Planning & Stakeholder Management
April 29 – May 4 · 7 sessions · 43 action items

The team navigated dual pressures: a tight DF7 timeline (code complete by 5/12, dogfood by 5/19) and stakeholder skepticism about product quality. A comprehensive reliability document was commissioned to address concerns, while technical limitations of the 9B model's 16K context window threatened summary quality for long transcripts.

  • DF7 timeline set: ~10 days of coding before stabilization and dogfooding
  • 9B model context window (16K) identified as risk for large transcripts
  • Audio reliability solution: phone-based storage with retry mechanism, in-memory caching on Nemo
  • E2E tests: Android partially ready, iOS blocked by missing test harness
  • Stakeholder one-pager commissioned: datasets, evaluation methods, historical issue resolution
  • Dogfooding bar defined as "good enough for signal" — less stringent than DF6 due to new infrastructure
"We have to work with the possibility that some of these things may take longer… We only have maximum 10 days for coding, and after that we will have to go stabilise."
Tags: DF7 Timeline · 9B Model · Stakeholder Document · E2E Testing · Reliability

Phase 4 · Ship v0 & Architecture Deep Dive
May 27–31 · 8 sessions · 10 action items

After weeks away, the team made a decisive call: stop debating product questions and ship the existing end-to-end v0 immediately, even with rough design. Meanwhile, a deep dive into OpenAI's WebRTC architecture (for ChatGPT Voice and Realtime API) informed the team's own real-time audio infrastructure decisions.

  • Ship decision: Consensus to push v0 to group chat for dogfooding immediately — "strip non-essential UI, test the code that's already written"
  • Live Notes status review with Anne: per a product decision, live transcripts are shown despite poor ASR quality; summary quality is prioritized over semantic endpointing
  • Session success rate: 75% of sessions produced a summary, 5 points below the 80% target. Logging gaps identified: the Falco logger tracks E2E sessions, but that data is not in the dashboard
  • Reliability fixes landed in build B276: RP connection drops, Wi-Fi/cellular switching, VIP fallback
  • WebRTC architecture study: OpenAI's relay-plus-transceiver design, stateless relay forwarding to stateful transceivers, single Go service for signaling + media termination
"The team agreed to stop debating product questions and ship the existing end-to-end v0 as soon as possible, even with a rough design."
Tags: Ship v0 · WebRTC Architecture · Reliability: 75% · Build B276 · OpenAI Study

Phase 5 · The Big Push
June 2–4 · 83 sessions · 120 action items

The defining three days of the quarter. The team tackled beta launch readiness, the speaker identification pipeline, a Live Notes outage, and multiple concurrent workstreams at a pace that produced two-thirds of all recordings in the entire period.

The Beta Launch Debate

The dominant thread: recommendation relevance at 72% versus the 85% target. This was debated across dozens of sessions, with Saurabh advocating for a delay to protect user trust, and Pooja pushing to maintain the launch window with a 91% itinerary satisfaction score as evidence the core product works.

The compromise: Launch on schedule with a major-cities-only geofence for the first two weeks. Saurabh's team silently deploys improved model weights while Pooja implements the geofence and UI restriction. This preserved market timing while constraining exposure to regions where the model performs well.

Speaker Identification Build

Parallel to the launch debate, the team built and debugged the speaker ID pipeline in live sessions:

  • Collected voice embeddings from team members (Colin, Alex, Saurabh, Pooja, Dan)
  • Debugged misidentification issues — system detected correct speaker count but tagged wrong names
  • Designed per-user voice library: centroid + rolling sample embeddings, wearer vs bystander distinction
  • Agreed on in-memory storage with Manifold persistence for ~50 tags per user
  • DF4 integrity filtering planned, date moved from June 11 to June 16
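The per-user voice library described above (a centroid plus rolling sample embeddings, with a wearer flag and roughly 50 tags per user) might be structured along the following lines. This is a sketch under stated assumptions, not the team's implementation: the class names, the rolling-window size, the cosine-similarity threshold, and the choice to compute the centroid over the retained samples are all hypothetical. Manifold persistence is omitted.

```python
import math
from collections import deque

MAX_TAGS_PER_USER = 50   # from the notes: roughly 50 tags per user
ROLLING_SAMPLES = 5      # assumption: number of recent embeddings kept per speaker
MATCH_THRESHOLD = 0.75   # assumption: minimum cosine similarity to accept a match


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VoiceProfile:
    def __init__(self, name, is_wearer=False):
        self.name = name
        self.is_wearer = is_wearer            # wearer vs bystander distinction
        self.samples = deque(maxlen=ROLLING_SAMPLES)
        self.centroid = None

    def add_sample(self, embedding):
        """Keep the newest samples and recompute the centroid over them."""
        self.samples.append(list(embedding))
        n = len(self.samples)
        self.centroid = [sum(s[i] for s in self.samples) / n
                         for i in range(len(embedding))]


class VoiceLibrary:
    """In-memory library for one user; persistence is out of scope here."""

    def __init__(self):
        self.profiles = {}

    def tag(self, name, embedding, is_wearer=False):
        if name not in self.profiles:
            if len(self.profiles) >= MAX_TAGS_PER_USER:
                raise ValueError("tag limit reached")
            self.profiles[name] = VoiceProfile(name, is_wearer)
        self.profiles[name].add_sample(embedding)

    def identify(self, embedding):
        """Return the best-matching tagged name, or None if nothing clears the threshold."""
        best_name, best_score = None, MATCH_THRESHOLD
        for profile in self.profiles.values():
            score = cosine(embedding, profile.centroid)
            if score >= best_score:
                best_name, best_score = profile.name, score
        return best_name
```

Returning `None` below the threshold is a deliberate choice in this sketch: given the misidentification issue noted above (correct speaker count, wrong names), an unlabeled speaker is arguably less harmful than a wrong label.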

Live Notes Outage & Reliability

A Live Notes outage starting May 31 was traced to a Sunday routing change with secondary effects on SMC Service Discovery tiers. Reverting didn't immediately fix it. Final summaries failed ~50% of the time.

  • Self/other diarization fix: training data merge had caused self speech to flip to "other" in silence regions — corrected with retrained model and updated VAD
  • TE reliability reached 86.11% with two failure modes remaining

Connect Launch Recap

Alex presented the successful Connect keynote: Mark's live handwriting demo at 30 wpm, Vanguards on the skate ramp, and DC Rainmaker's review titled "finally smart glasses that don't suck." Glasses are tracking to 4–5M units for 2025 (below the 10M target but 3x the prior year).

Tags: Beta Launch Decision · 72% vs 85% Relevance · Geofence Compromise · Speaker ID Pipeline · Live Notes Outage · TE Reliability 86% · Connect Recap

Phase 6 · Looking Ahead
June 10 · 1 session

The most recent recording maps out the M1/M2 milestone roadmap with Alex, setting the direction that carries the product into 2027.

  • Milestone 1: Mirror 2.0 architecture — on-device handling for forms, everything else routed through Hatch
  • Milestone 2: Open questions around Instagram and WhatsApp messaging integrations, privacy-versus-utility tradeoffs
  • 2027 Decision: Hatch-only vs hybrid routing — informed by M2 learnings on messaging and privacy
Tags: M1/M2 Roadmap · IG/WhatsApp Integration · Privacy vs Utility · 2027 Strategy

The Major Threads

Beta Launch: The 72% vs 85% Debate

The single most discussed topic across all recordings. Recommendation relevance sat at 72% against an 85% target, creating a genuine strategic tension between quality and timing.

Saurabh's position: Launching below target risks early negative reviews and lost user trust, especially for niche destinations where the model struggles. Proposed a two-week delay to integrate augmented training data.

Pooja's position: The core itinerary experience shows 91% user satisfaction. A clear launch window exists. Launch-blocking reliability issues (budget overflow bug, stale hotel price caching) are higher priority than incremental relevance gains.

Resolution: Launch on schedule with a temporary geofence restricting the beta to major cities for the first two weeks. Saurabh's team packages improved model weights for silent deployment. Pooja specs out the geofencing logic and temporary UI flag.
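A geofence of this kind is often implemented as a distance check against an allowlist of city centers. The sketch below assumes that approach; the city list, the 50 km radius, and the function names are placeholders, since the notes specify neither the cities nor how the restriction is evaluated.

```python
import math

# Hypothetical allowlist of (latitude, longitude) city centers.
MAJOR_CITIES = {
    "New York": (40.7128, -74.0060),
    "London": (51.5074, -0.1278),
    "Tokyo": (35.6762, 139.6503),
}
RADIUS_KM = 50.0          # assumption: the notes give no radius
EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_beta_geofence(lat, lon):
    """True if the point lies within RADIUS_KM of any allowlisted city center."""
    return any(
        haversine_km(lat, lon, city_lat, city_lon) <= RADIUS_KM
        for city_lat, city_lon in MAJOR_CITIES.values()
    )
```

Because the restriction is planned for only the first two weeks, keeping the city list and radius in configuration rather than in code would let the fence be widened or removed without a new build.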

Speaker Identification & Diarization

A technical thread that ran from April through June, evolving from basic measurement problems to a working prototype:

  • April: Poor speaker distribution metrics beyond 3 speakers; golden dataset commissioned
  • May: Diarization only working in factory settings; iOS E2E tests blocked
  • June 2–4: Live debugging sessions with 6+ test recordings; embedding system built with centroid + rolling samples; misidentification issues traced to noisy environments and incorrect tagging
  • Storage: Lightweight in-memory with Manifold persistence for ~50 tags per user, after Warm Storage stopped onboarding
  • Self/other fix: Training data merge caused self speech to flip in silence regions — retrained model + updated VAD resolved it

Live Notes Reliability

The team tracked reliability from 75% to 86%+ over the quarter:

  • Session success rate tracking gaps: logging only started after Nemo server connection, missing client-side failures
  • May 31 outage: routing change → secondary effects on SMC Service Discovery → final summaries failing ~50%
  • Build B276 fixes: RP connection drops, Wi-Fi/cellular switching, VIP fallback
  • TE reliability reached 86.11% by early June with two failure modes remaining
  • ASR quality: product decided to show live transcripts despite poor performance; summary quality prioritized over semantic endpointing
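The logging gap in the first bullet matters because a rate computed only over server-logged sessions ignores sessions that failed before reaching the server. The short illustration below uses invented counts (not measurements from the team's dashboards) to show the size of the effect.

```python
def success_rate(successes, sessions):
    """Fraction of sessions that succeeded (0.0 when there are no sessions)."""
    return successes / sessions if sessions else 0.0


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
server_logged_sessions = 80   # sessions that reached the server and were logged
sessions_with_summary = 60    # logged sessions that ended with a summary
client_side_failures = 20     # sessions that failed before logging began

observed = success_rate(sessions_with_summary, server_logged_sessions)
actual = success_rate(sessions_with_summary,
                      server_logged_sessions + client_side_failures)

print(f"observed: {observed:.0%}, including client-side failures: {actual:.0%}")
```

With these made-up numbers, the dashboard figure would read 75% while the rate over all attempted sessions would be 60%, which is why closing the client-side logging gap comes before trusting the headline number.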

Dogfooding Milestones (DF6 → DF7)

The team managed aggressive dogfooding timelines while balancing quality expectations:

  • DF6: Established quality bar; poor logging skewed reliability numbers despite product working ("vibe check" exit criteria)
  • DF7: Code complete target 5/12, dogfood 5/19; acknowledged exit criteria won't be as clean as DF6
  • DF4 (Integrity): Filtering work originally planned for June 11, moved to June 16; needed alignment from Alyssa's team
  • Ambient audio dogfooding set for June 17 with Nemo path on June 9
  • Key shift: "Stop debating, ship v0" — consensus to prioritize getting signal over polish

Architecture & Infrastructure

Several architectural decisions shaped the platform's direction:

  • Wiretapping mitigation: Manual upload flow as legal-safe path; on-device audio storage with phone-based retry mechanism
  • WebRTC study: Deep dive into OpenAI's ChatGPT Voice architecture — relay-plus-transceiver design, Go service handling signaling + media termination, one-port-per-session tradeoffs
  • M1/M2 roadmap: M1 mirrors 2.0 with on-device forms handling; M2 explores Instagram/WhatsApp integrations with privacy-utility tradeoffs informing 2027 Hatch-only vs hybrid routing decision
  • Open server adoption: Team discussed operating more as a backend team while supporting experience team on first-party apps
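The phone-based retry mechanism in the first bullet could look roughly like the following. This is a sketch only: `upload_with_retry`, its parameters, and the exponential backoff schedule are assumptions, and the notes do not describe the real retry policy.

```python
import time


def upload_with_retry(upload, audio_path, max_attempts=4, base_delay_s=1.0,
                      sleep=time.sleep):
    """Call upload(audio_path) until it succeeds or attempts run out.

    The stored file is never deleted here, so a failed upload can be retried
    later from the copy kept on the phone.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return upload(audio_path)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up; the caller can retry the stored file later
            # Exponential backoff: 1s, 2s, 4s, ... between attempts.
            sleep(base_delay_s * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the function testable without real delays. Retrying only on `ConnectionError` is also an assumption; a real client would likely distinguish more failure types.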

Key Deliverables

Consolidated and deduplicated from 198 raw action items across all sessions. Grouped by workstream.

Beta Launch Execution

  • Pooja: Implement geofence and temporary UI flag to restrict beta to major cities only for first 2 weeks
    Assigned June 2–4 · Spec by EOD, implementation to follow
  • Saurabh: Package newly trained model weights for silent deployment within first week of beta launch
    Assigned June 2–4 · Start immediately with team
  • Team: Fix budget overflow bug (model suggests options exceeding stated budget)
    Assigned June 2 · Launch blocker
  • Team: Fix caching bug (stale hotel prices in ~10% of searches)
    Assigned June 2 · Launch blocker
  • Raunaq: Augment training data with curated travel blogs and local tourism content; updated weights by end of next week
    Assigned June 2

Speaker ID & Diarization

  • Colin: Assign embedding module update work to Jigar
    Assigned June 3
  • Alex: Check if improvements can be made to auto speaker code; ping group when ready
    Assigned June 3
  • Team: Get embeddings inside PE (legal won't allow putting it on open source)
    Assigned June 4
  • Colin: Release new VAD with fixed hyperparameters (by EOD June 3)
    Assigned June 3

Live Notes & Reliability

  • Pooja: Trace routing change that caused secondary effects on SMC tier and Service Discovery
    Assigned June 2
  • Colin: Ask internal team to use staging gear with more logging for remaining error buckets
    Assigned June 3
  • Raunaq: Follow up on contextual biasing work (blocked by design)
    Assigned May 27
  • Raunaq: Speak with Honey about iOS client-side work and getting Limitless team to help with reliability
    Assigned May 27

Documentation & Planning

  • Raunaq: Create half-pager: data model for pending/tagged audio snippets, insert/update flows
    Assigned June 3 · Send to team immediately
  • Colin & Raunaq: Wrap up and share doc with team (Ahmed requesting it)
    Assigned June 3 · By tomorrow
  • Raunaq: Re-engage legal team regarding data storage; consult with Nemo team re: Everstore
    Assigned April 29

The Team

Most frequently appearing speakers across all recordings.

  • Raunaq: Voice AI Platform Lead · Senior Staff Engineer (present in 80+ sessions)
  • Pooja: Product / UX · Beta launch, geofencing, UI (36 sessions)
  • Saurabh: ML / Model Quality · Recommendation relevance (28 sessions)
  • Colin: Speaker ID / Diarization · VAD, integrity filtering (17 sessions)
  • Alex: Platform & Integrations · Connect launch, Apple (12 sessions)
  • Dan: Engineering · Audio pipeline, code changes (12 sessions)
  • Jigar: Engineering · Embedding module updates (4 sessions)
  • Anne: Live Notes · ASR quality, transcript visibility (referenced in key reviews)
  • Brad: ML · 9B model experiments, segmentation (referenced in DF7 planning)
  • Mathias & Dan: Stakeholders · DF7 dogfooding bar alignment (referenced in stakeholder mgmt)

When You Recorded

A heatmap of recording intensity across the quarter. June 2–4 stands out as the most intense working period.

April: 5 · 16 · 3 · 4
May: 1 · 2 · 2 · 1 · 5
June: 21 · 24 · 38 · 1
Each value represents one recording day and gives its session volume.
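The per-day counts above can be cross-checked against the headline figures with a few lines of arithmetic. The only assumption in this sketch is that the values are listed chronologically, so that the first three June values correspond to June 2–4.

```python
# Session counts per recording day, copied from the heatmap values.
sessions_per_day = {
    "April": [5, 16, 3, 4],
    "May": [1, 2, 2, 1, 5],
    "June": [21, 24, 38, 1],
}

total_recordings = sum(sum(days) for days in sessions_per_day.values())
recording_days = sum(len(days) for days in sessions_per_day.values())
# Assumption: the first three June values are June 2, 3, and 4.
sprint_recordings = sum(sessions_per_day["June"][:3])
sprint_share = sprint_recordings / total_recordings

print(total_recordings, recording_days, sprint_recordings, f"{sprint_share:.0%}")
# prints: 123 13 83 67%
```

The totals match the summary: 123 recordings across 13 days, with the 83 June 2–4 recordings making up about two-thirds of the quarter.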