Audio Notes — 3-Month Summary (April

Executive Overview

From Prototype to Beta: The Voice AI Sprint

Over three months, the Voice AI Platform team drove Meta Ray-Ban smart glasses from early Live Notes prototyping through dogfooding milestones and into beta launch readiness. The arc began in mid-April with foundational product decisions—how to navigate wiretapping laws with a manual-upload flow, how to make "Hey Meta, start live notes" feel seamless—and ended in June with a major strategic decision: launching the beta on schedule despite recommendation relevance sitting at 72%, thirteen points below the 85% target.

The defining moment came during the first week of June, an intense three-day sprint (June 2–4) that produced 83 of the 123 total recordings. The team debated the beta launch question from every angle—Saurabh's quality concerns versus Pooja's launch-window argument—and arrived at a pragmatic compromise: geofence the beta to major cities for the first two weeks while silently deploying improved model weights. This decision preserved market timing while protecting user trust in locations where the model performs well.

In parallel, the team built the speaker identification pipeline from scratch—collecting voice embeddings, debugging diarization in live meetings, designing per-user voice libraries with Manifold persistence—while simultaneously investigating a Live Notes outage, pushing TE reliability from 75% toward 86%, and planning the M1/M2 milestone roadmap that will carry the product into 2027.

By the Numbers

3 Months at a Glance

123 Recordings
13 Active Days
198 Action Items
83 Notes in Jun 2–4
10+ Key Speakers
86% TE Reliability

Timeline

Phase by Phase

Phase 1 · Product Foundation

April 16–17 · 21 sessions

The team shaped the core Live Notes experience through two days of intensive testing and product discussions. The central challenge: wiretapping laws prevent live-streaming audio to a server, so the team designed a manual-upload mitigation where audio is stored on-device and uploaded by the user post-session.

Defined the product flow: "Hey Meta, start live notes" → on-device recording → manual upload → AI summary + action items
Tested iOS and Android builds — iOS working, Android hitting issues after recent changes
Prepared demo logistics for the "Better Together" room with Lawrence from MG
Identified intermittent ASR slowness on iOS (suspected client-side)
Voice command testing for "stop live notes" revealed intermittent reliability and sequential-stop handling edge cases
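
The product flow above can be read as a small state machine. The following is a minimal sketch under stated assumptions: the state and event names are hypothetical labels for the steps in the flow, not identifiers from the actual client. It illustrates one way a repeated "stop live notes" command could be treated as a no-op, and it keeps audio on-device until the user explicitly uploads.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    RECORDING = auto()        # audio is stored on-device only
    PENDING_UPLOAD = auto()   # waiting for the user to upload manually
    UPLOADED = auto()
    SUMMARIZED = auto()       # AI summary and action items are available


# Allowed transitions. Event names are hypothetical, chosen for illustration.
TRANSITIONS = {
    (State.IDLE, "start_live_notes"): State.RECORDING,
    (State.RECORDING, "stop_live_notes"): State.PENDING_UPLOAD,
    (State.PENDING_UPLOAD, "user_upload"): State.UPLOADED,
    (State.UPLOADED, "summary_ready"): State.SUMMARIZED,
}


def step(state: State, event: str) -> State:
    """Return the next state; events that are invalid in `state` are ignored."""
    return TRANSITIONS.get((state, event), state)


def run(events: list[str]) -> State:
    """Replay a sequence of events starting from IDLE."""
    state = State.IDLE
    for event in events:
        state = step(state, event)
    return state
```

With this shape, `run(["start_live_notes", "stop_live_notes", "stop_live_notes"])` ends in `PENDING_UPLOAD`: the second stop changes nothing, and no upload happens without the explicit `user_upload` event.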

Live Notes · Wiretapping Mitigation · Manual Upload · iOS/Android Testing · Demo Prep

Phase 2 · Quality & Data Pipeline

April 23 · 3 sessions · 17 action items

Focus shifted to measurement and data quality. The team confronted poor speaker distribution metrics beyond three speakers and decided to build a golden dataset for annotation. Model performance comparisons between Gemini and Avocado showed similar results, but reliability metrics were falling short.

Created golden dataset strategy for speaker distribution quality improvement
Addressed FLM routing issues and training data gaps
Discussed audio upload UX: batch processing, no auto-deletion, title/summary generation
Identified timestamp issues and summarization failures needing fixes
Started ramping up TE debugging strategy

Speaker Distribution · Golden Dataset · Model Evaluation · Audio Upload UX

Phase 3 · DF7 Planning & Stakeholder Management

April 29 – May 4 · 7 sessions · 43 action items

The team navigated dual pressures: a tight DF7 timeline (code complete by 5/12, dogfood by 5/19) and stakeholder skepticism about product quality. A comprehensive reliability document was commissioned to address concerns, while technical limitations of the 9B model's 16K context window threatened summary quality for long transcripts.

DF7 timeline set: ~10 days of coding before stabilization and dogfooding
9B model context window (16K) identified as risk for large transcripts
Audio reliability solution: phone-based storage with retry mechanism, in-memory caching on Nemo
E2E tests: Android partially ready, iOS blocked by missing test harness
Stakeholder one-pager commissioned: datasets, evaluation methods, historical issue resolution
Dogfooding bar defined as "good enough for signal" — less stringent than DF6 due to new infrastructure
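
One common mitigation for a fixed context window is to split a long transcript into chunks that each fit a token budget, summarize each chunk, and then summarize the summaries. The notes do not say the team chose this approach, so the sketch below is illustrative only. The word-count tokenizer is a crude stand-in for the model's real tokenizer, and the budget should be set below 16K to leave room for the prompt and the generated summary.

```python
def approx_tokens(text: str) -> int:
    # Assumption: one token per whitespace-separated word.
    # A real system would use the model's own tokenizer.
    return len(text.split())


def chunk_transcript(utterances: list[str], budget: int) -> list[list[str]]:
    """Greedily pack whole utterances into chunks of at most `budget` tokens.

    An utterance that exceeds the budget on its own is placed in a chunk
    by itself rather than being split mid-sentence.
    """
    chunks: list[list[str]] = []
    current: list[str] = []
    used = 0
    for utt in utterances:
        cost = approx_tokens(utt)
        # Start a new chunk when adding this utterance would overflow.
        if current and used + cost > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(utt)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Packing whole utterances keeps speaker turns intact, which tends to matter for summaries that attribute action items to people.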

"We have to work with the possibility that some of these things may take longer… We only have maximum 10 days for coding, and after that we will have to go stabilise."

DF7 Timeline · 9B Model · Stakeholder Document · E2E Testing · Reliability

Phase 4 · Ship v0 & Architecture Deep Dive

May 27–31 · 8 sessions · 10 action items

After weeks away, the team made a decisive call: stop debating product questions and ship the existing end-to-end v0 immediately, even with rough design. Meanwhile, a deep dive into OpenAI's WebRTC architecture (for ChatGPT Voice and Realtime API) informed the team's own real-time audio infrastructure decisions.

Ship decision: Consensus to push v0 to group chat for dogfooding immediately — "strip non-essential UI, test the code that's already written"
Live Notes status review with Anne: live transcripts shown despite poor ASR per product decision; summary quality prioritized over semantic endpointing
Session success rate: 75% with summary, 5 points below 80% target. Logging gaps identified — Falco logger tracking E2E but not in dashboard
Reliability fixes landed in build B276: RP connection drops, Wi-Fi/cellular switching, VIP fallback
WebRTC architecture study: OpenAI's relay-plus-transceiver design, stateless relay forwarding to stateful transceivers, single Go service for signaling + media termination

"The team agreed to stop debating product questions and ship the existing end-to-end v0 as soon as possible, even with a rough design."

Ship v0 · WebRTC Architecture · Reliability: 75% · Build B276 · OpenAI Study

Phase 5 · The Big Push

June 2–4 · 83 sessions · 120 action items

The defining three days of the quarter. The team tackled beta launch readiness, the speaker identification pipeline, a Live Notes outage, and multiple concurrent workstreams at a pace that produced two-thirds of all recordings in the entire period.

The Beta Launch Debate

The dominant thread: recommendation relevance at 72% versus the 85% target. This was debated across dozens of sessions, with Saurabh advocating for a delay to protect user trust, and Pooja pushing to maintain the launch window with a 91% itinerary satisfaction score as evidence the core product works.

The compromise: Launch on schedule with a major-cities-only geofence for the first two weeks. Saurabh's team silently deploys improved model weights while Pooja implements the geofence and UI restriction. This preserved market timing while constraining exposure to regions where the model performs well.

Speaker Identification Build

Parallel to the launch debate, the team built and debugged the speaker ID pipeline in live sessions:

Collected voice embeddings from team members (Colin, Alex, Saurabh, Pooja, Dan)
Debugged misidentification issues — system detected correct speaker count but tagged wrong names
Designed per-user voice library: centroid + rolling sample embeddings, wearer vs bystander distinction
Agreed on in-memory storage with Manifold persistence for ~50 tags per user
DF4 integrity filtering planned, date moved from June 11 to June 16
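
The per-user voice library described above (a centroid plus a rolling window of sample embeddings, capped at roughly 50 tags) can be sketched as follows. This is a minimal illustration, not the team's implementation: the class names, the window size of 5, the 0.7 cosine threshold, and the use of cosine similarity are all assumptions, and Manifold persistence and the wearer-versus-bystander distinction are left out.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class VoiceProfile:
    """One tagged speaker: rolling sample embeddings plus their centroid."""

    def __init__(self, name, max_samples=5):  # window size is an assumption
        self.name = name
        self.samples = deque(maxlen=max_samples)  # oldest sample drops out
        self.centroid = None

    def add(self, embedding):
        self.samples.append(list(embedding))
        n = len(self.samples)
        dim = len(embedding)
        # Recompute the centroid as the mean of the current window.
        self.centroid = [sum(s[i] for s in self.samples) / n for i in range(dim)]


class VoiceLibrary:
    """In-memory library for one user, capped at `max_tags` speakers."""

    def __init__(self, max_tags=50):
        self.max_tags = max_tags
        self.profiles = {}

    def enroll(self, name, embedding):
        if name not in self.profiles:
            if len(self.profiles) >= self.max_tags:
                raise ValueError("voice library is full")
            self.profiles[name] = VoiceProfile(name)
        self.profiles[name].add(embedding)

    def identify(self, embedding, threshold=0.7):  # threshold is an assumption
        """Return the best-matching name, or None if nothing clears the threshold."""
        best_name, best_score = None, threshold
        for name, profile in self.profiles.items():
            score = cosine(embedding, profile.centroid)
            if score >= best_score:
                best_name, best_score = name, score
        return best_name
```

Returning `None` below the threshold gives the pipeline a way to leave a speaker untagged instead of forcing a wrong name, which is the failure mode the debugging sessions surfaced.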

Live Notes Outage & Reliability

A Live Notes outage starting May 31 was traced to a Sunday routing change with secondary effects on SMC Service Discovery tiers. Reverting the change did not immediately restore service, and final summaries failed ~50% of the time.

Self/other diarization fix: training data merge had caused self speech to flip to "other" in silence regions — corrected with retrained model and updated VAD
TE reliability reached 86.11% with two failure modes remaining

Connect Launch Recap

Alex presented the successful Connect keynote: Mark's live handwriting demo at 30 wpm, Vanguards on the skate ramp, and DC Rainmaker's review titled "finally smart glasses that don't suck." Glasses are tracking to 4–5M units for 2025 (below the 10M target but 3x the prior year).

Beta Launch Decision · 72% vs 85% Relevance · Geofence Compromise · Speaker ID Pipeline · Live Notes Outage · TE Reliability 86% · Connect Recap

Phase 6 · Looking Ahead

June 10 · 1 session

The most recent recording maps out the M1/M2 milestone roadmap with Alex, setting the direction that carries the product into 2027.

Milestone 1: Mirror 2.0 architecture — on-device handling for forms, everything else routed through Hatch
Milestone 2: Open questions around Instagram and WhatsApp messaging integrations, privacy-versus-utility tradeoffs
2027 Decision: Hatch-only vs hybrid routing — informed by M2 learnings on messaging and privacy

M1/M2 Roadmap · IG/WhatsApp Integration · Privacy vs Utility · 2027 Strategy

Key Themes & Decisions

The Major Threads

Beta Launch: The 72% vs 85% Debate

The single most discussed topic across all recordings. Recommendation relevance sat at 72% against an 85% target, creating a genuine strategic tension between quality and timing.

Saurabh's position: Launching below target risks early negative reviews and lost user trust, especially for niche destinations where the model struggles. Proposed a two-week delay to integrate augmented training data.

Pooja's position: The core itinerary experience shows 91% user satisfaction. A clear launch window exists. Launch-blocking reliability issues (budget overflow bug, stale hotel price caching) are higher priority than incremental relevance gains.

Resolution: Launch on schedule with a temporary geofence restricting the beta to major cities for the first two weeks. Saurabh's team packages improved model weights for silent deployment. Pooja specs out the geofencing logic and temporary UI flag.
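
A geofence of this kind can be as simple as a distance check against an allowlist of city centers. The sketch below is illustrative only: the city list, the 50 km radius, and the function names are assumptions, and the notes do not specify whether the check applies to the user's location or to the requested destination.

```python
import math

# Hypothetical allowlist: (latitude, longitude) of city centers.
MAJOR_CITIES = {
    "New York": (40.7128, -74.0060),
    "London": (51.5074, -0.1278),
    "Tokyo": (35.6762, 139.6503),
}
RADIUS_KM = 50.0  # assumed radius around each city center


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers on a sphere of radius 6371 km."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def in_beta_geofence(lat, lon):
    """True if the point lies within RADIUS_KM of any allowlisted city."""
    return any(
        haversine_km(lat, lon, clat, clon) <= RADIUS_KM
        for clat, clon in MAJOR_CITIES.values()
    )
```

Keeping the allowlist and radius as plain data makes the restriction easy to widen or remove once the two-week window ends.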

Speaker Identification & Diarization

A technical thread that ran from April through June, evolving from basic measurement problems to a working prototype:

April: Poor speaker distribution metrics beyond 3 speakers; golden dataset commissioned
May: Diarization only working in factory settings; iOS E2E tests blocked
June 2–4: Live debugging sessions with 6+ test recordings; embedding system built with centroid + rolling samples; misidentification issues traced to noisy environments and incorrect tagging
Storage: Lightweight in-memory with Manifold persistence for ~50 tags per user, after Warm Storage stopped onboarding
Self/other fix: Training data merge caused self speech to flip in silence regions — retrained model + updated VAD resolved it

Live Notes Reliability

The team tracked reliability from 75% to 86%+ over the quarter:

Session success rate tracking gaps: logging only started after Nemo server connection, missing client-side failures
May 31 outage: routing change → secondary effects on SMC Service Discovery → final summaries failing ~50%
Build B276 fixes: RP connection drops, Wi-Fi/cellular switching, VIP fallback
TE reliability reached 86.11% by early June with two failure modes remaining
ASR quality: product decided to show live transcripts despite poor performance; summary quality prioritized over semantic endpointing
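
The logging gap noted above matters because sessions that fail before reaching the server never enter the denominator, which inflates the reported rate. The following sketch shows the arithmetic. The function names are hypothetical and the counts in the usage note are made-up illustrative values, not figures from the recordings.

```python
def success_rate(successes: int, attempts: int) -> float:
    """Percentage of attempts that succeeded."""
    if attempts == 0:
        raise ValueError("no attempts recorded")
    return 100.0 * successes / attempts


def observed_vs_actual(successes: int, server_failures: int, client_failures: int):
    """Compare the rate seen by server-side logging with the end-to-end rate.

    `client_failures` are sessions that failed before connecting to the
    server, so server-side logging never counted them.
    """
    observed = success_rate(successes, successes + server_failures)
    actual = success_rate(successes, successes + server_failures + client_failures)
    return observed, actual
```

For example, with 75 successes, 25 server-logged failures, and 10 unlogged client-side failures, the server-side view reports 75.0% while the end-to-end rate is about 68.2%.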

Dogfooding Milestones (DF6 → DF7)

The team managed aggressive dogfooding timelines while balancing quality expectations:

DF6: Established quality bar; poor logging skewed reliability numbers despite product working ("vibe check" exit criteria)
DF7: Code complete target 5/12, dogfood 5/19; acknowledged exit criteria won't be as clean as DF6
DF4 (Integrity): Filtering work planned, with the date moved from June 11 to June 16; needed alignment from Alyssa's team
Ambient audio dogfooding set for June 17 with Nemo path on June 9
Key shift: "Stop debating, ship v0" — consensus to prioritize getting signal over polish

Architecture & Infrastructure

Several architectural decisions shaped the platform's direction:

Wiretapping mitigation: Manual upload flow as legal-safe path; on-device audio storage with phone-based retry mechanism
WebRTC study: Deep dive into OpenAI's ChatGPT Voice architecture — relay-plus-transceiver design, Go service handling signaling + media termination, one-port-per-session tradeoffs
M1/M2 roadmap: M1 mirrors 2.0 with on-device forms handling; M2 explores Instagram/WhatsApp integrations with privacy-utility tradeoffs informing 2027 Hatch-only vs hybrid routing decision
Open server adoption: Team discussed operating more as a backend team while supporting experience team on first-party apps
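
The phone-based retry mechanism mentioned above could take the form of a bounded retry loop with exponential backoff, with the audio remaining on the phone until an upload succeeds. This is a sketch under assumptions: `upload_with_retry`, its parameters, and the choice of `ConnectionError` as the retryable failure are hypothetical, and the notes do not describe the actual retry policy.

```python
import time


def upload_with_retry(upload, payload, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call `upload(payload)` until it succeeds or attempts run out.

    `upload` is any callable that raises ConnectionError on a transient
    failure. The payload is never discarded here, so the caller can keep
    the audio on-device and try again later if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return upload(payload)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up; the caller still holds the payload
            # Exponential backoff: 0.5s, 1s, 2s, ... with the default base_delay.
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so the backoff schedule can be tested without waiting. A production version would likely add jitter and a cap on the delay.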

Outstanding Action Items

Key Deliverables

Consolidated and deduplicated from 198 raw action items across all sessions. Grouped by workstream.

Beta Launch Execution

Pooja — Implement geofence and temporary UI flag to restrict beta to major cities only for first 2 weeks

Assigned June 2–4 · Spec by EOD, implementation to follow

Saurabh — Package newly trained model weights for silent deployment within first week of beta launch

Assigned June 2–4 · Start immediately with team

Team — Fix budget overflow bug (model suggests options exceeding stated budget)

Assigned June 2 · Launch blocker

Team — Fix caching bug (stale hotel prices in ~10% of searches)

Assigned June 2 · Launch blocker

Raunaq — Augment training data with curated travel blogs and local tourism content; updated weights by end of next week

Assigned June 2

Speaker ID & Diarization

Colin — Assign embedding module update work to Jigar

Assigned June 3

Alex — Check if improvements can be made to auto speaker code; ping group when ready

Assigned June 3

Team — Get embeddings inside PE (legal won't allow putting it on open source)

Assigned June 4

Colin — Release new VAD with fixed hyperparameters (by EOD June 3)

Assigned June 3

Live Notes & Reliability

Pooja — Trace routing change that caused secondary effects on SMC tier and Service Discovery

Assigned June 2

Colin — Ask internal team to use staging gear with more logging for remaining error buckets

Assigned June 3

Raunaq — Follow up on contextual biasing work (blocked by design)

Assigned May 27

Raunaq — Speak with Honey about iOS client-side work and getting Limitless team to help with reliability

Assigned May 27

Documentation & Planning

Raunaq — Create half-pager: data model for pending/tagged audio snippets, insert/update flows

Assigned June 3 · Send to team immediately

Colin & Raunaq — Wrap up and share doc with team (Ahmed requesting it)

Assigned June 3 · By tomorrow

Raunaq — Re-engage legal team regarding data storage; consult with Nemo team re: Everstore

Assigned April 29

Key People

The Team

Most frequently appearing speakers across all recordings.

Raunaq

Voice AI Platform Lead · Senior Staff Engineer

Present in 80+ sessions

Pooja

Product / UX · Beta launch, geofencing, UI

36 sessions

Saurabh

ML / Model Quality · Recommendation relevance

28 sessions

Colin

Speaker ID / Diarization · VAD, integrity filtering

17 sessions

Alex

Platform & Integrations · Connect launch, Apple

12 sessions

Dan

Engineering · Audio pipeline, code changes

12 sessions

Jigar

Engineering · Embedding module updates

4 sessions

Anne

Live Notes · ASR quality, transcript visibility

Referenced in key reviews

Brad

ML · 9B model experiments, segmentation

Referenced in DF7 planning

Mathias & Dan

Stakeholders · DF7 dogfooding bar alignment

Referenced in stakeholder mgmt

Recording Activity

When You Recorded

A heatmap of recording intensity across the quarter. June 2–4 stands out as the most intense working period.

Sessions per recording day, by month:
April: 5, 16, 3, 4
May: 1, 2, 1, 5
June: 21, 24, 38, 1

Each block represents one recording day. Color intensity reflects session volume.