Anthropic IC7 System Design

Deep Questions & Answers — May 22, 2026 Panel Prep

What IC7/Staff-level looks like in the room: You don't wait for questions. You drive requirements, state trade-offs before asked, make decisions and defend them with numbers, think about who operates this at 3 AM, propose SLOs unprompted, and show judgment about what NOT to build. Every answer below models this behavior.

Section 1: Chat 1:1 System (Q4)

Interviewers: Jake Luciani (Cassandra PMC, ex-DataStax Chief Architect) & Alexander Knott
Friday May 22, 12:00 – 12:55 PM PT

Q1: JAKE WILL OPEN WITH THIS

"Design a 1:1 chat system for Claude — users send messages, Claude responds with streamed tokens."

IC7 opening (first 30 seconds): "Before I draw anything — let me scope this. We're building the conversational interface for Claude: a user sends a text message, the system orchestrates inference, and streams tokens back. I'll focus on 1:1 chat, not group. I'll design for 50M monthly active users, 10 messages/conversation average, with P99 time-to-first-token under 500ms. Safety filtering is a first-class architectural constraint, not a bolt-on. I'll explicitly call out where I'm making trade-offs."

Requirements I'm driving:

Functional: Create conversation, send message, receive streamed response, view history, search conversations, multi-device sync, conversation forking/branching
Non-functional: P99 TTFT < 500ms, message durability 99.999%, read-after-write consistency for same-user, eventual consistency cross-device (< 2s), 50M MAU / 1.5B messages/day
Safety: Every user message passes through input classifier before reaching inference. Every Claude response passes through output classifier before reaching user. Audit log for every message, immutable.

Back-of-envelope:

50M MAU × 3 conversations/day × 10 messages/conv = 1.5B messages/day (upper bound: treats every monthly active user as active every day)
1.5B / 86,400 ≈ 17,400 messages/sec (write QPS)
Avg message: 200 tokens × 4 bytes = 800 bytes + metadata ≈ 1 KB
Daily storage: 1.5B × 1 KB = 1.5 TB/day, 547 TB/year
Active WebSocket connections (peak): ~5M concurrent
Token generation: ~60 tokens/sec per stream × 5M concurrent streams = up to ~300M tokens/sec if every connection streams at once; this upper bound sets the budget for the inference fleet
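The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch that only restates the figures already given; the constant names are illustrative, and the daily-activity assumption (every MAU active every day) makes the result an upper bound.

```python
# Back-of-envelope for the chat system, using only the figures stated above.
MAU = 50_000_000
CONVERSATIONS_PER_USER_PER_DAY = 3   # assumes every MAU is active daily (upper bound)
MESSAGES_PER_CONVERSATION = 10
BYTES_PER_MESSAGE = 1_000            # ~800 bytes of text plus metadata, rounded to 1 KB
SECONDS_PER_DAY = 86_400
PEAK_CONNECTIONS = 5_000_000
TOKENS_PER_SEC_PER_STREAM = 60

messages_per_day = MAU * CONVERSATIONS_PER_USER_PER_DAY * MESSAGES_PER_CONVERSATION
write_qps = messages_per_day / SECONDS_PER_DAY
storage_tb_per_day = messages_per_day * BYTES_PER_MESSAGE / 1e12
storage_tb_per_year = storage_tb_per_day * 365
peak_tokens_per_sec = PEAK_CONNECTIONS * TOKENS_PER_SEC_PER_STREAM

print(f"{messages_per_day:,} messages/day")
print(f"{write_qps:,.0f} writes/sec")
print(f"{storage_tb_per_day:.1f} TB/day, {storage_tb_per_year:.1f} TB/year")
print(f"{peak_tokens_per_sec:,} tokens/sec if every connection streams at once")
```

Keeping the inputs as named constants makes it easy to rerun the numbers live if an interviewer changes an assumption.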

Fig 1: Full System Architecture

Request path: Client (Web/Mobile) → API Gateway (Auth, Rate Limit) → Chat Service (Orchestration) → Safety Pipeline (Input + Output) → Inference Router (Model Selection) → GPU Inference Fleet (Token Generation)
Stores: Message Store (Cassandra, messages) · Conversation Meta (PostgreSQL, metadata) · Search Index (Elasticsearch)
Supporting services: Connection Manager (SSE/WebSocket) · Presence Service (Device tracking) · Notification Svc (Push / Email)
↓
Message Bus (Kafka): Event sourcing · Async fanout · Audit trail

Fig 1: End-to-end architecture. Safety pipeline sits between user input and inference, and between inference output and user delivery — two distinct checkpoints.

Key architectural decisions:

SSE over WebSocket for token streaming: Claude responses are server-initiated, unidirectional streams. SSE is simpler, works through HTTP/2 proxies and CDN edges, auto-reconnects natively. WebSocket adds bidirectional complexity we don't need for token delivery. User messages go via standard POST.
Cassandra for messages, PostgreSQL for conversation metadata: Messages are append-heavy, partitioned by conversation_id, queried by recency. Cassandra's write-optimized wide-column model is ideal. Conversation metadata (title, user_id, model, created_at) is relational and low-volume — PostgreSQL is simpler and supports the join patterns we need for search/filtering.
Kafka as the event spine: Every message write publishes to Kafka. Consumers: search indexer, billing/metering, audit log writer, analytics, notification service. Decouples the hot path from downstream processing.

Trade-off: I chose SSE + POST over WebSocket. We lose bidirectional streaming, but gain HTTP/2 multiplexing, simpler CDN compatibility, and automatic reconnection. For a chat product where user messages are small and infrequent compared to token streams, this is the right call.

Safety: The safety pipeline is NOT a sidecar or async step. It's synchronous and in the critical path: user message → input classifier (Constitutional Classifier, ~20ms P99) → inference → output classifier (token-level + completion-level) → user. If the input classifier rejects, we return a canned safety response and log the attempt. The ~20ms latency cost is non-negotiable.

Q2: JAKE'S DEEP DIVE

"How do you model conversations and messages in storage? Walk me through the schema."

IC7 signal: "I'll walk through both the Cassandra message table and the PostgreSQL metadata table, and explain why I split them — then I'll show you the partition key reasoning, because that's where this design lives or dies at scale."

Fig 2: Data Model — Cassandra Messages Table

Column	Type	Key	Notes
conversation_id	UUID	Partition Key	All messages for one conversation co-located
message_id	ULID	Clustering Key ↓ DESC	Time-ordered, globally unique, lexicographic sort
role	TEXT		'user' \| 'assistant' \| 'system'
content	TEXT		Message body (may be large for Claude responses)
token_count	INT		Metering: input or output tokens
model_version	TEXT		'claude-4-opus-20260501'
safety_result	TEXT		'pass' \| 'filtered' \| 'flagged'
created_at	TIMESTAMP		Wall clock (ULID encodes this too)
metadata	MAP<TEXT,TEXT>		Extensible: stop_reason, latency_ms, etc.

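Partition key = conversation_id. Clustering key = message_id DESC (most recent first). A query for "last 20 messages" is a single-partition read with LIMIT 20: rows are already stored newest-first, so Cassandra reads a contiguous slice instead of sorting. It may still touch more than one SSTable until compaction catches up, so "one disk seek" is the best case, not a guarantee.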

Fig 2b: Data Model — PostgreSQL Conversations Table

Column	Type	Key	Notes
conversation_id	UUID	PK	Matches Cassandra partition key
user_id	UUID	INDEX	Owner
title	TEXT		Auto-generated or user-set
model	TEXT		Primary model used
message_count	INT		Denormalized counter
last_message_at	TIMESTAMP	INDEX	For "recent conversations" query
created_at	TIMESTAMP
is_archived	BOOLEAN		Soft delete

PostgreSQL handles the conversation list, filtering, and join patterns. Query: "Show me my last 20 conversations" → SELECT * FROM conversations WHERE user_id = ? ORDER BY last_message_at DESC LIMIT 20

Why ULIDs, not UUIDs or timestamps:

ULIDs encode millisecond-precision time in the first 48 bits, followed by 80 bits of randomness. They sort lexicographically in time order — so Cassandra's clustering key ordering gives us chronological message order for free.
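Unlike UUIDv1 (which also encodes time, but places the low bits of the timestamp first, so its string form does not sort chronologically), ULIDs sort by creation time. A monotonic ULID generator also increments the random component for IDs minted within the same millisecond, which removes ordering ambiguity for rapid successive messages from the same node.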
Unlike plain timestamps, they're globally unique without coordination — no sequence generator needed across distributed nodes.
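To make the ULID layout concrete, here is a minimal sketch of a monotonic generator: 48 bits of millisecond timestamp, 80 bits of randomness, encoded as 26 Crockford base32 characters. The class and function names are illustrative, not taken from any particular library, and the optional `now_ms` parameter exists only so the ordering can be demonstrated deterministically.

```python
import os
import time

# Crockford base32 alphabet; its characters are in ascending ASCII order,
# so comparing encoded strings matches comparing the underlying 128-bit values.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode_ulid(timestamp_ms: int, randomness: int) -> str:
    """Pack a 48-bit timestamp and 80-bit random value into 26 characters."""
    value = (timestamp_ms << 80) | randomness
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


class MonotonicUlidGenerator:
    """Per-node generator: IDs minted in the same millisecond still increase."""

    def __init__(self):
        self._last_ms = -1
        self._last_rand = 0

    def new(self, now_ms=None) -> str:
        ms = int(time.time() * 1000) if now_ms is None else now_ms
        if ms <= self._last_ms:
            # Same millisecond (or clock stepped back): keep the timestamp
            # and bump the random part so the new ID sorts after the last one.
            ms = self._last_ms
            self._last_rand += 1
            if self._last_rand >> 80:
                raise OverflowError("random component exhausted in this millisecond")
        else:
            self._last_rand = int.from_bytes(os.urandom(10), "big")
        self._last_ms = ms
        return encode_ulid(ms, self._last_rand)


gen = MonotonicUlidGenerator()
first = gen.new(now_ms=1_000)
second = gen.new(now_ms=1_000)   # same millisecond
third = gen.new(now_ms=1_001)
print(first < second < third)
```

Monotonicity here holds per generator instance only. Two nodes minting IDs in the same millisecond are ordered by their random bits, which is one reason per-conversation serialization is still useful.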

Trade-off: Splitting messages (Cassandra) from metadata (PostgreSQL) adds operational complexity (two data stores to maintain). The alternative — all in PostgreSQL — works up to ~10M conversations, but the append-heavy write pattern of messages (1.5B/day) and the wide-row scan pattern ("last N messages in conversation") are exactly what Cassandra optimizes for. At our scale, the operational cost of two stores is justified by 10x better write throughput and predictable read latency.

Q3: FOLLOW-UP PROBE

"What happens when a user reconnects mid-stream during token generation? They lost WiFi for 5 seconds."

IC7 signal: "This is one of the trickiest UX problems in streaming chat — and the answer is different depending on whether we're using SSE or WebSocket. I chose SSE specifically because it handles this more gracefully."

Fig 3: Reconnection Sequence Flow

Client	Server (Chat Service + Connection Manager)
SSE open	Stream begins. Each SSE event: id: {msg_id}:{token_offset}, data: {token}
Tokens 0-47 received	Server has sent tokens 0-47. Buffer continues to fill server-side.
WiFi drops	SSE connection broken. Server detects the dead connection within ~5s (failed write or keepalive timeout; a dropped WiFi link usually sends no TCP RST). Server keeps generating: tokens 48-85 buffered in Redis ring buffer (TTL: 5 min, keyed by stream:{msg_id}).
WiFi back	Client reconnects SSE with header: Last-Event-ID: {msg_id}:47
Resume	Server reads from Redis ring buffer, replays tokens 48-85, then resumes live stream from token 86+.
Complete	Stream finishes. Full message written to Cassandra. Redis buffer evicted.

Fig 3: SSE reconnection with Last-Event-ID. The server-side ring buffer bridges the gap. No tokens are lost, no duplicate generation needed.

Key design elements:

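SSE's Last-Event-ID is the protocol-native reconnection mechanism. A browser EventSource sends it automatically on reconnect, so web clients need no extra code; native mobile or SDK clients that parse SSE themselves must store the last event id and send the header explicitly.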
Server-side ring buffer (Redis): Each active stream writes tokens to a Redis list keyed by stream:{message_id} with 5-minute TTL. This is cheap — a typical response is ~500 tokens × 5 bytes = 2.5 KB per stream.
Inference continues during disconnect: We do NOT cancel GPU inference when the client disconnects for < 30 seconds. GPU time is expensive; re-running inference is worse than buffering 2.5 KB. For disconnects > 30s, we cancel inference and write the partial response to Cassandra.

Trade-off: We pay ~2.5 KB Redis memory per active stream to avoid re-running $0.02-0.10 of GPU inference. At 5M concurrent streams, that's 12.5 GB of Redis — trivially small. The alternative (cancel and re-run on reconnect) wastes GPU capacity and adds 2-10s of latency to the user experience.
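The replay step can be sketched as follows. This is a minimal illustration, assuming an in-memory dictionary as a stand-in for the Redis buffer keyed by stream:{msg_id}; `StreamBuffer`, `sse_event`, `parse_last_event_id`, and `replay` are hypothetical names, and TTL handling is left out.

```python
from dataclasses import dataclass, field


@dataclass
class StreamBuffer:
    """In-memory stand-in for the per-stream Redis buffer (stream:{msg_id})."""
    tokens: list = field(default_factory=list)

    def append(self, token: str) -> int:
        self.tokens.append(token)
        return len(self.tokens) - 1   # offset of the token just stored


def sse_event(msg_id: str, offset: int, token: str) -> str:
    # Assumes the token contains no newline; SSE would need one
    # "data:" line per line of payload otherwise.
    return f"id: {msg_id}:{offset}\ndata: {token}\n\n"


def parse_last_event_id(header: str):
    msg_id, _, offset = header.rpartition(":")
    return msg_id, int(offset)


def replay(buffers: dict, last_event_id: str) -> list:
    """Return the SSE events the client missed, in order."""
    msg_id, last_offset = parse_last_event_id(last_event_id)
    buf = buffers.get(msg_id)
    if buf is None:
        return []   # buffer expired: caller falls back to the stored message
    start = last_offset + 1
    return [sse_event(msg_id, i, t)
            for i, t in enumerate(buf.tokens[start:], start=start)]


buffers = {"msg_1": StreamBuffer()}
for n in range(86):                      # tokens 0-85 generated so far
    buffers["msg_1"].append(f"tok{n}")

missed = replay(buffers, "msg_1:47")     # client last saw offset 47
print(len(missed), missed[0].splitlines()[0])
```

Because the event id carries the offset, replay is idempotent: a client that reconnects twice with the same Last-Event-ID gets the same events again rather than a gap or a duplicate generation.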

Q4

"How do you handle message ordering across distributed nodes? What if two messages arrive out of order?"

IC7 signal: "In a 1:1 chat with an LLM, ordering is simpler than in a multi-party system — but there are still real edge cases I want to address."

Why ordering is (mostly) simpler here: In a Claude 1:1 chat, the conversation is strictly turn-based: user sends → Claude responds → user sends. There's natural serialization because Claude can't respond until the user message arrives, and the user typically waits for Claude's response. This removes most of the ordering problems a multi-party system has to solve.

Where ordering still matters:

Rapid successive messages: User sends "What is X?" then immediately sends "Actually, what about Y?" before Claude starts responding. Both hit the Chat Service within milliseconds.
Multi-device: User sends from phone, then immediately sends from laptop — different TCP connections, different server instances.
Conversation branching: User edits a previous message, creating a fork — which message is the "real" next one?

Solution: ULIDs + optimistic concurrency on conversation version:

Each message gets a ULID (time-ordered, monotonic within a node). Cassandra's clustering key sorts these correctly.
Each conversation has a version counter in PostgreSQL. When the Chat Service processes a user message, it does: UPDATE conversations SET version = version + 1 WHERE conversation_id = ? AND version = ?. If the CAS fails (concurrent write from another device), retry with conflict resolution.
For the "rapid messages" case: queue them in the Chat Service and process sequentially per conversation. A per-conversation in-memory lock (or Redis distributed lock with ~100ms TTL) ensures only one message is being processed at a time for a given conversation.

Trade-off: The per-conversation lock serializes concurrent messages, adding ~5ms of latency in the rare case of true simultaneous sends. This is acceptable because (a) it's rare in 1:1 chat, (b) it guarantees the LLM sees messages in the order the user intended, and (c) the alternative — accepting out-of-order messages and trying to reorder — is far more complex and error-prone for the inference context window.
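The compare-and-set on the conversation version can be demonstrated end to end. This is a sketch only: it uses an in-memory SQLite database as a stand-in for PostgreSQL (the UPDATE statement is the same), and `try_advance_version` is a hypothetical helper name.

```python
import sqlite3


def try_advance_version(conn, conversation_id: str, expected_version: int) -> bool:
    """Bump the version only if nobody else has bumped it since we read it."""
    cur = conn.execute(
        "UPDATE conversations SET version = version + 1 "
        "WHERE conversation_id = ? AND version = ?",
        (conversation_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1   # 0 rows updated means the CAS lost the race


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE conversations ("
    "conversation_id TEXT PRIMARY KEY, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO conversations VALUES ('conv-1', 7)")
conn.commit()

# Phone and laptop both read version 7, then both try to write.
phone_ok = try_advance_version(conn, "conv-1", 7)
laptop_ok = try_advance_version(conn, "conv-1", 7)
current = conn.execute(
    "SELECT version FROM conversations WHERE conversation_id = 'conv-1'"
).fetchone()[0]
print(phone_ok, laptop_ok, current)
```

The losing writer re-reads the version and retries, so its message is sequenced after the winner's rather than dropped.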

Q5: JAKE'S CASSANDRA PROBE

"A single conversation goes viral with 100K messages. How does your storage handle it?"

IC7 signal: "This is the wide partition problem, and it's one of the first things I think about when I choose conversation_id as my partition key. Let me walk through why I'm not worried for our use case, but also what I'd do if I were."

The math:

100K messages × 1 KB avg = 100 MB per partition
Cassandra recommendation: partitions under 100 MB, ideally under 10 MB
100K messages in a Claude conversation = borderline

Why this is unlikely but addressable:

Real-world distribution: 99.9% of Claude conversations have < 200 messages (Claude's context window is ~200K tokens; at ~50 tokens/message, that's ~4000 messages max before the context is full). A 100K-message conversation would require the user to explicitly clear context and continue hundreds of times.
But if it happens (defensive design):

Bucketed partitioning strategy:

Fig 4: Partition Bucketing for Wide Conversations

Column	Type	Key	Notes
conversation_id	UUID	Partition Key (1/2)	Conversation identifier
bucket	INT	Partition Key (2/2)	`bucket = message_sequence / 1000`
message_id	ULID	Clustering Key ↓	Time-ordered within bucket
...same columns as before...

Compound partition key: (conversation_id, bucket). Each bucket holds at most 1,000 messages (~1 MB). A 100K-message conversation spans 100 partitions, each well under Cassandra's comfort zone.

Query pattern changes:

"Last 20 messages" → query the latest bucket first. If it has < 20, query the previous bucket. At most 2 partition reads for the common case.
"All messages" → parallel query across all buckets for the conversation, merge by ULID order client-side.
A conversation_buckets counter in PostgreSQL tracks the current bucket number, so we always know the latest bucket without scanning.
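The bucketed read path can be sketched as below. The dictionary keyed by (conversation_id, bucket) is an in-memory stand-in for Cassandra partitions, and `bucket_for` and `last_n_messages` are hypothetical names; integer sequence numbers stand in for message rows.

```python
BUCKET_SIZE = 1000   # messages per bucket, as in the schema above


def bucket_for(message_sequence: int) -> int:
    return message_sequence // BUCKET_SIZE


def last_n_messages(partitions: dict, conversation_id: str,
                    latest_bucket: int, n: int = 20):
    """Walk buckets newest-first until n messages are collected.

    Each partition holds its rows newest-first, mirroring the DESC clustering
    order. Returns the rows and the number of partition reads used.
    """
    result, reads = [], 0
    bucket = latest_bucket
    while len(result) < n and bucket >= 0:
        rows = partitions.get((conversation_id, bucket), [])
        reads += 1
        result.extend(rows[: n - len(result)])
        bucket -= 1
    return result, reads


# A conversation with 2,005 messages: buckets 0 and 1 are full, bucket 2 has 5.
partitions = {}
for seq in range(2005):
    partitions.setdefault(("conv-1", bucket_for(seq)), []).insert(0, seq)

rows, reads = last_n_messages(partitions, "conv-1", latest_bucket=2)
print(rows[0], rows[-1], reads)
```

The worst case for the common query is a nearly empty latest bucket, which costs the second read shown here.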

But here's my IC7 judgment call: I would NOT implement bucketing on day one. I'd ship with simple conversation_id partitioning, set a monitoring alert at partition size > 50 MB, and add bucketing when we see real data hit that threshold. Premature optimization of a partition scheme that affects every query is worse than a migration later. The 99.9th percentile conversation will never hit this limit.

Trade-off: Simple partition key now (one column) vs. compound key (two columns) later. Simple key means every query is a single partition read — maximally efficient. Compound key adds a bucket lookup step but handles unbounded conversations. I optimize for the common case and have a clear migration path for the extreme case.

Q6

"How do you implement search across all conversations for a user?"

IC7 signal: "Search is fundamentally a different access pattern from chat retrieval, and I would never bolt it onto Cassandra. This is a separate indexing pipeline."

Fig 5: Search Indexing Pipeline

Indexing path: Message Write (Cassandra) → Kafka (message.created topic) → Search Indexer (Consumer group) → Elasticsearch (Full-text index)
Query path: User Query ("find my recipe chat") → Search Service (Query + filter) → Elasticsearch (user_id filter + text match) → Results (conv_id + snippet)

Fig 5: Async indexing via Kafka. Search never touches Cassandra directly. Elasticsearch documents are routed by user_id for tenant isolation.

Elasticsearch index schema:

Index per month (time-based rollover for retention and performance): messages-2026-05
Routing key: user_id. All of one user's messages within a monthly index land on the same shard, so a search touches one shard per index queried instead of every shard
Fields: user_id (keyword), conversation_id (keyword), role (keyword), content (text, analyzed), created_at (date), model (keyword)
Query: {"bool": {"filter": [{"term": {"user_id": "..."}}], "must": [{"match": {"content": "recipe"}}]}}

Search latency target: P99 < 200ms. This is achievable with user_id routing (single-shard) and warm caches for active users.

Why not Cassandra SASI or SAI? Jake may probe this directly given his DataStax background. SAI (Storage-Attached Indexing) in Cassandra 5.0+ supports only limited text matching, and its indexes are local to each node, so a query that is not restricted to a single partition (here, a single conversation_id) has to fan out across the cluster. Searching all of a user's conversations is exactly that kind of query. Elasticsearch gives us cross-partition full-text search with relevance scoring, which is what users actually need.

Safety: The search index inherits the safety classification from the message. Filtered messages (where the safety pipeline blocked content) are indexed with safety_result: "filtered" and excluded from search results by default. This prevents users from searching for and retrieving content that was blocked by the safety pipeline.
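A small query builder shows how the user filter, the text match, and the default exclusion of filtered messages fit into one bool query. This is a sketch: `build_search_query` is a hypothetical helper, and it assumes safety_result is indexed as a keyword field alongside the fields listed above.

```python
import json


def build_search_query(user_id: str, text: str,
                       include_filtered: bool = False, size: int = 20) -> dict:
    """Build the search request body for one user's conversations."""
    bool_query = {
        # Filter context: exact match on the tenant, no effect on scoring.
        "filter": [{"term": {"user_id": user_id}}],
        # Query context: analyzed full-text match, contributes to relevance.
        "must": [{"match": {"content": text}}],
    }
    if not include_filtered:
        # Hide messages the safety pipeline blocked unless explicitly requested.
        bool_query["must_not"] = [{"term": {"safety_result": "filtered"}}]
    return {"query": {"bool": bool_query}, "size": size}


body = build_search_query("usr_abc", "recipe")
print(json.dumps(body, indent=2))
# Assumption: the caller also passes the user_id as the routing value on the
# search request so that only that user's shard is queried.
```

Putting user_id in filter context instead of must keeps it cacheable and out of the relevance score.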

Trade-off: Adding Elasticsearch is a third data store (Cassandra + PostgreSQL + ES). The operational cost is real. But search is a first-class feature expectation, and the alternative (scanning Cassandra partitions or relying on SAI) doesn't meet the latency or relevance requirements. I'd staff a dedicated search-infra on-call rotation for this component.

Q7: SAFETY DEEP DIVE

"Walk me through the safety pipeline for each message — both user input and Claude output."

IC7 signal: "Safety is the reason Anthropic exists. I'm going to design this as a synchronous, mandatory pipeline — not an optional middleware. Every message, every token, every time."

Fig 6: Safety Pipeline — Dual-Gate Architecture

User Message → GATE 1: Input Classifier (Constitutional Classifier · Prompt injection detector · PII scanner; P99 latency: ~20ms)
GATE 1 PASS → Inference (GPU Fleet)
GATE 1 BLOCK → Safety Response ("I can't help with...")
Inference (streaming tokens) → GATE 2: Output Classifier (Token-level monitor (sliding window) · Completion-level classifier · Hallucination flag; Inline: ~2ms/token | End-of-stream: ~30ms)
GATE 2 PASS → Deliver to User (SSE stream)
GATE 2 INTERCEPT → Truncate + Warn (Mid-stream safety stop)
↓
Audit Log (Immutable): Every message + classifier result + latency + model version · Kafka → S3 cold storage

Fig 6: Dual-gate safety architecture. Input gate blocks harmful prompts before they reach inference. Output gate monitors tokens as they stream and can truncate mid-response.

Gate 1 — Input Classification (pre-inference):

Constitutional Classifier: Lightweight model trained on Anthropic's constitution. Classifies input across harm categories (violence, CSAM, weapons, PII extraction, jailbreaks). Binary pass/block with confidence score.
Prompt injection detector: Pattern matching + learned classifier for attempts to override system prompts. Particularly important for API users who set custom system prompts.
PII scanner: Regex + NER for credit cards, SSNs, API keys. Can redact before passing to inference rather than blocking entirely.
Latency budget: All three run in parallel. Combined P99: ~20ms. This is pre-inference, so it's additive to TTFT.

Gate 2 — Output Classification (during + post-inference):

Token-level sliding window: Every N tokens (e.g., 10), run a lightweight classifier on the last 50 tokens. If harmful content is emerging, truncate the stream immediately and send a safety stop event. ~2ms per check.
Completion-level classifier: After the full response is generated, run a comprehensive classifier. If it fails, replace the response with a safety message (even if tokens were already streamed — send a correction event).
This two-layer approach catches both gradual drift and full-response failures.
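The token-level sliding window can be sketched as a generator that wraps the token stream. This is an illustration under stated assumptions: `monitored_stream` and `SAFETY_STOP` are hypothetical names, and `is_harmful` is a stub standing in for the lightweight classifier.

```python
from collections import deque

SAFETY_STOP = "[safety_stop]"   # hypothetical marker for the safety stop event


def monitored_stream(tokens, is_harmful, check_every=10, window=50):
    """Yield tokens, checking the last `window` tokens every `check_every` tokens.

    On a failed check the stream ends with a safety stop. Tokens emitted since
    the previous check have already been delivered, which is why the
    completion-level classifier is still needed afterwards.
    """
    recent = deque(maxlen=window)
    for count, token in enumerate(tokens, start=1):
        recent.append(token)
        if count % check_every == 0 and is_harmful(list(recent)):
            yield SAFETY_STOP
            return
        yield token


def stub_classifier(window_tokens):
    return "BAD" in window_tokens


stream = ["ok"] * 25 + ["BAD"] + ["ok"] * 30   # harmful token at position 26
delivered = list(monitored_stream(stream, stub_classifier))
print(len(delivered), delivered[-1])
```

In this run the harmful token at position 26 is delivered before the check at token 30 fires, which shows the latency-versus-leakage trade-off behind the choice of N.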

Safety: The audit log is append-only and immutable. Every message (user and assistant), every classifier result, every latency measurement, every model version is logged. This feeds the RSP (Responsible Scaling Policy) compliance pipeline. Retention: 7 years for safety-flagged content, 1 year for normal content.

Q8

"How do you handle billing and token counting with streaming responses?"

IC7 signal: "Billing accuracy is a trust issue, not just an accounting issue. If we overcharge, we lose customers. If we undercharge, we lose money on GPU inference — the most expensive line item in the company. I'll design for exact counting, not estimation."

Token counting architecture:

Input tokens: Counted by the Chat Service before sending to inference. The tokenizer runs on CPU — it's deterministic and fast (~1ms for a typical message). The count is written to the message record and published to Kafka.
Output tokens: Counted by the inference worker as it generates. Each token emission increments a counter. The final count is written to the message record when the stream completes. For interrupted streams (client disconnect, safety truncation), we count tokens actually generated (which cost GPU time), not tokens delivered.
Metering pipeline: Kafka consumer aggregates token counts per user/org per hour into a time-series store (ClickHouse or TimescaleDB). This feeds the billing API and usage dashboards.

Edge cases that matter for billing:

Safety-filtered responses: If the input classifier blocks a message, the user is NOT charged (no inference ran). If the output classifier truncates mid-stream, the user IS charged for tokens generated (GPU time was consumed).
Reconnection: Tokens replayed from the Redis buffer are not double-counted — the count was set when they were first generated.
Model version pricing: Different models have different per-token prices. The model_version on each message record ties to a pricing table. This is immutable — price changes only apply to new messages.
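The billing edge cases reduce to a small pure function. This is a sketch with made-up prices: the `PRICES` table, the `Usage` record, and `billable_usd` are hypothetical, and it assumes input tokens are billed whenever inference actually ran.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical prices in USD per million tokens: (input, output).
# In the design above these would come from the immutable pricing table.
PRICES = {"model-a": (Decimal("3.00"), Decimal("15.00"))}


@dataclass(frozen=True)
class Usage:
    model_version: str
    input_tokens: int
    output_tokens_generated: int   # tokens produced on GPU, not tokens delivered
    input_blocked: bool = False    # True when the input classifier rejected the message


def billable_usd(usage: Usage) -> Decimal:
    if usage.input_blocked:
        return Decimal("0")        # no inference ran, so nothing is charged
    in_price, out_price = PRICES[usage.model_version]
    total = usage.input_tokens * in_price + usage.output_tokens_generated * out_price
    return total / Decimal(1_000_000)


blocked = Usage("model-a", input_tokens=2_000, output_tokens_generated=0, input_blocked=True)
truncated = Usage("model-a", input_tokens=2_000, output_tokens_generated=120)
print(billable_usd(blocked), billable_usd(truncated))
```

Decimal avoids floating-point drift when many small charges are summed. Replayed tokens never reach this function because the count is fixed at generation time.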

Metering pipeline volume:
1.5B messages/day × 2 events (input + output count) = 3B metering events/day
Aggregated hourly: 50M users × 24 hours = 1.2B rows/day in time-series store
ClickHouse handles this trivially with columnar compression.

Trade-off: Exact token counting (tokenizer on every message) vs. estimation (approximate by character count / 4). Exact counting adds ~1ms latency and requires the tokenizer binary to be available on Chat Service nodes. I choose exact because billing disputes are expensive in support time and customer trust, and 1ms is invisible in the context of inference latency.

Q9: JAKE'S CONSISTENCY PROBE

"What's your consistency model? What guarantees do you actually provide to users?"

IC7 signal: "I'll be precise about what consistency level I'm choosing at each layer and why — because 'strong consistency' is not a useful answer without specifying the scope."

Consistency guarantees by layer:

Operation	Consistency	Implementation	Why
Write message	LOCAL_QUORUM	Cassandra: RF=3, write to 2/3 replicas	Durability without cross-DC latency
Read own messages (same user)	Read-after-write	LOCAL_QUORUM reads paired with LOCAL_QUORUM writes (R + W > RF); sticky sessions alone do not guarantee this	User must see their own message immediately
Read on other device	Eventual (~2s)	LOCAL_ONE read + device sync via Kafka consumer	Acceptable lag for multi-device; consistent within seconds
Conversation list	Read-after-write	PostgreSQL: reads served from the primary, with synchronous_commit on for durability	User creates conversation, must see it in list immediately
Search index	Eventual (~5s)	Kafka → Elasticsearch indexer consumer lag	Users accept search being slightly behind. No business-critical real-time search.
Token count / billing	Exact, eventually consistent	Kafka → ClickHouse aggregation, hourly flush	Must be accurate but not real-time. Billing is hourly.

The critical insight: In a 1:1 chat with an LLM, the strongest consistency requirement is read-your-own-writes for the active user on the active device. Everything else can be eventual. This is because the user is the only writer to their conversation (Claude's responses are system-generated through the same pipeline), and the most important UX guarantee is "I see what I just typed."

Cassandra consistency levels:

Writes: LOCAL_QUORUM (2 of 3 replicas in the local DC). This ensures durability without paying cross-DC latency. If we have a US-East and EU-West DC, a US user's write hits 2/3 US replicas synchronously and replicates to EU asynchronously.
Reads: LOCAL_QUORUM for the active session (guarantees read-after-write). LOCAL_ONE for history loads and non-active sessions (lower latency, acceptable staleness).

Trade-off: LOCAL_QUORUM writes cost ~2x the latency of LOCAL_ONE (must wait for 2 replicas instead of 1). At P99, this is ~5ms vs ~2ms. I accept this because losing a user's message after confirming delivery is unrecoverable from a trust perspective. 3ms is invisible against inference latency.
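The reason LOCAL_QUORUM reads see LOCAL_QUORUM writes is the replica-overlap rule: a read is guaranteed to hit at least one replica that acknowledged the write when read acks plus write acks exceed the replication factor. A minimal sketch, with hypothetical function names, applied to one datacenter's replicas:

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor: int,
                           write_acks: int, read_acks: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return write_acks + read_acks > replication_factor


RF = 3
q = quorum(RF)
print("quorum:", q)
print("QUORUM write + QUORUM read:", read_sees_latest_write(RF, q, q))
print("QUORUM write + ONE read:   ", read_sees_latest_write(RF, q, 1))
```

This is why the active session reads at LOCAL_QUORUM while history loads can drop to LOCAL_ONE: with RF=3, 2 + 1 does not exceed 3, so a LOCAL_ONE read may land on the replica that has not yet received the write.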

Q10

"How do you handle multi-device sync? User has the chat open on their phone and laptop simultaneously."

IC7 signal: "Multi-device sync in a chat system is often over-engineered. For a 1:1 LLM chat, the sync model is simpler than WhatsApp or iMessage because there's only one conversation partner (Claude), and responses are generated server-side."

Fig 7: Multi-Device Sync Flow

Phone (↓ SSE) · Laptop (↓ SSE)
Presence Service: Tracks: user_id → [device_1 (phone, SSE conn A), device_2 (laptop, SSE conn B)]
↓
Connection Manager: Maintains SSE connections per device. Fans out events to all active devices for a user.
↕ Kafka
Message Store · Inference Fleet

Fig 7: Both devices maintain independent SSE connections. The Connection Manager fans out message events and token streams to all active connections for the same user.

How sync works:

User sends from phone: POST goes to Chat Service → message written to Cassandra → event published to Kafka. The Connection Manager's Kafka consumer sees the event, looks up all active SSE connections for this user, and pushes the message to the laptop's SSE stream.
Claude responds: Token stream is fanned out to ALL active device connections simultaneously. Both phone and laptop see tokens appear at the same time.
Device comes online after being offline: On SSE reconnect, the client sends its Last-Event-ID. The Connection Manager replays missed events from Kafka (or a Redis catchup buffer if within 5 min, or queries Cassandra directly for older messages).
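The fan-out step can be sketched with an in-memory registry. This is an illustration only: the `ConnectionManager` class below is hypothetical, plain lists stand in for SSE connections, and the Kafka consumer that would call `fan_out` is omitted.

```python
from collections import defaultdict


class ConnectionManager:
    """Tracks open connections per user and pushes events to each of them."""

    def __init__(self):
        # user_id -> {device_id: outbox}; an outbox stands in for an SSE stream.
        self._connections = defaultdict(dict)

    def register(self, user_id: str, device_id: str) -> list:
        outbox = []
        self._connections[user_id][device_id] = outbox
        return outbox

    def unregister(self, user_id: str, device_id: str) -> None:
        self._connections[user_id].pop(device_id, None)

    def fan_out(self, user_id: str, event: dict, exclude_device=None) -> int:
        """Deliver an event to every active device; returns the delivery count."""
        delivered = 0
        for device_id, outbox in self._connections.get(user_id, {}).items():
            if device_id == exclude_device:
                continue   # the sending device already has its own message
            outbox.append(event)
            delivered += 1
        return delivered


cm = ConnectionManager()
phone = cm.register("u1", "phone")
laptop = cm.register("u1", "laptop")

# User message sent from the phone: only the laptop needs to be told.
sent = cm.fan_out("u1", {"type": "user_message", "text": "hi"}, exclude_device="phone")
# Claude's tokens go to every active device.
streamed = cm.fan_out("u1", {"type": "token", "text": "Hello"})
print(sent, streamed, len(phone), len(laptop))
```

Excluding the originating device for user messages avoids echoing a message back to the screen that just sent it; token events have no originating device, so they go everywhere.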

Conflict resolution (user types on both devices):

Optimistic concurrency with the per-conversation version counter from Q4 (the first writer wins the version bump, and nothing is overwritten). If both devices send a message simultaneously, one gets the version bump, the other gets a conflict error and retries.
The UX: show "sending..." on both, but only one actually gets queued for inference. The other device's message is requeued as the next message in the conversation.

Trade-off: Fan-out via Connection Manager + Kafka vs. polling. Polling at 1-second intervals for 5M concurrent users = 5M QPS of mostly empty reads. Push via SSE fan-out costs one Kafka consumer group and Connection Manager memory (~1 KB per SSE connection × 2 devices × 5M users = 10 GB), and does work only for the ~17K message events/sec that actually occur. That is roughly 300x fewer operations, about two orders of magnitude in favor of push.

Section 2: Inference API (Q1)

Interviewers: Anton Paquin & Bharat Mediratta (Google Distinguished Engineer / GWS, ex-CTO Dropbox, Coinbase Fellow)
Friday May 22, 2:00 – 2:55 PM PT

Q1: BHARAT WILL OPEN WITH THIS

"Design an inference API for Claude — external developers send requests, get model responses back."

IC7 opening (first 30 seconds): "I'll design the public-facing API that serves Claude to external developers. This is Anthropic's revenue engine — every dollar flows through this API. I'll scope to: multi-model serving (Haiku, Sonnet, Opus), synchronous + streaming responses, multi-tenant isolation with per-org rate limiting, and a target of P99 time-to-first-token under 800ms for Sonnet. I'll explicitly call out where safety constraints shape the architecture differently from a generic serving system."

Requirements I'm driving:

Functional: /v1/messages endpoint (sync + streaming), /v1/batch for async bulk jobs, model selection by name+version, system prompts, multi-turn conversations, tool use / function calling, token counting in response
Non-functional: P99 TTFT < 800ms (Sonnet), P99 per-token latency < 30ms, availability 99.95% (~22 min downtime/month budget), throughput 500K requests/min peak, multi-region (US, EU, APAC)
Safety: Constitutional Classifier on every request. Content filtering before AND after inference. Per-org audit logs. Compliance with ASL-3 for frontier models.
Business: Per-org API key auth, usage-based billing (input/output tokens), rate limiting at org and model tier, cost transparency

Back-of-envelope:

500K requests/min = 8,333 req/sec
Avg request: 2,000 input tokens + 500 output tokens
Sonnet throughput: ~150 tokens/sec/GPU (decode phase)
Time per request (decode): 500 tokens / 150 tokens/sec = 3.3 sec
GPU-seconds per request: 3.3 sec
Concurrent requests: 8,333 × 3.3 = ~27,500 concurrent GPU-bound requests
With continuous batching (batch size ~32): 27,500 / 32 = ~860 GPUs for Sonnet alone
At $2/hr/GPU (H100): $1,720/hr = $15M/year GPU cost for Sonnet at peak
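Revenue at $3/$15 per M input/output tokens: $0.0135/request × 8,333 req/sec ≈ $3.5B/year if peak load were sustained all year (about $1.5B/year at roughly 40% average utilization)
Margin on raw GPU rental: ~99%. This ignores prefill compute, multi-GPU model sharding, redundancy, and idle headroom, but GPU cost stays manageable even with a large multiplier. The bottleneck is GPU availability, not cost.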
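The capacity estimate can be recomputed from its inputs. This sketch uses only the figures stated above and carries full precision, so intermediate values differ slightly from the hand-rounded ones (for example ~868 GPUs against ~860). The revenue line assumes peak load is sustained all year, which gives roughly $3.5B; a lower annual figure corresponds to average load below peak.

```python
PEAK_REQ_PER_MIN = 500_000
INPUT_TOKENS, OUTPUT_TOKENS = 2_000, 500
DECODE_TOKENS_PER_SEC = 150          # per request stream, decode phase
BATCH_SIZE = 32                      # concurrent requests per GPU with continuous batching
GPU_USD_PER_HOUR = 2.0
PRICE_IN, PRICE_OUT = 3.0, 15.0      # USD per million input / output tokens
HOURS_PER_YEAR = 8_760

req_per_sec = PEAK_REQ_PER_MIN / 60
decode_seconds = OUTPUT_TOKENS / DECODE_TOKENS_PER_SEC
concurrent_requests = req_per_sec * decode_seconds       # arrival rate × time in system
gpus = concurrent_requests / BATCH_SIZE
gpu_usd_per_year = gpus * GPU_USD_PER_HOUR * HOURS_PER_YEAR

usd_per_request = (INPUT_TOKENS * PRICE_IN + OUTPUT_TOKENS * PRICE_OUT) / 1e6
revenue_usd_per_year = usd_per_request * req_per_sec * 3_600 * HOURS_PER_YEAR

print(f"{req_per_sec:,.0f} req/sec, {decode_seconds:.2f} s decode per request")
print(f"{concurrent_requests:,.0f} concurrent requests -> {gpus:,.0f} GPUs")
print(f"GPU cost ${gpu_usd_per_year / 1e6:,.1f}M/year")
print(f"${usd_per_request:.4f}/request -> ${revenue_usd_per_year / 1e9:.2f}B/year at sustained peak")
```

The concurrency line is Little's law: requests in flight equal arrival rate times time in system. Prefill time is left out, so the GPU count is a lower bound.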

Fig 8: Full Inference API Architecture

Request path: Developer Client (SDK / cURL) → Edge / CDN (Cloudflare: DDoS, TLS) → API Gateway (Auth · Rate Limit · Route) → Safety Pipeline (Constitutional Classifier) → Request Router (Model → Cluster mapping) → Tenant Scheduler (Weighted fair queue) → Inference Workers (GPU fleet, H100s) → Output Classifier (Token + completion check) → Streaming Proxy (SSE to client)
State stores: Model Registry (Versions, weights URIs) · KV Cache Store (Prompt prefix cache) · Rate Limit Store (Redis: token buckets) · Metering Store (Kafka → ClickHouse) · Audit Log (Kafka → S3, immutable)

Fig 8: Multi-layer architecture. Safety pipeline is inline (not sidecar). Tenant scheduler enforces fairness before GPU allocation. All state stores are purpose-built.

Key architectural decisions:

Separate tenant scheduler before GPU allocation: This is not just rate limiting at the gateway — it's a weighted fair queue that ensures org A's 10K req/min burst doesn't starve org B's 100 req/min. This is the most important component Bharat will probe.
Model Registry as source of truth: Immutable model versions. Each version has a unique ID (e.g., claude-4-sonnet-20260501), a weight file URI, a configuration hash, and a safety classifier version. Rollbacks are instant: just update the routing table to point to the previous version.
Streaming Proxy as a separate component: Decouples long-lived SSE connections from inference workers. Workers generate tokens and push to an internal queue; the Streaming Proxy holds the client connection and drains the queue. This means we can scale workers and connection-holders independently.

Trade-off: Separate Streaming Proxy adds a hop (~1ms) but lets us scale inference workers independently from connection management. At 27K concurrent streams, holding connections on inference workers would waste GPU-attached memory on TCP buffers. The proxy runs on cheap CPU nodes.

Q2: BHARAT'S API DESIGN PROBE

"Design the API contract. Show me the endpoints, request/response schemas."

IC7 signal: "I'll design the external API with the same rigor as a public specification — versioned, idempotent, with explicit error contracts. Bharat built GWS; he knows what a production API contract looks like."

// Primary endpoint — synchronous + streaming
POST /v1/messages

// Request
{
  "model": "claude-4-sonnet-20260501",
  "max_tokens": 1024,
  "messages": [
    {"role": "user", "content": "Explain quantum computing"}
  ],
  "system": "You are a physics tutor.",
  "stream": true,
  "temperature": 0.7,
  "metadata": {"user_id": "usr_abc"} // opaque, for billing attribution
}

// Headers
Authorization: Bearer sk-ant-...
X-Request-ID: req_idempotency_key_123 // idempotency
Anthropic-Version: 2026-05-01 // API version pinning

// Streaming response (SSE)
event: message_start
data: {"id": "msg_01X...", "model": "claude-4-sonnet-20260501", "usage": {"input_tokens": 42}}

event: content_block_delta
data: {"type": "text_delta", "text": "Quantum computing uses"}

event: content_block_delta
data: {"type": "text_delta", "text": " qubits instead of"}

event: message_stop
data: {"usage": {"output_tokens": 347}, "stop_reason": "end_turn"}

// Batch endpoint — async
POST /v1/messages/batches

// Request: array of message requests with custom_ids
{"requests": [{"custom_id": "req-1", "params": {...same as /v1/messages...}}]}

// Response: job handle
{"id": "batch_01X...", "status": "in_progress", "request_counts": {"total": 100}}

// Poll
GET /v1/messages/batches/batch_01X...

// Error schema (ALL endpoints)
{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Rate limit exceeded for org_abc"
  },
  "retry_after": 2.5 // seconds; clients should back off for at least this long
}

API versioning strategy:

Header-based versioning (Anthropic-Version: 2026-05-01), not URL-based. This allows us to make backward-compatible changes without forcing URL migrations. Breaking changes require a new version date.
Model version is explicit: Clients pin to claude-4-sonnet-20260501. The alias claude-4-sonnet (no date) resolves to the latest — but we log which exact version was used in the response, so clients can reproduce results.

Idempotency:

X-Request-ID is the idempotency key. If we receive a duplicate within 24 hours, we return the cached response without re-running inference. The dedup store is Redis with 24h TTL, keyed by {org_id}:{request_id}.
This is critical for retries after network timeouts — the client might not know if the request was processed. Without idempotency, a retry would generate a different response (temperature > 0) and double-charge the org.
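
A minimal sketch of the dedup path described above, with an in-memory dict standing in for Redis. The names IdempotencyStore and handle_request are hypothetical, and the injected clock exists only so expiry can be exercised without waiting 24 hours. It does not cover two duplicates arriving concurrently; in Redis that would need an atomic set-if-absent (for example SET with NX) before inference starts.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class IdempotencyStore:
    """In-memory stand-in for the Redis dedup store (hypothetical class)."""

    def __init__(self, ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    @staticmethod
    def key(org_id: str, request_id: str) -> str:
        # Same key shape as the text: {org_id}:{request_id}
        return f"{org_id}:{request_id}"

    def get(self, org_id: str, request_id: str) -> Optional[dict]:
        k = self.key(org_id, request_id)
        entry = self._entries.get(k)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._entries[k]  # expired: behave as if never seen
            return None
        return response

    def put(self, org_id: str, request_id: str, response: dict) -> None:
        k = self.key(org_id, request_id)
        self._entries[k] = (self._clock() + self._ttl, response)


def handle_request(store: IdempotencyStore, org_id: str, request_id: str,
                   run_inference: Callable[[], dict]) -> dict:
    """Return the cached response for a duplicate; otherwise run inference once."""
    cached = store.get(org_id, request_id)
    if cached is not None:
        return cached
    response = run_inference()
    store.put(org_id, request_id, response)
    return response
```

Scoping the key by org_id means two orgs can reuse the same X-Request-ID without colliding.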

Trade-off: Header-based versioning vs. URL-based (/v1/ vs /v2/). Header-based is cleaner for evolution (new fields, new optional params) but harder for developers to discover (they need to read docs, not just change the URL). I choose header-based because our API surface changes frequently (new model params, new tool types) and URL-based would create a combinatorial explosion of versions.

Q3BHARAT'S FAIRNESS PROBE

"How do you handle multi-tenant isolation so one customer can't starve others?"

IC7 signal: "This is the hardest problem in a multi-tenant serving system, and it's one Bharat lived at Google. The naive approach — per-org rate limits at the gateway — doesn't work because the bottleneck isn't HTTP requests, it's GPU-seconds. I need to solve fairness at the GPU scheduling layer."

Fig 9: Multi-Tenant Queuing Architecture

Org A (Enterprise, 50K RPM limit) → API Gateway (token bucket per org) → Tenant Scheduler (Weighted Fair Queue, WFQ) → GPU Pool (workers)
Org B (Startup, 1K RPM limit) → same gateway (lower token bucket) → same scheduler (lower weight, still fair) → same GPUs (shared pool)

Layer 1: Admission Control (gateway). Reject if over RPM limit.
Layer 2: GPU Fairness (scheduler). Weight = spend tier.

Fig 9: Three-layer isolation. Layer 1 (gateway) rejects over-limit requests. Layer 2 (scheduler) ensures GPU time is allocated fairly. Layer 3 (worker) enforces memory/compute isolation per request.

Three layers of isolation:

Layer 1 — Admission Control (API Gateway):

Token bucket rate limiter per org, per model, stored in Redis. Three dimensions: requests/min, input tokens/min, output tokens/min.
Returns 429 Too Many Requests with a Retry-After header (the same value appears as retry_after in the error body). This is the coarse filter: it keeps volume within contracted limits.
But this alone doesn't prevent starvation. An enterprise org with 50K RPM limit can still monopolize GPUs if their requests are all Opus (10x more GPU time than Haiku).

Layer 2 — Weighted Fair Queuing (Tenant Scheduler):

This is the critical layer. Each org gets a virtual queue. The scheduler dequeues using Weighted Fair Queuing (WFQ): weight = org's contracted spend tier.
When GPUs become available, the scheduler picks the next request from the org with the highest deficit (most underserved relative to its weight). This ensures even during overload, every org gets GPU time proportional to their tier.
Concurrency cap per org: In addition to WFQ, each org has a max concurrent inference requests (e.g., Enterprise = 200, Startup = 10). This prevents one org from consuming all GPU memory with long-running Opus requests.
Queue depth limit: If an org's queue exceeds 1000 pending requests, new requests are rejected with 529 (overloaded). This prevents unbounded memory growth.
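
One way to sketch the scheduler's pick rule, assuming "deficit" means GPU-seconds dispatched divided by weight, so the lowest ratio is the most underserved org. FairScheduler, OrgState, and the per-request GPU-second estimate are hypothetical names; a production WFQ would also reset credit for idle orgs so they cannot bank an unlimited backlog of priority.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Optional, Tuple


@dataclass
class OrgState:
    weight: float                 # contracted spend tier
    max_concurrent: int           # per-org concurrency cap
    queue: Deque[Tuple[str, float]] = field(default_factory=deque)
    served: float = 0.0           # GPU-seconds dispatched so far
    in_flight: int = 0


class FairScheduler:
    def __init__(self, max_queue_depth: int = 1000) -> None:
        self.orgs: Dict[str, OrgState] = {}
        self.max_queue_depth = max_queue_depth

    def add_org(self, org_id: str, weight: float, max_concurrent: int) -> None:
        self.orgs[org_id] = OrgState(weight=weight, max_concurrent=max_concurrent)

    def enqueue(self, org_id: str, request_id: str, est_gpu_seconds: float) -> bool:
        """Return False when the org's queue is full (caller maps this to 529)."""
        org = self.orgs[org_id]
        if len(org.queue) >= self.max_queue_depth:
            return False
        org.queue.append((request_id, est_gpu_seconds))
        return True

    def dequeue(self) -> Optional[Tuple[str, str]]:
        """Pick the eligible org with the least service relative to its weight."""
        eligible = [
            (org.served / org.weight, org_id)
            for org_id, org in self.orgs.items()
            if org.queue and org.in_flight < org.max_concurrent
        ]
        if not eligible:
            return None
        _, org_id = min(eligible)
        org = self.orgs[org_id]
        request_id, cost = org.queue.popleft()
        org.served += cost
        org.in_flight += 1
        return org_id, request_id

    def complete(self, org_id: str) -> None:
        self.orgs[org_id].in_flight -= 1
```

With weights 3 and 1 and equal-cost requests, the dispatch ratio settles at roughly 3:1, while the concurrency cap and queue depth limit act as independent guards.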

Layer 3 — Worker-Level Isolation:

Each inference request runs in an isolated memory space on the GPU. KV cache allocation is bounded per request (max_tokens controls this). No request can OOM the worker by consuming another request's KV cache.
Request-level timeout: if inference exceeds 120s, the request is killed and the GPU memory is reclaimed.

Trade-off: Weighted Fair Queuing adds ~5ms of scheduling overhead per request (queue lookup + weight calculation + dequeue). For a 3-second inference request, this is 0.17% overhead. The alternative — FIFO queue with no fairness — means a burst from one enterprise customer can add 10+ seconds of queue time for every other customer. The 5ms is non-negotiable for multi-tenant trust.

Q4BHARAT'S CAPACITY PROBE

"Walk me through capacity planning. How many GPUs do you need, and how do you decide?"

IC7 signal: "I'll use Little's Law to derive GPU count from SLO requirements, not just throughput. The constraint isn't 'can we serve all requests' — it's 'can we serve them within our latency SLO while maintaining headroom for bursts.'"

Little's Law: L = λ × W

Given:
λ (arrival rate) = 8,333 req/sec (500K RPM)
W (avg service time) = 3.3 sec (500 output tokens / 150 tokens/sec for Sonnet)
L (concurrent requests) = 8,333 × 3.3 = 27,500 concurrent requests

With continuous batching:
Batch size per GPU: ~32 requests (memory-limited by KV cache, 80GB H100)
GPUs needed: 27,500 / 32 = 860 GPUs for steady-state Sonnet

But we need headroom:
Target utilization: 70% (leave 30% for bursts and degradation)
GPUs provisioned: 860 / 0.7 = 1,229 GPUs

Per model tier:
Opus: ~40 tokens/sec/GPU, batch 8 → more GPUs/req → 3x cost
Haiku: ~500 tokens/sec/GPU, batch 64 → very efficient → 0.3x cost

Mixed fleet estimate:
Sonnet (60% traffic): 1,229 GPUs (upper bound: the Little's Law sizing above charges all 500K RPM to Sonnet)
Opus (10% traffic): 600 GPUs (lower throughput per GPU)
Haiku (30% traffic): 200 GPUs (higher throughput per GPU)
Total: ~2,030 H100 GPUs
At $2/hr/GPU: $4,060/hr ≈ $35.6M/year (8,760 hours)
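
The Sonnet sizing above can be reproduced in a few lines. This is a sketch using the document's rounded inputs (8,333 req/sec, 3.3 sec); size_fleet and FleetSize are hypothetical names. Unrounded inputs (500,000/60 and 500/150) land slightly higher, at roughly 1,240 provisioned GPUs.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FleetSize:
    concurrent_requests: float
    steady_state_gpus: int
    provisioned_gpus: int


def size_fleet(arrival_rate_per_sec: float, service_time_sec: float,
               batch_per_gpu: int, target_utilization: float) -> FleetSize:
    """Little's Law (L = lambda * W), then divide by batch size and utilization."""
    concurrent = arrival_rate_per_sec * service_time_sec
    steady = math.ceil(concurrent / batch_per_gpu)
    provisioned = math.ceil(steady / target_utilization)
    return FleetSize(concurrent, steady, provisioned)


sonnet = size_fleet(arrival_rate_per_sec=8333, service_time_sec=3.3,
                    batch_per_gpu=32, target_utilization=0.70)
print(sonnet)
```

Keeping batch size and target utilization as explicit parameters makes it easy to show the interviewer how sensitive the GPU count is to each assumption.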

Capacity planning methodology (what Bharat will want to hear):

SLO-driven, not throughput-driven: I don't size for "can we serve 500K RPM." I size for "can we serve 500K RPM with P99 TTFT < 800ms." These are different numbers because P99 requires headroom for queuing delay.
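Queuing theory: Queue wait grows roughly as 1/(1-ρ) with utilization ρ. Using M/M/1 as a single-server approximation, P99 queue wait ≈ average service time × ln(100ρ)/(1-ρ): about 14× service time at ρ = 0.7 versus about 45× at ρ = 0.9. A pooled multi-GPU fleet queues far less in absolute terms, but the blow-up near saturation has the same shape. This is why the target is 70%, not 90%.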
Regional distribution: US-East (50%), EU-West (30%), APAC (20%). Each region is self-sufficient — no cross-region inference. Regional isolation means a US outage doesn't cascade to EU.
Autoscaling signal: Estimated queue wait, computed as queue depth × avg service time ÷ concurrent serving slots. Scale up at > 2s estimated wait, scale down at < 0.5s. Use predictive scaling for known patterns (weekday 9am traffic ramp).
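
To put numbers on the utilization argument, here is a sketch under a single-server M/M/1 assumption, where the queue-wait tail is P(wait > t) = ρ·exp(-(1-ρ)·t/S). That is a deliberate simplification: a pooled GPU fleet is multi-server and waits far less in absolute terms, so treat the output as the shape of the curve and not as a latency prediction. The function name is hypothetical.

```python
import math


def mm1_wait_quantile(service_time_sec: float, utilization: float,
                      quantile: float = 0.99) -> float:
    """Queue-wait quantile for an M/M/1 queue, in seconds."""
    if not 0.0 < utilization < 1.0:
        raise ValueError("utilization must be strictly between 0 and 1")
    tail = 1.0 - quantile
    if utilization <= tail:
        return 0.0  # fewer than `tail` of arrivals wait at all
    return service_time_sec * math.log(utilization / tail) / (1.0 - utilization)


for rho in (0.5, 0.7, 0.9, 0.95):
    multiple = mm1_wait_quantile(1.0, rho)
    print(f"utilization {rho:.0%}: P99 wait is about {multiple:.1f}x service time")
```

Passing a service time of 1.0 returns the wait as a multiple of service time, which is the form the comparison between 70% and 90% needs.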

Trade-off: 70% target utilization means 30% of GPU spend (~$10.7M/year) is "wasted" as headroom. But at 90% utilization, P99 latency doubles due to queuing. Given that every ms of TTFT directly impacts developer satisfaction and API adoption, I'd rather over-provision GPUs than lose customers to latency. GPU cost is predictable; customer churn is not.

"What happens during a GPU OOM during inference? Walk me through the failure and recovery."

IC7 signal: "GPU OOM is the most common inference failure mode, and it's fundamentally different from CPU OOM because GPU memory is shared across batched requests. A single OOM can kill an entire batch."

Fig 10: GPU OOM Recovery Flow

GPU Worker: running batch of 32 requests
↓ OOM during KV cache allocation for request #33

Step 1: Detect. CUDA OOM exception caught by inference engine. Worker marks itself DEGRADED.
Step 2: Triage. Reject the new request (#33) with 529 Overloaded. Existing batch continues.
Step 3: Evict. Preempt lowest-priority request in batch. Free its KV cache. Requeue it.
Step 4: Recover. Worker processes remaining batch. Marks itself HEALTHY when memory drops below 85%.
Step 5: Requeue. Evicted + rejected requests return to Tenant Scheduler. Routed to different worker.
Step 6: Alert. If OOM rate > 1%/min, fire PagerDuty alert. Likely need more GPUs.

Fig 10: OOM is contained to the worker. No cascading failure. Rejected requests are requeued, not dropped. The preemption policy is priority-based (free-tier users preempted before enterprise).

Prevention mechanisms (most OOMs should never happen):

Admission control on KV cache: Before adding a request to a batch, the worker estimates KV cache memory: KV_size = 2 × num_layers × hidden_dim × (input_tokens + max_tokens) × sizeof(float16). If estimated total exceeds 85% of GPU memory, reject admission to this batch.
Max concurrent batch size: Hard cap based on model size. For Sonnet on H100 (80GB): model weights ~30GB, leaving ~50GB for KV cache. At ~1 MB of KV cache per token (~1 GB per 1K-token request), that's ~50 concurrent requests with 1K context each. Dynamic cap adjusts based on actual context lengths.
Memory monitoring: Worker reports GPU memory utilization every 100ms to the scheduler. Scheduler reduces routing weight to workers above 80% utilization.
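
The admission check can be sketched directly from the KV_size formula above. The layer count and hidden dimension below are illustrative placeholders chosen to give 1 MiB per token, not real model parameters, and ModelShape and can_admit are hypothetical names. "Estimated total" is read here as weights plus all committed KV cache plus the new request.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    num_layers: int
    hidden_dim: int
    bytes_per_value: int = 2  # float16


def kv_cache_bytes(shape: ModelShape, input_tokens: int, max_tokens: int) -> int:
    """KV_size = 2 (K and V) x layers x hidden_dim x tokens x sizeof(float16)."""
    tokens = input_tokens + max_tokens
    return 2 * shape.num_layers * shape.hidden_dim * tokens * shape.bytes_per_value


def can_admit(shape: ModelShape, gpu_bytes: int, weights_bytes: int,
              committed_kv_bytes: int, input_tokens: int, max_tokens: int,
              limit: float = 0.85) -> bool:
    """Admit only if projected memory stays at or below `limit` of the GPU."""
    projected = (weights_bytes + committed_kv_bytes
                 + kv_cache_bytes(shape, input_tokens, max_tokens))
    return projected <= limit * gpu_bytes


shape = ModelShape(num_layers=64, hidden_dim=4096)  # placeholder shape
print(kv_cache_bytes(shape, input_tokens=512, max_tokens=512) / GIB, "GiB")
```

Because the estimate uses max_tokens as an upper bound, it is pessimistic for requests that stop early; the dynamic cap in the list above is what recovers that slack.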

Trade-off: Preemption (evicting a running request) vs. rejection-only (just reject the new request). Preemption is more complex (need to save partial state, requeue cleanly) but prevents head-of-line blocking where long-running Opus requests monopolize GPU memory and prevent shorter Haiku requests from starting. I implement preemption with a priority order: free-tier → lowest-tier paid → highest-tier paid → internal.

"How do you do safe model rollouts? A new model version has a subtle quality regression."

IC7 signal: "Model rollouts are fundamentally harder than code rollouts because regressions are subtle — a model might generate slightly worse responses without any errors or latency changes. I need both automated and human-in-the-loop validation."

Fig 11: Model Rollout — Canary Strategy

Stage	What Happens
1. Shadow (Day 1-2)	New model receives 0% live traffic. 100% of requests are duplicated to new model in shadow mode. Responses are logged but NOT served to users. Compare: latency distribution, token count distribution, safety classifier pass rate, eval harness scores. Automated gate: must match or exceed baseline on all metrics.
2. Canary (Day 3-4)	Route 1% of live traffic to new model. Real users, real responses. Monitor: error rate delta, TTFT delta, user-reported thumbs-down rate, safety filter trigger rate. Automated gate: no metric regression > 2% relative. Human review of 100 random responses.
3. Ramp (Day 5-7)	Ramp: 1% → 5% → 25% → 50% → 100%. Each step holds for 12-24 hours. Automated rollback trigger: if any step shows error rate > baseline + 0.1%, latency P99 > baseline + 10%, or safety trigger rate > baseline + 0.5%, automatically roll back to previous version.
Rollback (instant)	Update routing table: model_alias → previous_version_id. Takes effect in < 10 seconds (next routing table refresh). No GPU restart needed: both model versions are loaded in parallel during rollout. Old version workers are kept warm for 72 hours post-completion.

Fig 11: Four-stage rollout with automated gates at each stage. Shadow testing catches obvious regressions. Canary catches subtle quality issues. Rollback is instant because both versions remain loaded.
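
The ramp-stage rollback trigger can be expressed as a small gate function. This sketch assumes the error-rate and safety thresholds are absolute percentage points and the latency threshold is relative, which the table does not state explicitly. StageMetrics, rollback_reasons, and next_traffic_fraction are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

RAMP_STEPS = [0.01, 0.05, 0.25, 0.50, 1.00]


@dataclass(frozen=True)
class StageMetrics:
    error_rate: float           # fraction of requests, 0.002 means 0.2%
    p99_latency_ms: float
    safety_trigger_rate: float  # fraction of responses


def rollback_reasons(baseline: StageMetrics, candidate: StageMetrics) -> List[str]:
    """Return the names of every threshold the candidate breaches."""
    reasons = []
    if candidate.error_rate > baseline.error_rate + 0.001:
        reasons.append("error_rate")
    if candidate.p99_latency_ms > baseline.p99_latency_ms * 1.10:
        reasons.append("p99_latency")
    if candidate.safety_trigger_rate > baseline.safety_trigger_rate + 0.005:
        reasons.append("safety_trigger_rate")
    return reasons


def next_traffic_fraction(current: float, baseline: StageMetrics,
                          candidate: StageMetrics) -> float:
    """Advance one ramp step, or drop to 0.0 (roll back) on any breach."""
    if rollback_reasons(baseline, candidate):
        return 0.0
    index = RAMP_STEPS.index(current)
    return RAMP_STEPS[min(index + 1, len(RAMP_STEPS) - 1)]
```

Returning the list of breached thresholds, instead of a bare boolean, gives the on-call engineer the reason for the rollback in the same log line.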

The hard part — detecting quality regression:

Latency and error rate are easy. Quality regression (model gives slightly worse answers) is hard because there's no objective metric.
Eval harness: Run a fixed eval set (1000 prompts with known-good responses) against the new model. Score using automated judges (another Claude instance scores quality 1-5). Flag if mean score drops > 0.2.
Thumbs-down rate: During canary, compare user thumbs-down rate between old and new. Statistically significant increase (p < 0.05) triggers rollback.
Safety regression: If the safety classifier fires more often on the new model's output, that's a signal the model is generating more borderline content — even if none is actually served to users.
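
For the thumbs-down comparison, one reasonable choice (an assumption; the text only says "statistically significant") is a one-sided two-proportion z-test. The sketch below uses only the standard library, and the function name is hypothetical.

```python
import math


def thumbs_down_p_value(old_down: int, old_total: int,
                        new_down: int, new_total: int) -> float:
    """One-sided p-value that the new model's thumbs-down rate is higher."""
    p_old = old_down / old_total
    p_new = new_down / new_total
    pooled = (old_down + new_down) / (old_total + new_total)
    std_err = math.sqrt(pooled * (1.0 - pooled)
                        * (1.0 / old_total + 1.0 / new_total))
    if std_err == 0.0:
        return 1.0  # no variation at all: nothing to detect
    z = (p_new - p_old) / std_err
    # Upper-tail probability of a standard normal: P(Z >= z)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# 2% vs 3% thumbs-down with 1,000 canary ratings: significant at 0.05
print(thumbs_down_p_value(2000, 100_000, 30, 1000))
# Same rates with only 100 canary ratings: not significant
print(thumbs_down_p_value(2000, 100_000, 3, 100))
```

The second call is the practical caveat: at 1% canary traffic, the gate needs enough rated responses before a 1-point increase becomes detectable.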

Safety: Model rollouts for frontier models (ASL-3+) require a mandatory safety evaluation before any live traffic. This includes red-team testing for novel attack vectors, eval on the safety benchmark suite, and sign-off from the safety team. No automated override — a human must approve progression from shadow to canary for ASL-3 models.

Trade-off: Parallel loading (both versions on GPUs simultaneously) wastes GPU memory during rollout — potentially 2x model weight memory for 7 days. For a 30GB Sonnet model across 1,200 GPUs, that's 36TB of extra GPU memory. I accept this because instant rollback is worth more than the GPU cost. The alternative (unload old model, load new model) takes ~60 seconds per worker and means rollback is also 60 seconds — during which users see degraded quality.

"Design the rate limiting system — per-org, per-model, global."

IC7 signal: "Rate limiting at an inference API is multi-dimensional. I can't just count requests — I need to count tokens, because a 100K-token Opus request consumes 1000x more resources than a 100-token Haiku request."

Fig 12: Three-Tier Rate Limiting Architecture

Tier 1: Global

Global Rate Limiter
Protect system capacity
500K RPM total across all orgs
Token bucket in Redis Cluster

Tier 2: Per-Org

Org Rate Limiter
Enforce contracted limits
3 dimensions: RPM, input TPM, output TPM
Sliding window in Redis

Tier 3: Per-Model

Model Rate Limiter
Protect scarce models
Opus: lower limits (expensive GPUs)
Haiku: higher limits (cheap GPUs)

Redis Cluster
Shared state. Each tier is a separate key namespace. Lua script for atomic check-and-decrement.
ratelimit:{org_id}:{model}:{dimension} → {tokens_remaining, window_start}

Fig 12: Request must pass all three tiers. Global protects system capacity, per-org enforces contracts, per-model protects scarce GPU resources.

Implementation — Sliding Window + Token Bucket hybrid:

Request rate: Sliding window counter (Redis ZSET with timestamps). Counts requests in the last 60 seconds. Simple, accurate, no burst problem.
Token rate: Token bucket (Redis key with counter + refill timestamp). Refills at contracted TPM / 60 per second. This allows short bursts (up to bucket capacity) while enforcing sustained rate.
Pre-check vs. post-check: Input tokens are known before inference (we run the tokenizer). Output tokens are estimated (max_tokens as upper bound). We deduct the estimate from the token bucket, then credit back the difference after inference completes. This prevents over-admission while avoiding over-charging.
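
The deduct-then-credit flow can be sketched with a single in-process bucket. In the design above this state lives in Redis behind an atomic Lua script and there are separate input and output buckets; this sketch collapses that to one bucket with explicit timestamps so the arithmetic is easy to follow. TokenBucket and its method names are hypothetical.

```python
class TokenBucket:
    def __init__(self, tokens_per_minute: float, capacity: float,
                 now: float = 0.0) -> None:
        self.refill_per_sec = tokens_per_minute / 60.0
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.updated_at = now

    def _refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now

    def available(self, now: float) -> float:
        self._refill(now)
        return self.tokens

    def reserve(self, estimated_tokens: int, now: float) -> bool:
        """Pre-check: deduct the upper-bound estimate (e.g. max_tokens)."""
        self._refill(now)
        if estimated_tokens > self.tokens:
            return False
        self.tokens -= estimated_tokens
        return True

    def settle(self, estimated_tokens: int, actual_tokens: int,
               now: float) -> None:
        """Post-check: credit back whatever the request did not use."""
        self._refill(now)
        refund = max(0, estimated_tokens - actual_tokens)
        self.tokens = min(self.capacity, self.tokens + refund)
```

For example, a request reserved at max_tokens = 1024 that stops after 347 output tokens gets 677 tokens credited back on settle.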

Rate limit headers in every response:

x-ratelimit-limit-requests: 10000
x-ratelimit-remaining-requests: 9547
x-ratelimit-limit-tokens: 1000000
x-ratelimit-remaining-tokens: 847231
x-ratelimit-reset-requests: 2026-05-22T12:01:00Z
x-ratelimit-reset-tokens: 2026-05-22T12:00:32Z

Trade-off: Redis for rate limit state means a Redis failure could either (a) block all requests (fail-closed) or (b) allow unlimited requests (fail-open). I choose fail-open with local fallback: if Redis is unreachable, each API gateway node enforces a conservative local rate limit (1/N of the global limit, where N = number of gateway nodes). This prevents total outage while still providing some protection. Redis is deployed as a 6-node cluster with automatic failover — Redis downtime should be < 10 seconds.

Q8DEEP TECHNICAL PROBE

"How does continuous batching work? Walk me through prefill vs. decode phases."

IC7 signal: "Continuous batching is the single biggest throughput optimization in LLM serving, and understanding the prefill/decode asymmetry is essential to capacity planning. Let me walk through the GPU timeline."

Fig 13: Continuous Batching — GPU Timeline

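[Timeline grid not recoverable from extraction. Rows: Req A, Req B, Req C, Req D. Horizontal axis: time steps (through t11). Each row ends in a completed decode step (D✓).]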

P = Prefill (compute-bound) D = Decode (memory-bound) ✓ = Complete

Fig 13: Continuous batching. New requests (C, D) join the batch as old requests (A) complete. The GPU is never idle. Prefill steps (P) are interleaved with decode steps (D).

Prefill vs. Decode — fundamentally different GPU workloads:

Property	Prefill	Decode
What it does	Process all input tokens in parallel. Build KV cache.	Generate one output token per step. Read KV cache.
GPU bottleneck	Compute-bound (matrix multiplies on all input tokens)	Memory-bandwidth-bound (read model weights + KV cache per token)
Latency	~100-500ms for 2K tokens (depends on model)	~6-20ms per token
Batching benefit	High — amortizes weight loading across inputs	Very high — many decode requests share one weight read
Memory usage	Spikes during attention computation	Grows linearly with generated tokens (KV cache)

Key insight for continuous batching: Since decode is memory-bandwidth-bound, adding more requests to a decode batch costs little additional time: the GPU reads the model weights once per step regardless of batch size, so the marginal cost is mostly each request's KV cache read, until the batch grows large enough to become compute-bound. This is why continuous batching achieves 10-30x better throughput than static batching: the GPU never waits for a full batch to form, and new requests slot into the decode batch immediately after their prefill.

Scheduling optimization — chunked prefill:

A long prefill (e.g., 100K tokens) can block decode steps for seconds, stalling token streaming for every other request in the batch and delaying TTFT for requests waiting to join it.
Solution: break prefill into chunks (e.g., 512 tokens per chunk). Between chunks, run a decode step for the rest of the batch. This interleaves prefill and decode, keeping per-token latency stable for existing streams.
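
A toy model of the interleaving, assuming prefill cost is linear in tokens (it is not exactly, since attention grows faster than linearly) and ignoring decode time. The 0.1 ms/token figure is derived from the ~200ms-per-2K-token prefill number used elsewhere in this section; the function names are hypothetical.

```python
from typing import List, Tuple

Step = Tuple[str, int]  # (kind, tokens processed in this step)


def chunked_schedule(prefill_tokens: int, chunk_size: int) -> List[Step]:
    """One prefill chunk, then one decode step for the running batch, repeated."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    steps: List[Step] = []
    remaining = prefill_tokens
    while remaining > 0:
        chunk = min(chunk_size, remaining)
        steps.append(("prefill", chunk))
        steps.append(("decode", 1))
        remaining -= chunk
    return steps


def max_decode_stall_ms(steps: List[Step], prefill_ms_per_token: float) -> float:
    """Longest stretch of prefill work between two consecutive decode steps."""
    worst = 0.0
    current = 0.0
    for kind, tokens in steps:
        if kind == "prefill":
            current += tokens * prefill_ms_per_token
        else:
            worst = max(worst, current)
            current = 0.0
    return max(worst, current)


unchunked = chunked_schedule(100_000, chunk_size=100_000)
chunked = chunked_schedule(100_000, chunk_size=512)
print(max_decode_stall_ms(unchunked, 0.1), "ms stall without chunking")
print(max_decode_stall_ms(chunked, 0.1), "ms stall with 512-token chunks")
```

The chunk size is the knob: smaller chunks bound the stall for existing streams more tightly, at the price of more scheduling overhead on the long request.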

Trade-off: Chunked prefill adds ~10% more total prefill time (overhead from interleaving) but keeps decode latency stable. Without chunking, a 100K-token prefill would stall all 31 other decode streams in the batch for 2+ seconds. I accept 10% longer prefill to maintain stable per-token latency for all concurrent users.

"A customer reports time-to-first-token is 3x normal. Debug this."

IC7 signal: "I'll walk through a systematic debugging framework, not just guess. The right approach is: scope the problem, check the dashboards, narrow to a component, and identify root cause. This is how I'd lead an incident."

Step 1 — Scope:

Is this one customer or multiple? → Check TTFT dashboards filtered by org_id. If one org, likely their request pattern. If global, likely infrastructure.
Is this one model or all models? → Filter by model. If one model, likely that model's cluster. If all, likely shared infrastructure (gateway, scheduler).
Is this one region or all regions? → Filter by region. If one region, likely regional capacity or network issue.

Step 2 — Component-by-component latency breakdown:

Normal TTFT breakdown (Sonnet, 2K input tokens):
Gateway overhead: ~5ms
Safety input classifier: ~20ms
Queue wait (scheduler): ~50ms
Prefill (GPU): ~200ms
First decode token: ~15ms
Output safety check: ~2ms
Network to client: ~10ms
Total: ~302ms

If TTFT is 3x (900ms), the extra 600ms is in one of:
Queue wait → 650ms (overloaded GPU cluster, check utilization)
Prefill → 800ms (very long input, check input_tokens distribution)
Gateway → 605ms (rate limit contention, check Redis latency)
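
With per-component spans from tracing, the narrowing step is mechanical. This sketch ranks components by how far they exceed the baseline budget above; the span names and rank_regressions are hypothetical, and a missing span is treated as unchanged.

```python
from typing import Dict, List, Tuple

BASELINE_MS: Dict[str, float] = {
    "gateway": 5,
    "input_classifier": 20,
    "queue_wait": 50,
    "prefill": 200,
    "first_decode": 15,
    "output_check": 2,
    "network": 10,
}


def rank_regressions(baseline: Dict[str, float],
                     observed: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort components by (observed - baseline), largest regression first."""
    deltas = [(name, observed.get(name, ms) - ms)
              for name, ms in baseline.items()]
    return sorted(deltas, key=lambda item: item[1], reverse=True)


# Hypothetical trace for the slow customer: only queue wait has moved.
observed = dict(BASELINE_MS, queue_wait=650)
for name, delta in rank_regressions(BASELINE_MS, observed)[:3]:
    print(f"{name}: +{delta:.0f}ms")
```

Running this per org and per model turns the scoping questions in Step 1 into a query over trace data, not a manual dashboard walk.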

Step 3 — Most likely causes (in order of probability):

Queue wait spike: Their requests are waiting for GPU allocation. Check: tenant scheduler queue depth for this org. Cause: another high-priority org is consuming GPU capacity, or their traffic spiked and exceeded concurrency cap. Fix: if legitimate, scale GPU pool or adjust weights.
Long input context: Customer started sending 50K-token inputs instead of 2K. Prefill scales at least linearly with input length (attention cost grows faster than linearly at long contexts). Check: avg input_tokens for this org's recent requests. Fix: this is expected behavior, so communicate to the customer that longer contexts have higher TTFT.
Model cold start: Their requests are hitting a newly scaled-up worker that's still loading model weights (~30-60 seconds for a large model). Check: worker age distribution. Fix: improve warm pool sizing, preload models on standby workers.
KV cache fragmentation: GPU memory is fragmented after many allocations/deallocations, forcing garbage collection pauses. Check: GPU memory fragmentation metrics. Fix: implement PagedAttention (vLLM-style) to eliminate external fragmentation.
Network/infrastructure: Rarely the cause for TTFT, but check: gateway latency spikes, Redis (rate limiter) latency, cross-AZ network latency if scheduler and workers are in different AZs.

IC7 insight: The debugging framework matters more than the specific diagnosis. I'd ensure the observability stack includes: per-request distributed tracing (each component adds a span), per-component latency histograms (P50/P99/P999), and per-org breakdowns of all latency components. If we don't have this instrumentation, the first action item is to add it — you can't debug what you can't measure.

Q10BHARAT'S COST PROBE

"How do you optimize cost per query while maintaining SLOs?"

IC7 signal: "GPU cost is the dominant line item — likely 60-70% of COGS for an inference company. I'll walk through the optimization hierarchy, from biggest impact to smallest, with the SLO trade-off for each."

Cost optimization hierarchy (ordered by impact):

Optimization	Savings	SLO Impact	Complexity
1. KV cache reuse (prompt caching)	30-50% GPU time for repeat prefixes	Improves TTFT (skip prefill)	Medium
2. Continuous batching	10-30x throughput vs. static	Neutral (improves utilization)	High (already assumed)
3. Speculative decoding	2-3x decode speed	Improves per-token latency	High
4. Model distillation / tiering	10x cost difference (Haiku vs Opus)	Quality trade-off (user choice)	Low (product decision)
5. Spot/preemptible GPUs for batch	60-70% on batch workloads	No SLO (batch is async)	Medium
6. Quantization (INT8/FP8)	~2x memory efficiency	Minimal quality loss (<0.5%)	Medium
7. Right-sizing max_tokens	5-10% KV cache savings	Neutral	Low

The biggest lever — KV cache reuse:

Many API requests share common prefixes: system prompts, few-shot examples, RAG context. If we cache the KV states for these prefixes, subsequent requests skip prefill for the shared portion.
Implementation: hash the prefix tokens → look up in a distributed KV cache (Redis or custom GPU-memory cache). If hit, load the cached KV states directly to GPU memory. If miss, run normal prefill and cache the result.
For a customer whose system prompt is 2,000 tokens (common for API users), caching saves ~100-200ms of prefill per request. At 8,333 req/sec with every request hitting the cache, that's roughly 830-1,670 GPU-seconds of prefill avoided per second (~1,000 as a round figure).
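
A sketch of the hash-and-lookup step, with a dict standing in for the distributed cache and an opaque object standing in for the KV states. Including the model version in the key is an assumption made here so that a model rollout naturally misses stale entries; PrefixCache and prefix_key are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, Sequence


def prefix_key(model_version: str, prefix_tokens: Sequence[int]) -> str:
    """Stable hash over the model version and the exact prefix token ids."""
    digest = hashlib.sha256()
    digest.update(model_version.encode("utf-8"))
    digest.update(b"\x00")  # separator between version and tokens
    for token in prefix_tokens:
        digest.update(token.to_bytes(4, "little"))
    return digest.hexdigest()


class PrefixCache:
    def __init__(self) -> None:
        self._store: Dict[str, object] = {}
        self.hits = 0
        self.misses = 0

    def get_or_prefill(self, model_version: str, prefix_tokens: Sequence[int],
                       run_prefill: Callable[[Sequence[int]], object]) -> object:
        """Return cached KV states on a hit; otherwise prefill and store them."""
        key = prefix_key(model_version, prefix_tokens)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        kv_states = run_prefill(prefix_tokens)
        self._store[key] = kv_states
        return kv_states
```

The key covers the exact token sequence, so a one-token difference in the system prompt is a miss; that matches how prefix reuse works, since KV states depend on every preceding token.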

Speculative decoding:

Use a small draft model (Haiku) to generate N candidate tokens, then verify them in one forward pass of the target model (Sonnet/Opus). If K of N candidates are accepted, we emit K tokens (plus one from the verification pass itself) for roughly the cost of N cheap draft passes and one target pass, instead of K+1 target passes.
Acceptance rate depends on task: ~70-80% for straightforward text, ~40-50% for creative/reasoning. Average 2-3x speedup.
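
To sanity-check the speedup range, a common simplification is to treat each draft token as accepted independently with probability α. The expected number of tokens emitted per target pass is then (1 - α^(N+1)) / (1 - α). The draft cost ratio below (draft pass cost relative to a target pass) is an assumed input, and both function names are hypothetical.

```python
def expected_tokens_per_verify(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target forward pass (accepted + 1 from target)."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


def estimated_speedup(acceptance_rate: float, draft_len: int,
                      draft_cost_ratio: float) -> float:
    """Tokens per unit of target-pass time, relative to plain decoding (1.0)."""
    cost = 1.0 + draft_len * draft_cost_ratio  # one target pass + N draft passes
    return expected_tokens_per_verify(acceptance_rate, draft_len) / cost


for alpha in (0.5, 0.8):
    print(alpha, round(estimated_speedup(alpha, draft_len=4,
                                         draft_cost_ratio=0.05), 2))
```

Under these assumptions, 80% acceptance with four draft tokens gives roughly 2.8x and 50% gives roughly 1.6x, so the average depends heavily on the traffic mix.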

Cost savings example (KV cache + speculative decoding):
Base cost: 2,030 GPUs × $2/hr = $4,060/hr
KV cache (saves 35% prefill time, which is ~25% of total GPU time): -8.75% → saves $355/hr
Speculative decoding (2x decode speedup, the conservative end of the 2-3x range; decode is ~75% of GPU time): -50% decode → saves $1,522/hr
Optimized: ~$2,183/hr = $19.1M/year (vs. $35.6M baseline)
Savings: $16.4M/year = 46% reduction
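
The arithmetic above, as a checkable sketch. It uses the 50% decode reduction and the 25%/75% prefill/decode split from the example; the function name is hypothetical. The hourly result comes out about a dollar below the example because the example rounds each saving before subtracting.

```python
HOURS_PER_YEAR = 8760


def optimized_cost_per_hour(base_per_hour: float,
                            prefill_share: float, prefill_saving: float,
                            decode_share: float, decode_saving: float) -> float:
    """Apply independent savings to the prefill and decode shares of GPU time."""
    reduction = prefill_share * prefill_saving + decode_share * decode_saving
    return base_per_hour * (1.0 - reduction)


base = 2030 * 2.0  # GPUs x $/hr
optimized = optimized_cost_per_hour(base, prefill_share=0.25, prefill_saving=0.35,
                                    decode_share=0.75, decode_saving=0.50)
savings_per_year = (base - optimized) * HOURS_PER_YEAR
print(f"base ${base * HOURS_PER_YEAR / 1e6:.1f}M/yr, "
      f"optimized ${optimized * HOURS_PER_YEAR / 1e6:.1f}M/yr, "
      f"savings ${savings_per_year / 1e6:.1f}M/yr "
      f"({(base - optimized) / base:.0%})")
```

The model treats the two savings as additive and independent, which is optimistic if prompt caching changes the prefill/decode split that speculative decoding then acts on.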

Trade-off: Every optimization adds system complexity and failure modes. KV cache adds a distributed cache layer that can go stale (model version change invalidates all cached KV states). Speculative decoding adds a draft model to maintain and can regress acceptance rates on new input distributions. My approach: implement KV cache first (highest ROI, lowest risk), then speculative decoding for high-volume models only. Quantization last because it requires per-model quality validation.

Safety: Cost optimization NEVER bypasses the safety pipeline. The safety classifiers are lightweight (~20ms) relative to inference (3,000ms). Skipping them saves 0.7% latency at the cost of serving potentially harmful content. This is not a trade-off I'm willing to make.

Section 3: Quick-Reference — IC7 Opening Lines

Q4 — Chat 1:1 System (Jake Luciani)

Q1 — Inference API (Bharat Mediratta)