Anthropic System Design — Panel-Tailored Prep
Friday, May 22, 2026 | Two system design rounds, back-to-back | Senior/Staff level
On the Q4/Q1 mapping: "Q4" and "Q1" are Anthropic's internal question bank identifiers shown in the candidate portal. Recent candidate reports on forums suggest Q4 = chat/conversation design and Q1 = inference API design. However, multiple sources note Anthropic periodically rotates these mappings. One April 2026 post explicitly warns "the question mapping has changed." Prepare for both reported topics in each round. The question bank is small — if they don't ask chat in Q4, they'll likely pull from: inference batching, P2P file distribution, token-gen service, key-value store, or file cache. This doc covers the two reported topics deeply and gives you portable patterns for alternatives.
Your Schedule
| Time (PT) | Round | Interviewers |
| 12:00–12:55 | System Design Q4 | Jake Luciani + Alexander Knott |
| 12:55–1:00 | Break | |
| 1:00–1:50 | Non-Technical Culture | Natalie Bauman + Alon Krifcher |
| 1:50–2:00 | Break | |
| 2:00–2:55 | System Design Q1 | Bharat Mediratta + Anton Paquin |
Round 1: System Design Q4 — Chat / Conversation System
Know Your Interviewer
Jake Luciani — 16 years in distributed systems. Former Chief Architect at DataStax (commercial Cassandra). Apache Cassandra PMC member. Built the Stargate data gateway. Contributed to Apache Thrift. His DNA is data modeling, partition key design, consistency models, and storage at scale.
What this means for you: Jake will almost certainly deep-dive on your data model. He'll want to know your partition key choices, why you chose NoSQL vs SQL, how you handle message ordering, and what your read/write patterns look like at scale. Have a schema ready on paper before you walk in.
A. What They're Really Asking
Stripped of Anthropic framing: "Design a real-time 1:1 messaging system with persistent history." This is WhatsApp/Slack at its core. At Anthropic, the twist is that one participant is an LLM, which adds token streaming, billing, and safety filtering — but the infrastructure problem is still a messaging system.
A senior/staff candidate recognizes: this is a conversation management + real-time delivery + storage problem. The AI model is a black box participant. Don't spend time on inference — focus on the system around it.
B. Requirements to Drive (State These in the First 3 Minutes)
Functional:
- 1:1 conversations between a user and an AI model (Claude)
- Persistent message history per conversation, resumable across sessions
- Real-time token streaming for AI responses (not wait-for-complete)
- Multi-conversation management (user has many threads)
- Message types: text, code blocks, images (multimodal)
Non-functional:
- Sub-100ms message delivery latency (user-to-system)
- Sub-200ms time-to-first-token for AI responses
- 99.9% availability (3-nines)
- Horizontal scalability to 10M+ concurrent conversations
- Message ordering guarantees within a conversation
- Safety: content filtering on both user input and AI output before delivery
C. Back-of-Envelope
- Scale: 100M registered users, 10M DAU, avg 5 conversations/day, 10 messages/conversation
- Message throughput: 10M × 5 × 10 = 500M messages/day ≈ 6K messages/sec
- Peak: 3× average ≈ 18K messages/sec
- Storage per message: ~2KB (metadata + avg 500 tokens × 4 bytes) → 500M × 2KB = 1TB/day raw
- Concurrent WebSocket connections: 10M (1 per active user)
- Read:Write ratio: ~10:1 (users re-read history much more than they write)
- Key insight for Jake: This is a write-heavy ingest with a read-heavy tail — classic Cassandra sweet spot
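The numbers above are quick to re-derive on a whiteboard. A minimal sketch of the arithmetic, using only the inputs stated in this section; the RF=3 multiplier is borrowed from the replication factor used later in this doc, and the variable names are illustrative:

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Inputs taken from the back-of-envelope list above.
dau = 10_000_000
conversations_per_user_per_day = 5
messages_per_conversation = 10
bytes_per_message = 2_000        # ~2KB: metadata + ~500 tokens x 4 bytes
peak_factor = 3
replication_factor = 3           # assumption: RF=3, as in the Cassandra answers later

messages_per_day = dau * conversations_per_user_per_day * messages_per_conversation
avg_msgs_per_sec = messages_per_day / SECONDS_PER_DAY
peak_msgs_per_sec = avg_msgs_per_sec * peak_factor
raw_tb_per_day = messages_per_day * bytes_per_message / 1e12
replicated_tb_per_day = raw_tb_per_day * replication_factor

print(f"messages/day:      {messages_per_day:,}")
print(f"avg messages/sec:  {avg_msgs_per_sec:,.0f}")
print(f"peak messages/sec: {peak_msgs_per_sec:,.0f}")
print(f"raw storage:       {raw_tb_per_day:.1f} TB/day")
print(f"with replication:  {replicated_tb_per_day:.1f} TB/day")
```

The replicated figure is worth stating out loud: the 1TB/day raw number triples on disk once replication is counted.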
D. High-Level Design
Components (describe in this order):
- API Gateway / Load Balancer — TLS termination, auth (API keys + session tokens), rate limiting per user/org. Routes HTTP requests and upgrades WebSocket connections.
- Chat Service (stateless) — Handles message submission, conversation CRUD, history retrieval. Stateless pods behind the LB. Publishes messages to the message bus.
- Connection Manager (stateful) — Manages persistent WebSocket/SSE connections. Maps user_id → connection_id → server_id. This is the only stateful component — use consistent hashing for connection routing.
- Message Bus — Kafka or similar. Partitioned by conversation_id for ordering. Decouples ingestion from processing and delivery.
- Safety Pipeline — Sits between user input and model dispatch. Constitutional Classifiers run on user messages before forwarding to inference. Output filters run on model responses before delivery. Async but bounded latency (< 50ms budget).
- Inference Router — Dispatches to the appropriate model cluster based on conversation's model selection. Handles streaming token responses back through the connection manager.
- Message Store — Persistent storage. Wide-column store (Cassandra) for conversation history. Redis for recent message cache and presence.
- Notification Service — Push notifications for offline users, email digests.
Data flow: User sends message → API Gateway → Chat Service → Safety Pipeline (filter) → Message Bus → (1) Message Store (persist) + (2) Inference Router (generate response) → Model generates tokens → Safety Pipeline (output filter) → Connection Manager → stream tokens to user via WebSocket/SSE.
E. Data Model Deep Dive (Jake's Focus Area)
Conversations table:
Partition key: user_id
Clustering key: updated_at DESC, conversation_id
user_id | updated_at          | conversation_id | title    | model_version | created_at
--------|---------------------|-----------------|----------|---------------|-----------
u_123   | 2026-05-22T14:30:00 | conv_abc        | "Debug…" | claude-4-opus | 2026-05-20
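Why this partition key: A user's conversation list is the most common query. Clustering by updated_at DESC gives you "most recent first" for free — no sorting at read time. Caveat to raise before Jake does: updated_at is a clustering column, so it cannot be updated in place. Bumping it on every new message means a delete plus re-insert, which generates tombstones.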
Messages table:
Partition key: conversation_id
Clustering key: message_id (Snowflake/ULID — time-ordered, globally unique)
conversation_id | message_id   | sender | role      | content    | token_count | created_at
----------------|--------------|--------|-----------|------------|-------------|-----------
conv_abc        | 01HZXY...001 | u_123  | user      | "Help me…" | 45          | ...
conv_abc        | 01HZXY...002 | model  | assistant | "Sure, …"  | 312         | ...
Why wide-column (Cassandra) fits:
- Partition per conversation → all messages for a conversation co-located on the same node
- Clustering by time-ordered ID → sequential reads are a single disk sweep, not random I/O
- Write-optimized LSM tree → handles sustained write throughput without read amplification at this ratio
- Linear horizontal scale → add nodes to handle more conversations without resharding
Why NOT relational (anticipate Jake's pushback):
- Relational works at moderate scale but B-tree writes degrade under sustained message ingest
- Sharding a relational DB by conversation_id is effectively reinventing Cassandra's partition model with more operational overhead
- Trade-off: you lose ad-hoc cross-conversation queries (e.g., "search all my conversations") — solve this with a separate search index (Elasticsearch) fed from the Kafka stream
Ordering guarantees:
- Use ULIDs or Snowflake IDs — time-ordered + globally unique, no clock-sync dependency within a single conversation
- Within a conversation, the Chat Service assigns message IDs sequentially — single writer per conversation at any moment (user or model, never both simultaneously)
- Cross-conversation ordering doesn't matter (conversations are independent)
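If Jake asks how the Chat Service keeps IDs ordered when the wall clock misbehaves, a sketch like the following may help. It is not a real ULID or Snowflake implementation: the class name, the ID layout (millisecond timestamp plus sequence), and the injectable clock are all assumptions made for illustration.

```python
import time
from typing import Callable


def _wall_clock_ms() -> int:
    return time.time_ns() // 1_000_000


class MonotonicIdGenerator:
    """Hypothetical per-conversation ID source: time-ordered, never goes backwards."""

    def __init__(self, clock_ms: Callable[[], int] = _wall_clock_ms):
        self._clock_ms = clock_ms
        self._last_ms = 0
        self._seq = 0

    def next_id(self) -> str:
        now = self._clock_ms()
        if now > self._last_ms:
            # Clock advanced: start a fresh sequence for the new millisecond.
            self._last_ms, self._seq = now, 0
        else:
            # Same millisecond, or the clock stepped backwards: keep the last
            # timestamp and bump the sequence so ordering is preserved.
            self._seq += 1
        # Fixed-width fields so lexicographic order matches numeric order.
        return f"{self._last_ms:013d}-{self._seq:06d}"


# Usage with a fake clock that repeats a tick and then steps backwards.
ticks = iter([1000, 1000, 999, 1001])
gen = MonotonicIdGenerator(clock_ms=lambda: next(ticks))
ids = [gen.next_id() for _ in range(4)]
print(ids)
```

This only holds with one generator per conversation at a time, which is the single-writer assumption stated above.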
F. Streaming & Real-Time Delivery
- Protocol choice: SSE (Server-Sent Events) over WebSocket for this use case. Rationale: Claude's responses are unidirectional server→client streams. SSE is simpler (HTTP/2 compatible, no custom framing, auto-reconnect built in). WebSocket adds bidirectional overhead you don't need for token streaming.
- User messages: Standard HTTPS POST. No need for persistent connection for the send path.
- Token streaming: Model generates tokens → each token (or small batch of 3-5 tokens) pushed to SSE connection → client renders incrementally. Buffer flush every ~50ms or every N tokens, whichever comes first.
- Reconnection: Client reconnects with Last-Event-Id header → server replays missed tokens from a short-lived buffer (Redis with 5-min TTL on in-flight response tokens).
- Connection Manager scaling: Sticky sessions via consistent hashing on user_id. If a Connection Manager pod dies, clients reconnect and land on a different pod — the message bus ensures no messages are lost.
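The reconnection path is easier to defend with the replay logic written out. A minimal in-memory sketch, assuming integer event IDs and single-line token payloads; `ReplayBuffer` and `format_sse` are hypothetical names, and a bounded deque stands in for the Redis buffer with a TTL described above:

```python
from collections import deque


class ReplayBuffer:
    """Hypothetical per-response buffer of recently streamed events."""

    def __init__(self, max_events: int = 1000):
        # Bounded deque: the oldest events fall off once the buffer is full.
        self._events = deque(maxlen=max_events)
        self._next_id = 1

    def append(self, data: str) -> int:
        event_id = self._next_id
        self._events.append((event_id, data))
        self._next_id += 1
        return event_id

    def replay_after(self, last_event_id: int) -> list[tuple[int, str]]:
        """Events the client has not seen, given the ID it last received."""
        return [(i, d) for i, d in self._events if i > last_event_id]


def format_sse(event_id: int, data: str) -> str:
    """Serialize one event in SSE wire format (single-line data assumed)."""
    return f"id: {event_id}\ndata: {data}\n\n"


buf = ReplayBuffer()
for chunk in ["Hel", "lo", " world"]:
    buf.append(chunk)

# The client dropped after event 1 and reconnects with that ID.
missed = buf.replay_after(1)
wire = "".join(format_sse(i, d) for i, d in missed)
print(wire)
```

The `id:` field is what lets the browser's EventSource send the last seen ID back on reconnect, so the server needs no per-client cursor.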
G. Consistency & Ordering
- Exactly-once delivery (in effect: at-least-once delivery plus idempotent writes): Idempotency key per message (client-generated UUID). Chat Service deduplicates on write using a short-lived idempotency cache (Redis SET NX, 5-min TTL). If the client retries with the same key, no duplicate is stored.
- Message ordering within conversation: Guaranteed by single-writer model (user sends → waits for model response → sends again). Kafka partition per conversation_id ensures ordered consumption.
- Eventual consistency on history reads: Cassandra read from quorum for recent messages, read from ONE for older history (acceptable staleness). Recent messages also served from Redis cache for sub-5ms latency.
- Failure modes: If safety pipeline times out → fail open with a flag for async review, OR fail closed (return error to user). At Anthropic, default to fail closed — never deliver unfiltered content.
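The dedup step in the first bullet can be sketched without Redis. This is an in-memory stand-in for the SET NX plus TTL pattern, not the Redis client API; `IdempotencyCache` and `claim` are hypothetical names, and the sketch is single-threaded, whereas Redis provides the atomicity in a real deployment.

```python
import time
from typing import Callable


class IdempotencyCache:
    """In-memory stand-in for 'SET key NX' with an expiry."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: dict[str, float] = {}

    def claim(self, key: str) -> bool:
        """True if this key is new (or expired); False if it is a retry."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False          # duplicate within the TTL: skip the write
        self._expires_at[key] = now + self._ttl
        return True


# Usage with a controllable clock.
fake_now = [0.0]
cache = IdempotencyCache(ttl_seconds=300, clock=lambda: fake_now[0])

first = cache.claim("client-uuid-1")      # original send
retry = cache.claim("client-uuid-1")      # client retry, same key
fake_now[0] = 301.0
after_ttl = cache.claim("client-uuid-1")  # key expired, treated as new
print(first, retry, after_ttl)
```

The last line shows the limit of the approach: a retry that arrives after the TTL is indistinguishable from a new message, so the TTL has to exceed the client's retry window.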
H. Safety Architecture (Lead With This)
State this proactively before Jake asks:
- Input filtering: User messages pass through Constitutional Classifiers before reaching the model. Classifies: prompt injection attempts, harmful content requests, PII in prompts.
- Output filtering: Model responses pass through a separate output classifier before token delivery. Can halt mid-stream if tokens form a harmful sequence (requires buffering a small lookahead window).
- Audit logging: Every message (input + output) logged to an immutable audit store with conversation_id, user_id, safety scores, and filter decisions. Required for incident investigation.
- PII handling: User messages may contain PII. Encrypt at rest (AES-256), restrict access to audit logs, implement retention policies (auto-delete after N days unless flagged).
- Rate limiting as safety: Per-user message rate limits prevent abuse patterns (rapid-fire prompt injection attempts).
I. Questions Jake Will Ask (and How to Answer)
1. "Why not just use PostgreSQL with a conversation_id foreign key?"
Lead with: Works at moderate scale. But at 500M messages/day, B-tree write amplification becomes the bottleneck. Cassandra's LSM tree handles sustained writes at this ratio. The real trade-off is giving up ad-hoc joins — compensate with a search index for cross-conversation queries.
2. "What's your partition key for messages, and what happens when a conversation gets very long?"
Lead with: Partition by conversation_id. Risk: very long conversations (10K+ messages) create oversized, potentially hot partitions. Mitigation: bucket the partition key — conversation_id + time_bucket (e.g., daily). Trade-off: cross-bucket reads for full history require multiple partition reads, but recent-messages queries (90% of reads) hit a single bucket.
3. "How do you guarantee message ordering if your Chat Service has multiple replicas?"
Lead with: Within a conversation, there's a natural single-writer constraint (user waits for model response before sending next). Kafka partition by conversation_id preserves order. If you need concurrent senders (future group chat), use a sequence number assigned by a lightweight coordinator per conversation.
4. "What consistency level do you read at, and why?"
Lead with: Write at QUORUM (majority acknowledge). Read at QUORUM for recent messages (strong consistency within conversation). Read at ONE for historical messages (eventual consistency acceptable, lower latency). This gives you tunable consistency per access pattern.
5. "How do you handle a user reconnecting mid-stream during an AI response?"
Lead with: In-flight response tokens are buffered in Redis with a short TTL (5 min). Client reconnects with Last-Event-Id. Server replays from the buffer. If buffer expired, serve the partially-completed response from the Message Store and trigger a continuation from the model.
6. "What happens if the safety classifier is slow? Do you block the response?"
Lead with: Safety has a latency budget (50ms). If exceeded: fail closed for input (reject the message), fail closed for output (stop streaming). At Anthropic, never deliver unfiltered content. Optimize classifier latency with model distillation and batching.
7. "How would you add full-text search across all conversations for a user?"
Lead with: Separate concern. Kafka consumer feeds messages to Elasticsearch, indexed by user_id + content + timestamp. Cassandra is the system of record; ES is a read-optimized projection. Accept eventual consistency (seconds of delay) on the search index.
8. "Walk me through what happens when a Cassandra node goes down."
Lead with: RF=3 across AZs. One node down → reads/writes at QUORUM still succeed (2 of 3). Hinted handoff queues writes for the downed node. When it recovers, anti-entropy repair syncs it. User sees zero impact. Two nodes down → QUORUM fails, degrade to read at ONE (stale reads OK) while alerting.
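The consistency answers in questions 4 and 8 rest on simple replica arithmetic. A small sketch that may help you state it precisely; the function names are illustrative, and the overlap rule (read acks + write acks > RF) is the standard condition for a read to intersect the latest acknowledged write:

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas."""
    return replication_factor // 2 + 1


def read_sees_latest_write(rf: int, write_acks: int, read_acks: int) -> bool:
    """Read and write replica sets must overlap in at least one node."""
    return write_acks + read_acks > rf


def can_serve(rf: int, nodes_down: int, required_acks: int) -> bool:
    """Enough live replicas remain to satisfy the consistency level."""
    return rf - nodes_down >= required_acks


RF = 3
Q = quorum(RF)

print("QUORUM size:", Q)
print("write QUORUM + read QUORUM overlaps:", read_sees_latest_write(RF, Q, Q))
print("write QUORUM + read ONE overlaps:", read_sees_latest_write(RF, Q, 1))
print("one node down, QUORUM ok:", can_serve(RF, 1, Q))
print("two nodes down, QUORUM ok:", can_serve(RF, 2, Q))
print("two nodes down, ONE ok:", can_serve(RF, 2, 1))
```

With RF=3 this reproduces the answers above: QUORUM survives one node down, and only reads at ONE survive two.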
Round 2: System Design Q1 — Inference API
Know Your Interviewer
Bharat Mediratta — Google Distinguished Engineer (built/maintained GWS — Google Web Server, the system serving ALL Google web traffic). CTO at Dropbox. CTO at AltSchool. First Engineering Fellow at Coinbase. He has 20+ years of building the most demanding serving infrastructure on the planet.
What this means for you: Bharat will probe API contract precision, multi-tenant fairness, SLOs, failure mode enumeration, capacity planning methodology, and cost optimization. He's seen every anti-pattern. He'll push you on: what happens at 10× expected load? How do you prevent one tenant from starving others? What are your SLIs and how do you alert on them? Don't hand-wave. Be precise.
A. What They're Really Asking
Stripped of framing: "Design a multi-tenant, multi-model API serving layer for computationally expensive requests on constrained hardware (GPUs)." This is a rate-limited API gateway + job scheduler + GPU resource manager. Think: Google's serving stack for expensive compute operations, not a simple CRUD API.
A senior/staff candidate recognizes: the core challenges are (1) multi-tenant isolation on shared GPU resources, (2) SLO management when inference times vary 10-100× by request, and (3) graceful degradation under load.
B. Requirements to Drive
Functional:
- Synchronous + streaming API for text generation (/v1/messages)
- Multi-model support (Haiku, Sonnet, Opus) with version pinning
- Batch API for async bulk processing (/v1/batch)
- Per-org authentication, billing, and usage tracking
- Token-level streaming (SSE) for real-time responses
Non-functional:
- Time-to-first-token (TTFT) P50 < 500ms, P99 < 2s for Sonnet
- Per-token latency P50 < 30ms
- 99.9% availability (monthly error budget: 43 min downtime)
- Multi-region (US, EU, APAC) for latency and compliance
- Per-org rate limits: token bucket (requests/min) + concurrency cap
- Safety: all requests pass through the safety pipeline before and after inference
C. Back-of-Envelope
- Scale: 100K orgs, 1M API keys, peak 500K requests/minute globally
- QPS: 500K/60 ≈ 8,300 req/s peak
- GPU throughput (Sonnet on H100): ~50 tokens/s per request, dynamic batching allows ~32 concurrent requests per GPU → ~1,600 tokens/s per GPU
- Average request: 500 input tokens (prefill) + 200 output tokens (decode)
- Prefill time: 500 tokens / 10K tokens/s ≈ 50ms per request (GPU-bound)
- Decode time: 200 tokens × 20ms/token ≈ 4s per request
- GPU count estimate: 8,300 req/s × 4s avg per request ≈ 33,200 concurrent requests → 33,200 / 32 batch slots per GPU ≈ 1,040 GPUs minimum for Sonnet alone
- Cost per 1M output tokens: ~$3 (GPU amortization + overhead) — informs pricing and rate limiting
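The GPU estimate is the number Bharat is most likely to ask you to re-derive. A minimal sketch of the same arithmetic, using only the figures from this list; like the estimate above, it counts decode time only and leaves the ~50ms prefill out of the occupancy figure:

```python
import math

# Inputs from the back-of-envelope list above.
peak_requests_per_min = 500_000
input_tokens = 500
output_tokens = 200
prefill_tokens_per_sec = 10_000
decode_ms_per_token = 20
batch_slots_per_gpu = 32

qps = peak_requests_per_min / 60
prefill_s = input_tokens / prefill_tokens_per_sec
decode_s = output_tokens * decode_ms_per_token / 1000

# Requests in flight = arrival rate x time each request occupies a slot.
concurrent_requests = qps * decode_s
gpus = math.ceil(concurrent_requests / batch_slots_per_gpu)

print(f"peak QPS:            {qps:,.0f}")
print(f"prefill per request: {prefill_s * 1000:.0f} ms")
print(f"decode per request:  {decode_s:.1f} s")
print(f"concurrent requests: {concurrent_requests:,.0f}")
print(f"GPUs (no headroom):  {gpus:,}")
```

Using the unrounded 8,333 req/s gives a slightly higher count than the rounded 8,300 used above; both land at roughly 1,040 GPUs.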
D. High-Level Design
- API Gateway — TLS termination, authentication (API key → org_id lookup), request validation, rate limiting (token bucket per org + global), request ID assignment.
- Request Router — Routes to correct model cluster based on model + version in request. Weighted routing for canary deployments (e.g., 5% to new version). Admission control: rejects requests when queue depth exceeds threshold.
- Model Registry — Immutable model versions. Maps claude-4-sonnet-20260520 → model artifact location + serving config. Supports aliases (claude-4-sonnet-latest) and pinned versions.
- Request Queue — Per-model, priority-weighted queue. Priority levels: real-time sync > streaming > batch. Within same priority: weighted fair queuing by org (prevents one large org from starving others).
- GPU Scheduler — Core component. Manages GPU fleet. Implements continuous batching: new requests join in-flight batches during decode phase. Tracks KV cache per request. Handles preemption (low-priority batch requests yield to real-time requests).
- Inference Workers — Stateful pods, each owns 1-8 GPUs. Runs model inference. Reports health, utilization, and queue depth to scheduler. Handles token generation and streams tokens back.
- Streaming Proxy — Manages SSE connections to clients. Buffers tokens from inference workers, flushes to clients. Handles client disconnection gracefully (stops inference to reclaim GPU).
- Safety Pipeline — Pre-inference: input classifiers (prompt injection, harmful content). Post-inference: output classifiers on generated text. Runs on CPU to avoid consuming GPU budget.
- Observability Layer — SLI collection (TTFT, per-token latency, error rate per org/model), distributed tracing, GPU utilization dashboards, cost accounting per org.
E. API Design Deep Dive (Bharat's Focus Area)
Primary endpoint:
    POST /v1/messages
    Headers:
      x-api-key: sk-ant-...
      anthropic-version: 2026-05-01
      x-request-id: req_uuid (idempotency key)
    Body:
      {
        "model": "claude-4-sonnet-20260520",
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": "..."}],
        "stream": true,
        "metadata": {"user_id": "end-user-123"} // for per-user safety tracking
      }
Response (streaming):
    event: message_start
    data: {"id": "msg_01X...", "type": "message_start", "model": "claude-4-sonnet-20260520"}
    event: content_block_delta
    data: {"type": "content_block_delta", "delta": {"text": "Hello"}}
    ...
    event: message_stop
    data: {"type": "message_stop", "usage": {"input_tokens": 45, "output_tokens": 312}}
Rate limiting (Bharat will probe this):
- Three-tier token bucket: (1) Requests per minute per org, (2) Tokens per minute per org (input + output), (3) Concurrent requests per org
- Global rate limit: Admission control at the Request Router — if total queue depth across all orgs exceeds capacity, shed lowest-priority requests first
- Response headers: x-ratelimit-limit, x-ratelimit-remaining, x-ratelimit-reset, retry-after
- 429 response: Include which limit was hit (requests vs tokens vs concurrency) so clients can adapt
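If Bharat asks you to whiteboard the limiter, a token bucket is short enough to write from memory. A minimal single-process sketch; `TokenBucket`, `try_consume`, and `retry_after` are illustrative names, and a production limiter would keep this state in a shared store rather than in process memory:

```python
import time
from typing import Callable


class TokenBucket:
    """Refills continuously at refill_per_sec, capped at capacity."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_sec)
        self._last = now

    def try_consume(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; cost can be 1 request or N LLM tokens."""
        self._refill()
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False

    def retry_after(self, cost: float = 1.0) -> float:
        """Seconds until `cost` tokens are available (for the retry-after header)."""
        self._refill()
        return max(0.0, cost - self._tokens) / self.refill_per_sec


# Usage with a controllable clock: burst of 2, then 1 request/second.
fake_now = [0.0]
bucket = TokenBucket(capacity=2, refill_per_sec=1, clock=lambda: fake_now[0])
results = [bucket.try_consume() for _ in range(3)]
wait = bucket.retry_after()
fake_now[0] = 1.0
after_wait = bucket.try_consume()
print(results, wait, after_wait)
```

The same class covers the requests-per-minute and tokens-per-minute tiers by changing `cost`. The concurrency tier is a counter of in-flight requests, not a bucket.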
Idempotency (Bharat cares about this from GWS):
- Client sends x-request-id header. Server deduplicates: if same request_id seen within 24h, return cached response.
- Idempotency cache: Redis with 24h TTL. Key = org_id + request_id → response reference.
- For streaming: idempotent replay means re-streaming from buffer. If buffer expired, return the completed response as a single non-streaming response.
Error handling:
- Structured errors: {"type": "error", "error": {"type": "rate_limit_error", "message": "..."}}
- Error types: authentication_error, rate_limit_error, overloaded_error (503), invalid_request_error, api_error (500)
- Distinguish between retriable (429, 503, 529) and non-retriable (400, 401, 403)
- For streaming: if inference fails mid-stream, send an error event in the SSE stream rather than silently closing the connection
Versioning strategy:
- Model version in body ("model": "claude-4-sonnet-20260520"), NOT in URL path
- API version in header (anthropic-version: 2026-05-01) — controls request/response schema, not model
- Aliases like claude-4-sonnet-latest resolve server-side; response always includes the resolved version
F. GPU Scheduling & Batching
- Continuous batching: Don't wait for a full batch. New requests join in-flight batches during the decode phase. This is the key optimization — transforms GPU utilization from ~30% (static batching) to ~80%+.
- Prefill vs decode separation: Prefill is compute-bound (processes all input tokens at once). Decode is memory-bound (generates one token at a time, reads KV cache). Different GPU utilization profiles → consider separating prefill and decode onto different GPU pools for optimal scheduling.
- KV cache management: Each in-flight request holds KV cache proportional to (input_tokens + generated_tokens) × per-token KV size, where per-token KV size ≈ 2 (keys + values) × layers × KV heads × head dim × bytes per element. Fixed GPU memory → KV cache is the bottleneck for batch size. When memory is full, either reject new requests or evict (preempt) low-priority requests' KV cache.
- Preemption: Batch API requests can be preempted (their KV cache swapped to CPU RAM) to make room for real-time requests. Resume later by swapping back. Trade-off: swap overhead (~100ms) vs keeping GPUs idle waiting for a new batch.
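To back the claim that KV cache caps batch size, it helps to have one worked number. The sketch below uses a hypothetical model shape (80 layers, 8 KV heads, head dimension 128, fp16) and a hypothetical 40GB KV budget per GPU. None of these describe any Claude model; they are placeholders to make the arithmetic concrete.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache bytes for one token: keys + values, across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrent_requests(kv_budget_bytes: int, tokens_per_request: int,
                            per_token_bytes: int) -> int:
    """How many requests fit if each holds its full context in KV cache."""
    return kv_budget_bytes // (tokens_per_request * per_token_bytes)


# Hypothetical model shape and memory budget (assumptions, not real specs).
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
tokens_per_request = 500 + 200          # average input + output from section C
kv_budget = 40 * 10**9                  # 40 GB left for KV cache after weights

slots = max_concurrent_requests(kv_budget, tokens_per_request, per_token)
print(f"KV bytes per token: {per_token:,}")
print(f"KV per request:     {tokens_per_request * per_token / 1e6:.0f} MB")
print(f"requests that fit:  {slots}")

# A single long-context request changes the picture quickly.
long_ctx_slots = max_concurrent_requests(kv_budget, 100_000, per_token)
print(f"100K-token requests that fit: {long_ctx_slots}")
```

The last line is the useful talking point: one long-context request occupies the memory of roughly a hundred average ones, which is why fairness is measured in GPU resources consumed, not request count.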
G. Multi-Tenant Isolation (Bharat's Expertise — He Built GWS)
- Weighted fair queuing: Each org gets a weight proportional to their tier (free, pro, enterprise). When the system is at capacity, higher-weighted orgs get proportionally more GPU time. No org gets zero throughput (starvation-free).
- Concurrency caps: Hard limit on concurrent in-flight requests per org. Prevents one org's batch job from consuming all GPU slots.
- Request isolation: Requests from different orgs can share a GPU batch (efficiency), but each org's usage is tracked independently for billing and rate limiting.
- Noisy neighbor protection: If one org submits many long-context requests (consuming disproportionate KV cache), their effective concurrency is reduced proportionally. Measure fairness in "GPU-seconds consumed" not "request count."
- Blast radius: Org-level circuit breaker. If an org's requests are causing elevated error rates (e.g., malformed prompts causing safety rejections), throttle that org specifically without impacting others.
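Weighted fair queuing is easier to explain with the tagging rule in hand. This is a sketch of one common variant (self-clocked fair queuing), where each request gets a virtual finish tag of start + cost / weight and the scheduler always serves the smallest tag. The class name and the tier weights are assumptions; `cost` would be estimated GPU-seconds, per the noisy-neighbor bullet above.

```python
import heapq
from collections import defaultdict


class WeightedFairQueue:
    """Serve requests in order of virtual finish time (cost / weight)."""

    def __init__(self, weights: dict[str, float]):
        self._weights = weights
        self._last_finish = defaultdict(float)  # per-org last finish tag
        self._heap = []                         # (finish_tag, seq, org, request)
        self._virtual_time = 0.0
        self._seq = 0                           # tie-breaker, keeps FIFO within ties

    def enqueue(self, org: str, request: str, cost: float = 1.0) -> None:
        # An idle org starts at the current virtual time, so it cannot bank
        # unused capacity and later burst past everyone else.
        start = max(self._virtual_time, self._last_finish[org])
        finish = start + cost / self._weights[org]
        self._last_finish[org] = finish
        heapq.heappush(self._heap, (finish, self._seq, org, request))
        self._seq += 1

    def dequeue(self) -> tuple[str, str]:
        finish, _, org, request = heapq.heappop(self._heap)
        self._virtual_time = finish
        return org, request

    def __len__(self) -> int:
        return len(self._heap)


# Usage: enterprise has 3x the weight of free; both flood the queue.
wfq = WeightedFairQueue({"enterprise": 3.0, "free": 1.0})
for i in range(6):
    wfq.enqueue("enterprise", f"e{i}")
    wfq.enqueue("free", f"f{i}")

first_four = [wfq.dequeue()[0] for _ in range(4)]
print(first_four)
```

Under contention the enterprise org gets about three slots for every one the free org gets, and the free org still makes progress, which is the starvation-free property claimed above.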
H. SLOs & Reliability
SLIs (what you measure):
- Time-to-first-token (TTFT): measures queue wait + prefill time
- Inter-token latency (ITL): measures decode throughput
- Request success rate: (200 responses) / (total requests)
- Availability: (minutes in which the 5xx error rate stays at or below 1%) / (total minutes)
SLOs (what you promise):
- TTFT P99 < 2s (Sonnet), < 5s (Opus)
- ITL P99 < 50ms
- Monthly availability: 99.9% (43 min/month error budget)
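The availability SLI and the 43-minute budget can be computed from per-minute counters. A minimal sketch, assuming each minute is summarized as a (total requests, 5xx count) pair and that minutes with no traffic count as good; both conventions are assumptions you should state if asked.

```python
def is_good_minute(total: int, errors_5xx: int, threshold: float = 0.01) -> bool:
    """A minute is good if its 5xx rate is at or below the threshold."""
    return total == 0 or errors_5xx / total <= threshold


def availability_sli(minutes: list[tuple[int, int]]) -> float:
    good = sum(1 for total, errs in minutes if is_good_minute(total, errs))
    return good / len(minutes)


def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Bad minutes allowed per window at the given SLO."""
    return (1 - slo) * days * 24 * 60


# Usage: four sample minutes as (total_requests, 5xx_count).
sample = [(1000, 5), (1000, 20), (0, 0), (1000, 10)]
sli = availability_sli(sample)
budget = error_budget_minutes(0.999)
print(f"availability over sample: {sli:.2%}")
print(f"monthly budget at 99.9%:  {budget:.1f} min")
```

Only the 2% minute in the sample counts as bad, and the 30-day budget comes out at 43.2 minutes, matching the figure above.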
Failure modes Bharat will ask about:
- GPU failure mid-inference: Request fails → client retries → lands on different GPU. Streaming response: client sees connection drop + error event → reconnects with same request_id → idempotency returns cached partial response + continues from checkpoint (if supported) or restarts.
- Cascading failure: One model version has a bug causing OOM on GPUs → circuit breaker per model version. After N failures in M seconds, stop routing to that version, fall back to previous stable version.
- Thundering herd: Major customer launches a product → 10× traffic spike. Admission control sheds excess beyond capacity. Autoscaler triggers GPU provisioning (minutes-scale). In the meantime, priority queue ensures paid tiers degrade last.
- Regional failure: Multi-region active-active. DNS failover routes traffic to surviving regions. Each region sized at 150% of its expected load to absorb failover traffic.
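The "N failures in M seconds" rule from the cascading-failure bullet can be sketched as a sliding-window breaker. The class and function names are hypothetical, and this version only trips open: it has no half-open probe or automatic reset, which a real breaker would need.

```python
import time
from collections import deque
from typing import Callable


class VersionCircuitBreaker:
    """Opens after max_failures failures within window_seconds."""

    def __init__(self, max_failures: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._max_failures = max_failures
        self._window = window_seconds
        self._clock = clock
        self._failures = deque()   # timestamps of recent failures
        self.open = False

    def record_failure(self) -> None:
        now = self._clock()
        self._failures.append(now)
        # Drop failures that have aged out of the window.
        while self._failures and now - self._failures[0] > self._window:
            self._failures.popleft()
        if len(self._failures) >= self._max_failures:
            self.open = True


def pick_version(candidate: str, stable: str, breaker: VersionCircuitBreaker) -> str:
    """Route to the previous stable version once the breaker has opened."""
    return stable if breaker.open else candidate


# Usage with a controllable clock: 3 failures within 10 seconds trips it.
fake_now = [0.0]
breaker = VersionCircuitBreaker(max_failures=3, window_seconds=10,
                                clock=lambda: fake_now[0])
for t in (0.0, 20.0, 21.0):
    fake_now[0] = t
    breaker.record_failure()
open_after_spread_failures = breaker.open   # the failure at t=0 has aged out

fake_now[0] = 22.0
breaker.record_failure()
routed_to = pick_version("sonnet-new", "sonnet-stable", breaker)
print(open_after_spread_failures, breaker.open, routed_to)
```

The version strings here are placeholders. The point is that scattered failures do not trip the breaker, while a burst inside the window does.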
I. Model Loading & Rollouts
- Model registry: Immutable artifacts. Each version = a specific set of model weights + serving config + safety classifier version. Stored in object storage (S3), cached on local NVMe of GPU nodes.
- Warm pools: Pre-provisioned GPU nodes with model weights loaded but not serving traffic. New requests can be routed immediately without cold-start. Maintain warm pool = 10-15% of active fleet.
- Canary deployments: New model version gets 2% of traffic for 1 hour. Monitor TTFT, error rate, safety trigger rate. If all SLIs within bounds → ramp to 10% → 50% → 100% over 4 hours. Auto-rollback if any SLI degrades beyond threshold.
- Cold start mitigation: Model weights are 10-100GB. Loading from S3 takes minutes. Mitigate with: (1) NVMe local cache on GPU nodes, (2) peer-to-peer weight distribution across nodes (BitTorrent-style), (3) warm pools pre-loaded with expected models.
J. Safety Architecture (Lead With This)
- Pre-inference safety: Input classifiers run on CPU (not GPU). Check for: prompt injection, jailbreak attempts, harmful content requests, PII patterns. Latency budget: < 50ms. If flagged → reject with content_policy_error.
- Post-inference safety: Output classifiers run on generated text. For streaming: buffer a small token window (10-20 tokens) for lookahead classification before flushing to client. Trade-off: adds ~20ms to per-token delivery but prevents harmful content from reaching users.
- Constitutional Classifiers: Anthropic's approach — train safety classifiers using the same constitutional AI principles that train the models. These run as a separate service, version-pinned alongside the model version in the registry.
- Audit trail: Every request logged: org_id, model version, input/output tokens, safety classifier results, latency breakdown. Immutable log (append-only, encrypted). Retention: 90 days by default, configurable per-org for compliance.
- Safety SLO: False negative rate on harmful content < 0.01%. Safety pipeline availability must exceed the API's overall availability target (safety cannot be the single point of failure).
K. Questions Bharat Will Ask (and How to Answer)
1. "How do you prevent one tenant from starving all others during a traffic spike?"
Lead with: Weighted fair queuing at the request queue level. Each org gets a weight. When queue depth exceeds capacity, scheduler dequeues proportionally to weights. Combine with per-org concurrency caps. Measure fairness as "GPU-seconds per org" not "requests per org" — a long Opus request consumes 20× a short Haiku request.
2. "What's your capacity planning methodology?"
Lead with: Start from SLO backwards. Target TTFT P99 < 2s means queue wait must be < 1.5s (reserving 500ms for prefill). Apply Little's Law: max queue depth = arrival rate × max wait time. GPU fleet sized to drain the queue within that wait time at peak. Add 50% headroom for burst + failover. Review monthly against actual utilization; autoscale horizontally for organic growth.
3. "Walk me through what happens when a GPU OOMs mid-inference."
Lead with: Immediate: that request fails, client gets an error event in the SSE stream. All other requests on the same GPU are affected (shared batch). Worker reports unhealthy → scheduler stops routing to that worker. Worker restarts, reloads model weights from NVMe cache (~30s). Requests that were in-flight are retried by clients using idempotency keys. Systemic: if OOMs are frequent for a model version, circuit breaker triggers auto-rollback to previous version.
4. "How do you handle a request that takes 10× longer than expected?"
Lead with: Server-side timeout per request (max_tokens × expected_per_token_latency × 3). If exceeded, terminate inference, return partial response + timeout error. The real risk is KV cache: a runaway request holds GPU memory, blocking new requests. Enforce a max_tokens hard cap (server-side, regardless of client request). For batch API: separate timeout (hours, not seconds).
5. "How do you do a safe model rollout without risking the entire fleet?"
Lead with: Canary → ramp → full rollout. Canary: 2% of traffic to new version. Monitor: TTFT, error rate, safety trigger rate, GPU utilization. Automated promotion gates: if all SLIs within X% of baseline for 1 hour, ramp to 10%. Human approval gate before 50%. Blast radius: canary runs in a single AZ initially. Rollback: flip the model registry alias back to previous version — takes effect within seconds for new requests.
6. "What's the cost per query, and how do you optimize it?"
Lead with: Cost = GPU-seconds per query. Optimization levers: (1) continuous batching increases GPU utilization from ~30% to 80%, (2) KV cache sharing for common prompt prefixes (system prompts), (3) quantization (INT8/FP8) reduces memory → larger batch sizes, (4) speculative decoding for smaller models, (5) request routing: short requests to smaller/cheaper models when appropriate. Track cost per 1M tokens as the business metric.
7. "How do you handle the client disconnecting mid-stream?"
Lead with: Streaming proxy detects TCP FIN. Propagates cancellation upstream to inference worker. Worker stops generating tokens and frees KV cache immediately. This is critical for cost — a disconnected client still consuming GPU is pure waste. Edge case: client reconnects within 5s → check if inference is still running → resume streaming. After 5s → inference is cancelled, client must re-request.
8. "How do you handle a complete region failure?"
Lead with: Multi-region active-active. DNS-based failover (Route53 health checks). Each region sized at 150% of expected load. When a region fails: (1) DNS health check fails within 30s, (2) traffic shifts to surviving regions, (3) admission control may shed lowest-priority traffic if combined load exceeds capacity, (4) autoscaler provisions additional GPUs (minutes-scale). No data loss: request state is ephemeral (in-flight only), conversation history in the chat system is the durable store.
Section 3: Pacing Cheat Sheet
55-Minute Breakdown (Both Rounds)
0–3 min Clarify scope + state requirements. Don't ask too many questions — propose constraints and confirm.
3–7 min Back-of-envelope. 2-3 key numbers. State them and move on.
7–22 min High-level design. Walk through all components. Mention safety proactively.
22–47 min Deep dives. For Q4: data model + streaming. For Q1: API design + multi-tenant isolation. The interviewer will steer — follow their interest.
47–52 min Trade-offs and what you'd do differently. Mention monitoring, cost, evolution.
52–55 min Questions for interviewer. "What's the hardest scaling challenge your team faced recently?"
How to signal seniority:
- State trade-offs BEFORE being asked ("The trade-off here is X vs Y, and I'm choosing X because...")
- Propose SLOs proactively — don't wait for the interviewer to ask about availability
- Mention safety unprompted — "Before I route to inference, I want to run this through a safety classifier"
- Drive the conversation: "I think the most interesting deep-dive here is X. Want me to go deeper, or would you prefer Y?"
- Name failure modes before being asked: "What happens if this component goes down? Let me address that..."
Section 4: Interviewer-Specific Tips
Jake Luciani (Q4 — 12:00 PM)
His lens: Data. Storage. Consistency. He spent his career on Cassandra and distributed data systems. When you describe your architecture, he's mentally evaluating your data model before you even get to it.
- Have your schema ready in the first 15 minutes. Don't make him wait. Once you describe the HLD, immediately say "Let me walk through the data model" and write out partition keys and clustering keys.
- Know your consistency levels. "I'd read at QUORUM" isn't enough. Know why. Know when you'd relax to ONE. Know the replication factor and how it interacts with your consistency choice.
- Expect pushback on NoSQL. Jake knows Cassandra's strengths better than anyone, but he also knows its weaknesses. If you choose Cassandra, he'll test whether you genuinely understand the trade-offs or are just pattern-matching.
- Think in terms of read/write patterns. For every table you describe, state: what's the write path, what's the read path, what's the hot partition risk, what happens at 10× scale.
Bharat Mediratta (Q1 — 2:00 PM)
His lens: Serving infrastructure at planetary scale. He built Google Web Server. He was CTO at Dropbox when it served 500M+ users. He sees systems through the lens of: API contracts, multi-tenant fairness, SLO enforcement, cost efficiency, and operational maturity.
- Be precise on your API contract. Don't say "the client sends a request." Write out the endpoint, headers, and response schema. He'll respect the specificity.
- Enumerate failure modes systematically. For every component, state what happens when it fails. Don't wait to be asked. Bharat's GWS experience means he's thinking about every failure mode as you describe each component.
- Multi-tenant isolation is his wheelhouse. GWS serves requests for every Google product on shared infrastructure. He deeply understands weighted fair queuing, admission control, and blast radius containment. Go deep here — it's where you'll impress him most.
- Cost awareness signals maturity. A staff engineer doesn't just design for correctness — they design for cost. Mention GPU utilization percentages, cost per query, and optimization levers. Bharat at Coinbase would have cared about compute cost per transaction.
- Capacity planning methodology. Don't just say "add more GPUs." Walk through: SLO → maximum acceptable queue depth → required throughput → GPU count → headroom → failover capacity. Use Little's Law by name.
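The capacity chain in the last bullet (SLO → queue depth → throughput → GPU count → headroom) fits in two functions. A sketch using the numbers from Round 2: 8,300 req/s peak, 1.5s of allowed queue wait, 4s decode, 32 batch slots per GPU, and 50% headroom. The function names are illustrative.

```python
import math


def max_queue_depth(arrival_rate_per_s: float, max_wait_s: float) -> float:
    """Little's Law (L = lambda x W): items waiting at the allowed wait time."""
    return arrival_rate_per_s * max_wait_s


def gpus_required(arrival_rate_per_s: float, service_time_s: float,
                  slots_per_gpu: int, headroom: float = 0.5) -> int:
    """Little's Law again, applied to requests in service, plus headroom."""
    in_flight = arrival_rate_per_s * service_time_s
    return math.ceil(in_flight * (1 + headroom) / slots_per_gpu)


peak_qps = 8_300
queue_limit = max_queue_depth(peak_qps, max_wait_s=1.5)
fleet = gpus_required(peak_qps, service_time_s=4, slots_per_gpu=32)

print(f"admission-control queue limit: {queue_limit:,.0f} requests")
print(f"GPUs with 50% headroom:        {fleet:,}")
```

The queue limit doubles as the admission-control threshold: beyond about 12,450 queued requests, new arrivals cannot meet the TTFT target and are better rejected than queued.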
If the Question Is NOT Chat / Inference API
If Anthropic rotated the Q-numbers, here's what else might come up (from the known question bank) and how to pivot:
- Inference Batching System: Subset of the Inference API design — focus on the batching + GPU scheduling + async-to-sync mapping. Your Section 2 (F) covers this.
- P2P File Distribution: Large model binary → thousands of machines. Key: bandwidth is the constraint. BitTorrent-style chunking where each machine becomes a seed. Back-of-envelope: if source has 10 Gbps and file is 100GB, sending serially takes 80s per machine × 1000 = 22 hours. P2P with chunking: ~10 minutes total.
- Token Generation Service at 100K QPS: Horizontal scaling + request routing. This is a specialization of Section 2 — the same GPU scheduling and batching framework applies at higher QPS.
- Key-Value Store: Classic. Storage engine (LSM vs B-tree), replication (Raft or leaderless), partitioning (consistent hashing), consistency levels. Jake-friendly question.
- File Cache System: Eviction policies (LRU/LFU/ARC), cache coherence, write-through vs write-back, cache stampede prevention.
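For the P2P file distribution item, the serial figure and a rough P2P figure can both be derived in a few lines. The P2P model below is a deliberately simple assumption: whole-file doubling, where every machine that already holds the file sends it to one new machine per round, and every link runs at the same 10 Gbps. Chunked, pipelined distribution can beat this, so treat it as a rough bound.

```python
import math


def transfer_seconds(file_gb: float, link_gbps: float) -> float:
    """Time to push one full copy over one link (GB -> gigabits)."""
    return file_gb * 8 / link_gbps


def serial_seconds(file_gb: float, link_gbps: float, machines: int) -> float:
    """The source sends to each machine one after another."""
    return transfer_seconds(file_gb, link_gbps) * machines


def doubling_seconds(file_gb: float, link_gbps: float, machines: int) -> float:
    """Holders double each round: the source plus N machines need 2^k >= N + 1."""
    rounds = math.ceil(math.log2(machines + 1))
    return rounds * transfer_seconds(file_gb, link_gbps)


one_copy = transfer_seconds(100, 10)
serial = serial_seconds(100, 10, 1000)
p2p = doubling_seconds(100, 10, 1000)

print(f"one copy: {one_copy:.0f} s")
print(f"serial:   {serial / 3600:.1f} hours")
print(f"doubling: {p2p / 60:.1f} minutes")
```

Ten rounds of 80 seconds is about 13 minutes, the same order of magnitude as the ~10 minute figure above, and the gap from the 22-hour serial figure is what justifies the design.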
The portable patterns across ALL these: back-of-envelope math drives design, state trade-offs before asked, enumerate failure modes, mention safety, and propose SLOs proactively.
Built for Raunaq Naidu | Anthropic System Design | May 22, 2026 | Amika