Anthropic System Design — Panel-Tailored Prep

Friday, May 22, 2026 | Two system design rounds, back-to-back | Senior/Staff level

On the Q4/Q1 mapping: "Q4" and "Q1" are Anthropic's internal question bank identifiers shown in the candidate portal. Recent candidate reports on forums suggest Q4 = chat/conversation design and Q1 = inference API design. However, multiple sources note Anthropic periodically rotates these mappings. One April 2026 post explicitly warns "the question mapping has changed." Prepare for both reported topics in each round. The question bank is small — if they don't ask chat in Q4, they'll likely pull from: inference batching, P2P file distribution, token-gen service, key-value store, or file cache. This doc covers the two reported topics deeply and gives you portable patterns for alternatives.

Your Schedule

Time (PT)   | Round                 | Interviewers
------------|-----------------------|--------------------------------
12:00–12:55 | System Design Q4      | Jake Luciani, Alexander Knott
12:55–1:00  | Break                 |
1:00–1:50   | Non-Technical Culture | Natalie Bauman + Alon Krifcher
1:50–2:00   | Break                 |
2:00–2:55   | System Design Q1      | Bharat Mediratta, Anton Paquin

Round 1: System Design Q4 — Chat / Conversation System

Know Your Interviewer

Jake Luciani — 16 years in distributed systems. Former Chief Architect at DataStax (commercial Cassandra). Apache Cassandra PMC member. Built the Stargate data gateway. Contributed to Apache Thrift. His DNA is data modeling, partition key design, consistency models, and storage at scale.

What this means for you: Jake will almost certainly deep-dive on your data model. He'll want to know your partition key choices, why you chose NoSQL vs SQL, how you handle message ordering, and what your read/write patterns look like at scale. Have a schema ready on paper before you walk in.

A. What They're Really Asking

Stripped of Anthropic framing: "Design a real-time 1:1 messaging system with persistent history." This is WhatsApp/Slack at its core. At Anthropic, the twist is that one participant is an LLM, which adds token streaming, billing, and safety filtering — but the infrastructure problem is still a messaging system.

A senior/staff candidate recognizes: this is a conversation management + real-time delivery + storage problem. The AI model is a black box participant. Don't spend time on inference — focus on the system around it.

B. Requirements to Drive (State These in the First 3 Minutes)

Functional:

Non-functional:

C. Back-of-Envelope

D. High-Level Design

Components (describe in this order):

  1. API Gateway / Load Balancer — TLS termination, auth (API keys + session tokens), rate limiting per user/org. Routes HTTP requests and upgrades WebSocket connections.
  2. Chat Service (stateless) — Handles message submission, conversation CRUD, history retrieval. Stateless pods behind the LB. Publishes messages to the message bus.
  3. Connection Manager (stateful) — Manages persistent WebSocket/SSE connections. Maps user_id → connection_id → server_id. This is the only stateful component — use consistent hashing for connection routing.
  4. Message Bus — Kafka or similar. Partitioned by conversation_id for ordering. Decouples ingestion from processing and delivery.
  5. Safety Pipeline — Sits between user input and model dispatch. Constitutional Classifiers run on user messages before forwarding to inference. Output filters run on model responses before delivery. Async but bounded latency (< 50ms budget).
  6. Inference Router — Dispatches to the appropriate model cluster based on conversation's model selection. Handles streaming token responses back through the connection manager.
  7. Message Store — Persistent storage. Wide-column store (Cassandra) for conversation history. Redis for recent message cache and presence.
  8. Notification Service — Push notifications for offline users, email digests.
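
The consistent hashing mentioned for the Connection Manager can be sketched as below. This is a minimal illustration, not a production router: the class name HashRing, the server IDs, and the choice of 64 virtual nodes per server are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, servers, vnodes=64):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, server_id)
        for server in servers:
            self.add(server)

    def add(self, server):
        # Each server owns several ring positions to smooth the key distribution.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{server}#{i}"), server))

    def remove(self, server):
        # Only keys that mapped to this server move; all others keep their owner.
        self._ring = [(pos, s) for pos, s in self._ring if s != server]

    def lookup(self, user_id):
        """Return the server owning the first ring position at or after the key's hash."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_left(self._ring, (_hash(user_id), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around at the end


ring = HashRing(["cm-1", "cm-2", "cm-3"])
owner = ring.lookup("u_123")  # one of cm-1, cm-2, cm-3
```

The point to make in the interview: when a connection server is removed, only the users that were pinned to it need to reconnect elsewhere; everyone else keeps the same owner.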

Data flow: User sends message → API Gateway → Chat Service → Safety Pipeline (filter) → Message Bus → (1) Message Store (persist) + (2) Inference Router (generate response) → Model generates tokens → Safety Pipeline (output filter) → Connection Manager → stream tokens to user via WebSocket/SSE.

E. Data Model Deep Dive (Jake's Focus Area)

Conversations table:

Partition key: user_id
Clustering key: updated_at DESC, conversation_id

user_id      | updated_at           | conversation_id | title    | model_version | created_at
-------------|----------------------|-----------------|----------|---------------|----------
u_123        | 2026-05-22T14:30:00  | conv_abc        | "Debug…" | claude-4-opus | 2026-05-20

Why this partition key: A user's conversation list is the most common query. Clustering by updated_at DESC gives you "most recent first" for free — no sorting at read time.

Messages table:

Partition key: conversation_id
Clustering key: message_id (Snowflake/ULID — time-ordered, globally unique)

conversation_id | message_id       | sender     | role       | content     | token_count | created_at
----------------|------------------|------------|------------|-------------|-------------|----------
conv_abc        | 01HZXY...001     | u_123      | user       | "Help me…"  | 45          | ...
conv_abc        | 01HZXY...002     | model      | assistant  | "Sure, …"   | 312         | ...

Why wide-column (Cassandra) fits:

Why NOT relational (anticipate Jake's pushback):

Ordering guarantees:

F. Streaming & Real-Time Delivery

G. Consistency & Ordering

H. Safety Architecture (Lead With This)

State this proactively before Jake asks:

I. Questions Jake Will Ask (and How to Answer)

1. "Why not just use PostgreSQL with a conversation_id foreign key?"
Lead with: Works at moderate scale. But at 500M messages/day (about 5.8K writes/sec on average), B-tree write amplification becomes the bottleneck. Cassandra's LSM tree handles sustained writes at this rate. The real trade-off is giving up ad-hoc joins — compensate with a search index for cross-conversation queries.
2. "What's your partition key for messages, and what happens when a conversation gets very long?"
Lead with: Partition by conversation_id. Risk: very long conversations (10K+ messages) create large, unbounded partitions, which also become hot if the conversation is busy. Mitigation: bucket the partition key — conversation_id + time_bucket (e.g., daily). Trade-off: cross-bucket reads for full history require multiple partition reads, but recent-messages queries (90% of reads) hit a single bucket.
3. "How do you guarantee message ordering if your Chat Service has multiple replicas?"
Lead with: Within a conversation, there's a natural single-writer constraint (user waits for model response before sending next). Kafka partition by conversation_id preserves order. If you need concurrent senders (future group chat), use a sequence number assigned by a lightweight coordinator per conversation.
4. "What consistency level do you read at, and why?"
Lead with: Write at QUORUM (majority acknowledge). Read at QUORUM for recent messages (strong consistency within conversation). Read at ONE for historical messages (eventual consistency acceptable, lower latency). This gives you tunable consistency per access pattern.
5. "How do you handle a user reconnecting mid-stream during an AI response?"
Lead with: In-flight response tokens are buffered in Redis with a short TTL (5 min). Client reconnects with the Last-Event-ID header. Server replays from the buffer, starting after that event. If the buffer has expired, serve the partially completed response from the Message Store and trigger a continuation from the model.
6. "What happens if the safety classifier is slow? Do you block the response?"
Lead with: Safety has a latency budget (50ms). If exceeded: fail closed for input (reject the message), fail closed for output (stop streaming). At Anthropic, never deliver unfiltered content. Optimize classifier latency with model distillation and batching.
7. "How would you add full-text search across all conversations for a user?"
Lead with: Separate concern. Kafka consumer feeds messages to Elasticsearch, indexed by user_id + content + timestamp. Cassandra is the system of record; ES is a read-optimized projection. Accept eventual consistency (seconds of delay) on the search index.
8. "Walk me through what happens when a Cassandra node goes down."
Lead with: RF=3 across AZs. One node down → reads/writes at QUORUM still succeed (2 of 3). Hinted handoff queues writes for the downed node. When it recovers, the hints are replayed to it and anti-entropy repair covers anything the hints missed. User sees zero impact. Two nodes down → QUORUM fails, degrade to read at ONE (stale reads OK) while alerting.
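
The consistency-level and node-failure answers above rest on simple replica arithmetic, sketched here. This is a simplified model (single datacenter, no hinted handoff or read repair, no LOCAL_QUORUM), and the helper names are hypothetical.

```python
def quorum(rf: int) -> int:
    """Majority of replicas for a given replication factor."""
    return rf // 2 + 1


# Replicas that must respond at each consistency level (simplified).
_REQUIRED = {
    "ONE": lambda rf: 1,
    "QUORUM": quorum,
    "ALL": lambda rf: rf,
}


def replicas_required(level: str, rf: int) -> int:
    return _REQUIRED[level](rf)


def read_sees_latest_write(write_level: str, read_level: str, rf: int) -> bool:
    """True when read and write replica sets must overlap (R + W > RF)."""
    return replicas_required(write_level, rf) + replicas_required(read_level, rf) > rf


def available(level: str, rf: int, nodes_down: int) -> bool:
    """True when enough replicas survive to satisfy the consistency level."""
    return rf - nodes_down >= replicas_required(level, rf)


# RF=3: QUORUM is 2, so one node down is fine and two nodes down is not.
print(quorum(3))                                      # 2
print(read_sees_latest_write("QUORUM", "QUORUM", 3))  # True
print(read_sees_latest_write("QUORUM", "ONE", 3))     # False (stale reads possible)
print(available("QUORUM", 3, 2))                      # False
print(available("ONE", 3, 2))                         # True
```

This is why QUORUM writes paired with ONE reads are only acceptable for historical messages: the read may land on the one replica that has not yet received the write.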

Round 2: System Design Q1 — Inference API

Know Your Interviewer

Bharat Mediratta — Google Distinguished Engineer (built/maintained GWS — Google Web Server, the system serving ALL Google web traffic). CTO at Dropbox. CTO at AltSchool. First Engineering Fellow at Coinbase. He has 20+ years of building the most demanding serving infrastructure on the planet.

What this means for you: Bharat will probe API contract precision, multi-tenant fairness, SLOs, failure mode enumeration, capacity planning methodology, and cost optimization. He's seen every anti-pattern. He'll push you on: what happens at 10× expected load? How do you prevent one tenant from starving others? What are your SLIs and how do you alert on them? Don't hand-wave. Be precise.

A. What They're Really Asking

Stripped of framing: "Design a multi-tenant, multi-model API serving layer for computationally expensive requests on constrained hardware (GPUs)." This is a rate-limited API gateway + job scheduler + GPU resource manager. Think: Google's serving stack for expensive compute operations, not a simple CRUD API.

A senior/staff candidate recognizes: the core challenges are (1) multi-tenant isolation on shared GPU resources, (2) SLO management when inference times vary 10-100× by request, and (3) graceful degradation under load.

B. Requirements to Drive

Functional:

Non-functional:

C. Back-of-Envelope

D. High-Level Design

  1. API Gateway — TLS termination, authentication (API key → org_id lookup), request validation, rate limiting (token bucket per org + global), request ID assignment.
  2. Request Router — Routes to correct model cluster based on model + version in request. Weighted routing for canary deployments (e.g., 5% to new version). Admission control: rejects requests when queue depth exceeds threshold.
  3. Model Registry — Immutable model versions. Maps claude-4-sonnet-20260520 → model artifact location + serving config. Supports aliases (claude-4-sonnet-latest) and pinned versions.
  4. Request Queue — Per-model, priority-weighted queue. Priority levels: real-time sync > streaming > batch. Within same priority: weighted fair queuing by org (prevents one large org from starving others).
  5. GPU Scheduler — Core component. Manages GPU fleet. Implements continuous batching: new requests join in-flight batches during decode phase. Tracks KV cache per request. Handles preemption (low-priority batch requests yield to real-time requests).
  6. Inference Workers — Stateful pods, each owns 1-8 GPUs. Runs model inference. Reports health, utilization, and queue depth to scheduler. Handles token generation and streams tokens back.
  7. Streaming Proxy — Manages SSE connections to clients. Buffers tokens from inference workers, flushes to clients. Handles client disconnection gracefully (stops inference to reclaim GPU).
  8. Safety Pipeline — Pre-inference: input classifiers (prompt injection, harmful content). Post-inference: output classifiers on generated text. Runs on CPU to avoid consuming GPU budget.
  9. Observability Layer — SLI collection (TTFT, per-token latency, error rate per org/model), distributed tracing, GPU utilization dashboards, cost accounting per org.
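
The weighted fair queuing in the Request Queue can be sketched as below. This uses self-clocked fair queuing, a common approximation of WFQ, with cost measured in estimated GPU-seconds rather than request count. The class name FairQueue, the org names, and the cost estimates are assumptions for illustration; priority levels and preemption are left out.

```python
import heapq
import itertools


class FairQueue:
    """Self-clocked fair queue keyed by org weight (hypothetical helper)."""

    def __init__(self, weights):
        self._weights = dict(weights)               # org -> relative share
        self._last_finish = {org: 0.0 for org in weights}
        self._virtual_time = 0.0
        self._heap = []                             # (finish_tag, seq, org, request_id)
        self._seq = itertools.count()               # tie-breaker keeps FIFO order

    def enqueue(self, org, request_id, est_gpu_seconds):
        # A request starts after the org's previous request, or "now" if the org was idle.
        start = max(self._virtual_time, self._last_finish[org])
        # Higher weight means the virtual clock advances more slowly for that org.
        finish = start + est_gpu_seconds / self._weights[org]
        self._last_finish[org] = finish
        heapq.heappush(self._heap, (finish, next(self._seq), org, request_id))

    def dequeue(self):
        finish, _, org, request_id = heapq.heappop(self._heap)
        self._virtual_time = finish                 # advance the shared virtual clock
        return org, request_id

    def __len__(self):
        return len(self._heap)


q = FairQueue({"big": 1, "small": 1})
for i in range(4):
    q.enqueue("big", f"b{i}", est_gpu_seconds=1.0)
q.enqueue("small", "s0", est_gpu_seconds=1.0)
order = [q.dequeue() for _ in range(5)]
# "small" is served second, not fifth, despite arriving behind big's burst
```

With plain FIFO the small org would wait behind the entire burst; with fair queuing its first request is tagged as if it had arrived alongside the burst's first request.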

E. API Design Deep Dive (Bharat's Focus Area)

Primary endpoint:

POST /v1/messages
Headers:
  x-api-key: sk-ant-...
  anthropic-version: 2026-05-01
  x-request-id: req_uuid (idempotency key)

Body:
{
  "model": "claude-4-sonnet-20260520",
  "max_tokens": 4096,
  "messages": [{"role": "user", "content": "..."}],
  "stream": true,
  "metadata": {"user_id": "end-user-123"}  // for per-user safety tracking
}

Response (streaming):
event: message_start
data: {"id": "msg_01X...", "type": "message_start", "model": "claude-4-sonnet-20260520"}

event: content_block_delta
data: {"type": "content_block_delta", "delta": {"text": "Hello"}}
...
event: message_stop
data: {"type": "message_stop", "usage": {"input_tokens": 45, "output_tokens": 312}}
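
A client-side reader for the streaming response above might look like the following sketch. It parses Server-Sent Events from an iterable of text lines and assembles the text deltas. The function names parse_sse and collect_text are hypothetical, and the event shapes are the ones shown in this doc's example, which may differ from the real API.

```python
import json


def parse_sse(lines):
    """Yield (event_name, payload_dict) for each blank-line-terminated SSE event."""
    event, data = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event accumulated so far.
            if data:
                yield event or "message", json.loads("\n".join(data))
            event, data = None, []
        elif line.startswith(":"):
            continue  # SSE comment, often used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))
    # An event without a terminating blank line is dropped, as incomplete.


def collect_text(lines):
    """Concatenate text deltas and return (text, usage) from a finished stream."""
    parts, usage = [], None
    for event, payload in parse_sse(lines):
        if event == "content_block_delta":
            parts.append(payload["delta"]["text"])
        elif event == "message_stop":
            usage = payload.get("usage")
    return "".join(parts), usage


stream = [
    'event: content_block_delta\n',
    'data: {"type": "content_block_delta", "delta": {"text": "Hello"}}\n',
    '\n',
    'event: message_stop\n',
    'data: {"type": "message_stop", "usage": {"input_tokens": 45, "output_tokens": 312}}\n',
    '\n',
]
text, usage = collect_text(stream)  # text == "Hello"
```

In a real client the lines would come from a streaming HTTP response body rather than a list.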

Rate limiting (Bharat will probe this):

Idempotency (Bharat cares about this from GWS):

Error handling:

Versioning strategy:

F. GPU Scheduling & Batching

G. Multi-Tenant Isolation (Bharat's Expertise — He Built GWS)

H. SLOs & Reliability

SLIs (what you measure):

SLOs (what you promise):

Failure modes Bharat will ask about:

I. Model Loading & Rollouts

J. Safety Architecture (Lead With This)

K. Questions Bharat Will Ask (and How to Answer)

1. "How do you prevent one tenant from starving all others during a traffic spike?"
Lead with: Weighted fair queuing at the request queue level. Each org gets a weight. When queue depth exceeds capacity, scheduler dequeues proportionally to weights. Combine with per-org concurrency caps. Measure fairness as "GPU-seconds per org" not "requests per org" — a long Opus request consumes 20× a short Haiku request.
2. "What's your capacity planning methodology?"
Lead with: Start from SLO backwards. Target TTFT P99 < 2s means queue wait must be < 1.5s (reserving 500ms for prefill). Apply Little's Law: max queue depth = arrival rate × max wait time. GPU fleet sized to drain the queue within that wait time at peak. Add 50% headroom for burst + failover. Review monthly against actual utilization; autoscale horizontally for organic growth.
3. "Walk me through what happens when a GPU OOMs mid-inference."
Lead with: Immediate: that request fails, client gets an error event in the SSE stream. All other requests on the same GPU are affected (shared batch). Worker reports unhealthy → scheduler stops routing to that worker. Worker restarts, reloads model weights from NVMe cache (~30s). Requests that were in-flight are retried by clients using idempotency keys. Systemic: if OOMs are frequent for a model version, circuit breaker triggers auto-rollback to previous version.
4. "How do you handle a request that takes 10× longer than expected?"
Lead with: Server-side timeout per request (max_tokens × expected_per_token_latency × 3). If exceeded, terminate inference, return partial response + timeout error. The real risk is KV cache: a runaway request holds GPU memory, blocking new requests. Enforce a max_tokens hard cap (server-side, regardless of client request). For batch API: separate timeout (hours, not seconds).
5. "How do you do a safe model rollout without risking the entire fleet?"
Lead with: Canary → ramp → full rollout. Canary: 2% of traffic to new version. Monitor: TTFT, error rate, safety trigger rate, GPU utilization. Automated promotion gates: if all SLIs within X% of baseline for 1 hour, ramp to 10%. Human approval gate before 50%. Blast radius: canary runs in a single AZ initially. Rollback: flip the model registry alias back to previous version — takes effect within seconds for new requests.
6. "What's the cost per query, and how do you optimize it?"
Lead with: Cost = GPU-seconds per query. Optimization levers: (1) continuous batching increases GPU utilization from ~30% to 80%, (2) KV cache sharing for common prompt prefixes (system prompts), (3) quantization (INT8/FP8) reduces memory → larger batch sizes, (4) speculative decoding for smaller models, (5) request routing: short requests to smaller/cheaper models when appropriate. Track cost per 1M tokens as the business metric.
7. "How do you handle the client disconnecting mid-stream?"
Lead with: Streaming proxy detects the closed connection (TCP FIN). It holds the stream for a short grace period (5s): if the client reconnects within 5s and inference is still running, resume streaming. After 5s, it propagates cancellation upstream to the inference worker, which stops generating tokens and frees the KV cache. This is critical for cost — a disconnected client still consuming GPU is pure waste. Once inference is cancelled, the client must re-request.
8. "How do you handle a complete region failure?"
Lead with: Multi-region active-active. DNS-based failover (Route53 health checks). Each region sized at 150% of expected load. When a region fails: (1) DNS health check fails within 30s, (2) traffic shifts to surviving regions, (3) admission control may shed lowest-priority traffic if combined load exceeds capacity, (4) autoscaler provisions additional GPUs (minutes-scale). No data loss: request state is ephemeral (in-flight only), conversation history in the chat system is the durable store.
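
The capacity planning answer (work backwards from the SLO, then apply Little's Law) can be made concrete with a few lines of arithmetic. The traffic numbers below (200 requests/sec at peak, 0.25 amortized GPU-seconds per request) are made-up inputs for illustration, and the function names are hypothetical; only the 2s TTFT target, the 500ms prefill reserve, and the 50% headroom come from the answer above.

```python
import math


def queue_wait_budget(ttft_slo_s: float, prefill_s: float) -> float:
    """Time a request may sit in the queue before prefill must start."""
    return ttft_slo_s - prefill_s


def max_queue_depth(arrival_rps: float, max_wait_s: float) -> float:
    """Little's Law (L = lambda * W): queue depth consistent with the wait budget."""
    return arrival_rps * max_wait_s


def gpus_needed(arrival_rps: float, gpu_seconds_per_request: float, headroom: float = 0.5) -> int:
    """GPUs needed to keep up with arrivals, plus headroom for burst and failover."""
    # Little's Law again: average GPUs busy = arrival rate * GPU-time per request.
    busy = arrival_rps * gpu_seconds_per_request
    return math.ceil(busy * (1 + headroom))


wait = queue_wait_budget(ttft_slo_s=2.0, prefill_s=0.5)  # 1.5 seconds
depth = max_queue_depth(arrival_rps=200, max_wait_s=wait)  # 300 requests
fleet = gpus_needed(arrival_rps=200, gpu_seconds_per_request=0.25)  # 75 GPUs
```

If the observed queue depth sits above the computed limit at peak, the wait budget is already being violated, which makes queue depth a useful admission-control threshold as well as a planning number.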

Section 3: Pacing Cheat Sheet

55-Minute Breakdown (Both Rounds)

Time      | Focus
----------|------
0–3 min   | Clarify scope + state requirements. Don't ask too many questions — propose constraints and confirm.
3–7 min   | Back-of-envelope. 2-3 key numbers. State them and move on.
7–22 min  | High-level design. Walk through all components. Mention safety proactively.
22–47 min | Deep dives. For Q4: data model + streaming. For Q1: API design + multi-tenant isolation. The interviewer will steer — follow their interest.
47–52 min | Trade-offs and what you'd do differently. Mention monitoring, cost, evolution.
52–55 min | Questions for interviewer. "What's the hardest scaling challenge your team faced recently?"

How to signal seniority:


Section 4: Interviewer-Specific Tips

Jake Luciani (Q4 — 12:00 PM)

His lens: Data. Storage. Consistency. He spent his career on Cassandra and distributed data systems. When you describe your architecture, he's mentally evaluating your data model before you even get to it.

Bharat Mediratta (Q1 — 2:00 PM)

His lens: Serving infrastructure at planetary scale. He built Google Web Server. He was CTO at Dropbox when it served 500M+ users. He sees systems through the lens of: API contracts, multi-tenant fairness, SLO enforcement, cost efficiency, and operational maturity.

If the Question Is NOT Chat / Inference API

If Anthropic rotated the Q-numbers, here's what else might come up (from the known question bank) and how to pivot:

The portable patterns across ALL these: back-of-envelope math drives design, state trade-offs before asked, enumerate failure modes, mention safety, and propose SLOs proactively.

Built for Raunaq Naidu | Anthropic System Design | May 22, 2026 | Amika