Anthropic System Design -- Cheat Sheet

Friday May 22, 2026 · 12:00 PM PT · Two 60-minute rounds · Coffee-table reference

60-Minute Pacing

0 - 5 min: Requirements. Clarify scope, users, scale. State the safety constraint upfront.
5 - 10 min: Envelope math. One or two key calculations. Anchor the design in real numbers.
10 - 25 min: High-level design. Components, data flow, API contract. Sketch the happy path.
25 - 50 min: Deep dives. Two areas max. Failure modes, safety integration, scaling bottlenecks.
50 - 55 min: Wrap. Summarize trade-offs, acknowledge what you'd improve with more time.
55 - 60 min: Questions for them. Ask about the team's actual infra challenges.

At staff level, you drive. Don't ask "what should I focus on?" -- propose a direction and ask if they want to go deeper.

The Safety-First Framework

Reusable 3-step pattern for weaving safety into any answer. This is what separates your answer from a FAANG answer.

Step 1

State It Upfront

Name the safety constraint in requirements. "A system that's fast but produces harmful output is broken -- even if availability SLIs are green."

Step 2

Embed in Architecture

Show where the safety layer lives in the HLD. Not bolted on -- baked in. Multi-layer: rule filters (microseconds), ML classifiers (parallel), human escalation.

Step 3

Failure Modes

What happens when the safety layer fails? Fail-closed vs fail-open. Confidence-based routing. Feedback loops from human review back into classifiers.
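
A minimal sketch of the fail-closed choice, useful to have in mind when the interviewer asks "what if the classifier is down?". Everything here is hypothetical: `classify` stands in for any classifier call that returns a harm score in [0, 1], and `SafetyVerdict`, `check_fail_closed`, the 100 ms timeout, and the 0.5 threshold are illustrative names and values, not taken from any real system.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SafetyVerdict:
    allowed: bool
    reason: str


# Small shared pool so a slow classifier call can be abandoned after a timeout.
_POOL = ThreadPoolExecutor(max_workers=4)


def check_fail_closed(
    text: str,
    classify: Callable[[str], float],  # hypothetical: returns harm score in [0, 1]
    timeout_s: float = 0.1,            # assumed latency budget
    block_threshold: float = 0.5,      # assumed threshold
) -> SafetyVerdict:
    """Allow the text only if the classifier answered in time and scored it low."""
    future = _POOL.submit(classify, text)
    try:
        score = future.result(timeout=timeout_s)
    except FutureTimeout:
        # Fail closed: no answer in time means the response is not served.
        return SafetyVerdict(False, "classifier_timeout")
    except Exception:
        # Fail closed: a crashed classifier is treated the same as a flag.
        return SafetyVerdict(False, "classifier_error")
    if score >= block_threshold:
        return SafetyVerdict(False, "flagged")
    return SafetyVerdict(True, "ok")
```

A fail-open variant would return `allowed=True` in the two `except` branches. That trades safety for availability, which is the trade-off to name out loud.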

Top 5 Question Quick-Reference

1. Inference Batching System

Single GPU, up to 100 inputs/batch. Users submit synchronously, wait for results.

Key Requirements

Synchronous user-facing API with async backend batching
Max batch size = 100, GPU is the constrained resource
Correct response routing back to the waiting user
Graceful handling when volume < batch size (timeout-based flush)

Envelope: If avg request = 500 tokens, GPU processes 100-req batch in ~2s, max throughput = 50 req/s per GPU. At 10K req/s you need ~200 GPUs. Queuing delay = f(batch fill time); set max wait = 100ms before flushing partial batch.

HLD: API Gateway → Request Queue (per-GPU or shared) → Batch Assembler (fills to 100 or timeout) → GPU Inference → Response Router (maps result[i] back to caller via request ID + future/promise) → User.

Deep dive 1: Async-to-sync mapping -- each request gets a unique ID; caller blocks on a future; batch assembler resolves the future when GPU returns. Handles partial batches via timeout flush.
Deep dive 2: Continuous batching -- new requests join an in-progress batch as slots free up (a slot frees when its sequence finishes generating). Maximizes GPU utilization without restarting batch cycles.
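
The async-to-sync mapping from deep dive 1 can be sketched with `asyncio` futures. This is one possible shape under stated assumptions, not a production design: `BatchAssembler` and `run_batch` are hypothetical names, `run_batch` stands in for the GPU call (a list of inputs in, a same-length list of outputs out), and it runs inline here, whereas a real GPU call would be awaited or offloaded.

```python
import asyncio
from typing import Any, Callable, List


class BatchAssembler:
    """Flushes a batch at max_batch items or max_wait_s after the first item."""

    def __init__(self, run_batch: Callable[[List[Any]], List[Any]],
                 max_batch: int = 100, max_wait_s: float = 0.1) -> None:
        self._run_batch = run_batch  # hypothetical stand-in for GPU inference
        self._max_batch = max_batch
        self._max_wait_s = max_wait_s
        self._queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item: Any) -> Any:
        """Caller-facing call: enqueue, then wait on a per-request future."""
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((item, fut))
        return await fut

    async def run(self) -> None:
        """Background loop: assemble batches and resolve each caller's future."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]  # block until the first request
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break  # timeout flush of a partial batch
                try:
                    batch.append(
                        await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                results = self._run_batch([item for item, _ in batch])
            except Exception as exc:
                for _, fut in batch:
                    if not fut.done():
                        fut.set_exception(exc)
                continue
            # result[i] belongs to the caller who submitted batch[i].
            for (_, fut), result in zip(batch, results):
                if not fut.done():
                    fut.set_result(result)


async def demo() -> List[str]:
    assembler = BatchAssembler(lambda xs: [x.upper() for x in xs],
                               max_batch=3, max_wait_s=0.05)
    worker = asyncio.create_task(assembler.run())
    out = await asyncio.gather(*(assembler.submit(s) for s in ["a", "b", "c", "d"]))
    worker.cancel()
    return out
```

In `demo`, the first three requests fill a batch and the fourth is served by the timeout flush; each caller still gets its own result back because the future travels with the request.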

Anthropic angle: Output passes through safety classifiers before resolving the future. Batch the safety checks too -- amortize classifier cost across the batch. If a response is flagged, return a safe refusal to that specific caller without blocking others.

2. Peer-to-Peer File Distribution

Large file (model binary) to thousands of machines. Single bandwidth-constrained source.

Key Requirements

Distribute multi-GB file to 10K+ machines
Source has limited upload bandwidth (e.g. 10 Gbps)
Each machine has limited ingress/egress bandwidth
Minimize total distribution time, verify integrity

Envelope: 50 GB file, source at 10 Gbps = 40s to upload one copy. Sequential to 10K machines = 111 hours. Chunk into 256 MB pieces (200 chunks), each machine seeds completed chunks while downloading others. BitTorrent-style: O(log N) rounds to saturate, ~minutes not hours.
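
The envelope numbers above can be re-derived in a few lines, which is handy for rehearsing them. The constants come from the envelope; the only added assumption is 1 GB = 1024 MB for the chunk count, and the doubling-round count is the idealized case where the number of machines holding a piece doubles each round.

```python
import math

FILE_GB = 50
SOURCE_GBPS = 10
MACHINES = 10_000
CHUNK_MB = 256

# Time for the source to push one full copy: gigabytes -> gigabits, over link speed.
one_copy_s = FILE_GB * 8 / SOURCE_GBPS

# Naive sequential distribution: one full copy per machine, back to back.
sequential_h = one_copy_s * MACHINES / 3600

# Number of chunks, assuming 1 GB = 1024 MB.
chunks = math.ceil(FILE_GB * 1024 / CHUNK_MB)

# Idealized doubling: holders double each round, so ~log2(N) rounds to reach all.
doubling_rounds = math.ceil(math.log2(MACHINES))

print(f"one copy: {one_copy_s:.0f} s")
print(f"sequential: {sequential_h:.0f} h")
print(f"chunks: {chunks}")
print(f"doubling rounds: {doubling_rounds}")
```

This gives 40 s per copy, about 111 hours sequentially, 200 chunks, and 14 doubling rounds for 10K machines.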

HLD: Tracker/Coordinator (knows who has which chunks) → Source seeds initial chunks to distinct peers → Peers exchange chunks (rarest-first strategy) → Integrity via chunk checksums (SHA-256) → Completion signal back to coordinator.

Deep dive 1: One-at-a-time vs fan-out trade-off. Sequential = full bandwidth per transfer but O(N) total. Fan-out = split bandwidth but parallel. P2P = best of both -- source sends each chunk once, peers multiply it.
Deep dive 2: Rack-awareness -- prefer intra-rack transfers (higher bandwidth, lower latency). Reduces cross-rack traffic. Coordinator assigns chunk targets by topology.
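
Rarest-first selection with a rack-aware tie-break can be sketched as below. The function name, the peer-to-chunk map, and the `same_rack` parameter are all hypothetical; a real coordinator would also weigh peer load and in-flight transfers.

```python
from collections import Counter
from typing import Dict, Optional, Set, Tuple


def pick_rarest_chunk(
    my_chunks: Set[int],
    peer_chunks: Dict[str, Set[int]],     # hypothetical: peer id -> chunks it holds
    same_rack: Optional[Set[str]] = None,  # hypothetical: peer ids in our rack
) -> Optional[Tuple[int, str]]:
    """Return (chunk_id, peer_id) for the rarest chunk we still need, or None."""
    same_rack = same_rack or set()
    # How many peers hold each chunk.
    availability = Counter(c for held in peer_chunks.values() for c in held)
    needed = [c for c in availability if c not in my_chunks]
    if not needed:
        return None
    # Fewest holders first; chunk id breaks ties deterministically.
    rarest = min(needed, key=lambda c: (availability[c], c))
    holders = [p for p, held in peer_chunks.items() if rarest in held]
    # Prefer a holder in the same rack, then fall back to lowest peer id.
    source = min(holders, key=lambda p: (p not in same_rack, p))
    return rarest, source
```

For example, with `my_chunks={0}` and peers `{"a": {0, 1, 2}, "b": {1}, "c": {1, 2}}`, chunk 2 is the rarest needed chunk; it is fetched from `"a"` by default, or from `"c"` if `same_rack={"c"}`.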

Anthropic angle: These are model weights being distributed. Weight theft is an ASL-3 security concern. Encrypt chunks in transit + at rest. Authenticate peers via mTLS. Egress bandwidth monitoring to detect exfiltration -- Anthropic literally does this per their ASL-3 security controls.

3. API Gateway & Rate Limiter

Developer-facing API for model access. Auth, rate limiting, routing, observability.

Key Requirements

10K+ concurrent connections, tiered rate limits (free/paid/enterprise)
Token bucket or sliding window per API key
Versioned endpoints (/v1/generate, /v1/embed, /v1/moderate)
Streaming responses (SSE) for long generations

Envelope: 100K API keys, avg 10 req/min per key = ~17K req/s. Rate limit state in Redis: 100K keys x 32 bytes = ~3 MB (trivially fits). Token bucket: refill rate = tier limit / 60, burst = 2x base. 429 with Retry-After header on exceed.
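
A minimal in-process token bucket matching the envelope's parameters. Two assumptions are baked in and worth stating: "burst = 2x base" is read here as a bucket capacity of twice the per-minute limit, and state lives in a Python object, whereas the design above keeps it in Redis and would need an atomic update there. The class and method names are hypothetical.

```python
import time
from typing import Callable, Tuple


class TokenBucket:
    """Per-API-key bucket: refills at limit/60 tokens per second."""

    def __init__(self, limit_per_min: float,
                 now: Callable[[], float] = time.monotonic) -> None:
        self.refill_per_s = limit_per_min / 60.0
        self.capacity = 2.0 * limit_per_min  # assumed reading of "burst = 2x base"
        self.tokens = self.capacity          # start full
        self._now = now
        self._last = now()

    def try_acquire(self, cost: float = 1.0) -> Tuple[bool, float]:
        """Return (allowed, retry_after_s); retry_after_s feeds the Retry-After header."""
        t = self._now()
        # Lazy refill: credit tokens for the time elapsed since the last call.
        self.tokens = min(self.capacity,
                          self.tokens + (t - self._last) * self.refill_per_s)
        self._last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True, 0.0
        # Denied: report how long until enough tokens have refilled.
        return False, (cost - self.tokens) / self.refill_per_s
```

On a denial, the gateway would answer 429 and round `retry_after_s` up to whole seconds for the Retry-After header.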

HLD: Client → TLS termination (edge) → Auth middleware (API key lookup + tier) → Rate limiter (Redis token bucket) → Request router (version + model routing) → Inference cluster. Response path: inference → safety filter → SSE stream to client. Async logging to Kafka.

Deep dive 1: Backpressure -- when inference queue is full, return 503 + queue position or reject immediately? Enterprise gets priority queue. Circuit breaker on inference backend. Graceful degradation: route to smaller fallback model if primary is overloaded.
Deep dive 2: Observability -- per-customer latency dashboards, token usage, error rates. Distributed tracing (request ID through entire pipeline). Real-time anomaly detection on usage patterns.

Anthropic angle: The gateway is the policy enforcement point. Rate limits aren't just about fair usage -- they prevent automated abuse at scale. Log every request for compliance audit trail (SOC 2). The /v1/moderate endpoint exposes safety classification directly to developers. Tiered access: enterprise customers may get exemptions from certain classifier thresholds for legitimate dual-use research (Anthropic does this with ASL-3 access controls).

4. Safety & Moderation Pipeline

Real-time content filtering for model outputs. Multi-layer, balancing accuracy vs latency.

Key Requirements

Sub-100ms added latency budget for safety checks
Multiple harm categories (CBRN, hate speech, PII, bias)
False negatives are worse than false positives (at Anthropic)
Feedback loop: human review improves automated classifiers

Envelope: Rule-based filters: <1ms (regex, deny lists). ML classifiers: 10-30ms each, run in parallel across categories. Confidence-based routing: high-confidence safe (>0.95) = pass-through; high-confidence violation (>0.98) = auto-block; uncertain middle band = additional processing or human queue. Target: 95%+ of requests resolved in <50ms.

HLD: Model output → Layer 1: Rule-based filters (keyword, regex, microseconds) → Layer 2: ML classifiers (parallel per harm category, 10-30ms) → Decision engine (confidence routing) → [Pass / Block / Escalate to human queue]. Async: all decisions logged to compliance pipeline. Human review results feed back into classifier retraining.
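
The decision engine's confidence routing can be sketched as a small pure function. The assumptions: each per-category classifier emits a violation probability, "confidence safe" is taken as one minus the worst violation probability, and missing scores are treated as uncertain rather than safe. `Decision` and `route` are hypothetical names; the 0.95 and 0.98 thresholds come from the envelope.

```python
from enum import Enum
from typing import Dict


class Decision(Enum):
    PASS = "pass"
    BLOCK = "block"
    ESCALATE = "escalate"


def route(
    violation_probs: Dict[str, float],  # hypothetical: harm category -> P(violation)
    safe_threshold: float = 0.95,
    block_threshold: float = 0.98,
) -> Decision:
    """Route on the worst category score across all parallel classifiers."""
    if not violation_probs:
        # No classifier output at all: treat as uncertain, never as safe.
        return Decision.ESCALATE
    worst = max(violation_probs.values())
    if worst > block_threshold:
        return Decision.BLOCK            # high-confidence violation
    if (1.0 - worst) > safe_threshold:
        return Decision.PASS             # high-confidence safe
    return Decision.ESCALATE             # uncertain middle band
```

Routing on the maximum across categories means one confident classifier is enough to block, which matches the requirement that false negatives cost more than false positives.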

Deep dive 1: Constitutional Classifiers -- Anthropic's actual approach. Train input/output classifiers on synthetic harmful + harmless examples. Guard model inputs AND outputs. Constitutional AI as the training signal, not just human labels.
Deep dive 2: Adversarial robustness. Universal jailbreaks are the threat model. Bug bounty + automated red-teaming to find bypass patterns. When a jailbreak is discovered, generate synthetic variants and retrain the classifier rapidly.

Anthropic angle: This IS Anthropic's core competency. Reference their 3-part ASL-3 approach: (1) make the system harder to jailbreak (Constitutional Classifiers), (2) detect jailbreaks when they occur (monitoring + bug bounty), (3) iteratively improve (synthetic jailbreak generation for retraining). A system that passes all availability SLIs but serves harmful content is broken.

5. Token-Generation Service at Scale

100K requests/second for LLM token generation. Throughput, horizontal scaling, routing.

Key Requirements

100K req/s sustained throughput
Sub-200ms time-to-first-token for small models, ~500ms-1s for large
Horizontal scaling across GPU clusters
Request routing by model size, estimated token count, priority tier

Envelope: If one GPU serves ~50 req/s (continuous batching), need ~2,000 GPUs for 100K req/s. At $2/hr per GPU = $4K/hr = $35M/yr in compute alone. KV cache: ~2 GB per concurrent session on large model; 10K concurrent = 20 TB aggregate. Quantization (INT8/INT4) reduces memory 2-4x.
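
The same envelope as a short script, so each number can be re-derived quickly on a whiteboard. All constants are the ones stated above; the only added assumptions are a 365-day year and 1 TB = 1000 GB.

```python
import math

TARGET_RPS = 100_000
RPS_PER_GPU = 50
USD_PER_GPU_HOUR = 2.0
KV_GB_PER_SESSION = 2
CONCURRENT_SESSIONS = 10_000

# GPUs needed to sustain the target request rate.
gpus = math.ceil(TARGET_RPS / RPS_PER_GPU)

# Compute cost per hour and per 365-day year.
usd_per_hour = gpus * USD_PER_GPU_HOUR
usd_per_year = usd_per_hour * 24 * 365

# Aggregate KV cache across concurrent sessions, using 1 TB = 1000 GB.
kv_cache_tb = KV_GB_PER_SESSION * CONCURRENT_SESSIONS / 1000

print(f"GPUs: {gpus}")
print(f"cost: ${usd_per_hour:,.0f}/hr, ${usd_per_year / 1e6:.1f}M/yr")
print(f"KV cache: {kv_cache_tb:.0f} TB")
```

This reproduces 2,000 GPUs, $4K/hr, roughly $35M/yr, and 20 TB of aggregate KV cache.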

HLD: API Gateway → Model Router (routes by model size + estimated complexity + GPU memory availability) → GPU Cluster (warm pools per model tier, continuous batching) → KV Cache Manager (per-session, evict on completion) → Token Streamer (SSE to client, token-by-token). Autoscaler watches queue depth + GPU utilization; warm pools sized for P95, autoscale for P99+ spikes.

Deep dive 1: Prefill vs decode phases. Prefill is compute-bound (process the full prompt in one pass), decode is memory-bandwidth-bound (one token at a time, re-reading the KV cache at every step). Different bottlenecks = different scaling strategies. Disaggregated serving: separate prefill and decode GPUs.
Deep dive 2: Graceful degradation. When at capacity: (1) route to smaller quantized model, (2) increase batch timeout to improve throughput, (3) shed load from free tier first. Circuit breakers per cluster. Never let overload cascade into safety-check bypass.
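
One way to express the degradation order as an admission policy. This is a sketch under assumptions: the tier names, the 0.80 and 0.95 utilization cut-offs, and the `admit` function are all hypothetical, and step (2), lengthening the batch timeout, is a batching knob that is left out here.

```python
from enum import Enum


class Action(Enum):
    SERVE_PRIMARY = "serve_primary"
    SERVE_FALLBACK = "serve_fallback"  # smaller quantized model
    REJECT = "reject"


def admit(tier: str, utilization: float, safety_healthy: bool) -> Action:
    """Decide how to handle a request given load and safety-layer health."""
    if not safety_healthy:
        # Overload or outage must never become a safety bypass: reject instead.
        return Action.REJECT
    if utilization < 0.80:     # assumed "normal" band
        return Action.SERVE_PRIMARY
    if utilization < 0.95:     # assumed "degraded" band: fall back for everyone
        return Action.SERVE_FALLBACK
    # At capacity: shed free-tier load first, keep paying tiers on the fallback.
    if tier == "free":
        return Action.REJECT
    return Action.SERVE_FALLBACK
```

The safety check comes first in the function on purpose: no load condition further down can override it.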

Anthropic angle: Safety checks can't be skipped under load. When the system degrades, it must degrade toward MORE caution, not less. Overloaded = reject requests, don't serve unchecked responses. Safety classifiers need their own dedicated capacity that doesn't compete with inference GPUs. Monitor safety-check latency as a first-class SLI alongside token latency.

Anthropic-Specific Talking Points

Constitutional AI: Models trained via a "constitution" of principles + AI feedback, not just human RLHF. Self-critique and revision. The constitution constrains behavior at the training level, not just at inference-time filtering.

ASL Framework (Safety Levels): ASL-1 through ASL-4+. Currently operating under ASL-3 for Claude Opus 4. Higher ASL = stricter deployment + security controls. ASL-3 specifically targets CBRN weapon misuse and model weight theft from sophisticated non-state actors.

Constitutional Classifiers: Real-time input/output guards trained on synthetic harmful + harmless CBRN data. The actual mechanism behind ASL-3 deployment protections. Substantially reduces universal jailbreak success rates.

RSP v3.0 (Feb 2026): Separates unilateral company commitments from industry-wide recommendations. Introduced Frontier Safety Roadmap (public goals, publicly graded) and mandatory Risk Reports every 3-6 months. Acknowledges that ASL-4+ may require government coordination.

Karpathy just joined Anthropic's pretraining team (this week). Massive signal of talent consolidation. Good to reference naturally -- shows you're tracking the company.

Safety is a product feature, not a tax. Anthropic's framing: safety constraints drive better engineering. The safety pipeline is the most interesting distributed systems problem in the stack. Frame your enthusiasm around this.

Egress bandwidth controls: A unique ASL-3 security measure. Rate-limit outbound data from GPU clusters to make model weight exfiltration detectable before completion. Great example of creative infra thinking for security.

Common Pitfalls

Don't

Get intimidated by AI framing. "Batch inference on a GPU" = "batched processing on a constrained compute resource." The model is a black box. Strip the AI terminology in the first 30 seconds and solve the infra problem.

Don't

Wait for the interviewer to steer you. At staff level, they give the prompt and go quiet. "What should I focus on?" is a red flag. Propose a direction: "I think the most interesting trade-off here is X -- I'd like to go deep there unless you'd prefer Y."

Don't

Present a textbook answer and stop. Anthropic's questions are often problems their teams are actively solving. There may be no single correct answer. Show you can explore trade-offs and consider alternatives, not reproduce a memorized architecture.

Don't

Treat safety as an afterthought. "And then we could add a moderation layer" at minute 50 is a fail. At Anthropic, safety is a first-class architectural constraint from minute 1. If your design can bypass safety under load, your design is broken.

Built for Raunaq · May 21, 2026 · Good luck tomorrow.