Deep Questions & Answers — May 22, 2026 Panel Prep
Interviewers: Jake Luciani (Cassandra PMC, ex-DataStax Chief Architect) & Alexander Knott
Friday May 22, 12:00 – 12:55 PM PT
IC7 opening (first 30 seconds): "Before I draw anything — let me scope this. We're building the conversational interface for Claude: a user sends a text message, the system orchestrates inference, and streams tokens back. I'll focus on 1:1 chat, not group. I'll design for 50M monthly active users, 10 messages/conversation average, with P99 time-to-first-token under 500ms. Safety filtering is a first-class architectural constraint, not a bolt-on. I'll explicitly call out where I'm making trade-offs."
Requirements I'm driving:
Back-of-envelope:
Key architectural decisions:
IC7 signal: "I'll walk through both the Cassandra message table and the PostgreSQL metadata table, and explain why I split them — then I'll show you the partition key reasoning, because that's where this design lives or dies at scale."
| Column | Type | Key | Notes |
|---|---|---|---|
| conversation_id | UUID | Partition Key | All messages for one conversation co-located |
| message_id | ULID | Clustering Key ↓ DESC | Time-ordered, globally unique, lexicographic sort |
| role | TEXT | | 'user' \| 'assistant' \| 'system' |
| content | TEXT | | Message body (may be large for Claude responses) |
| token_count | INT | | Metering: input or output tokens |
| model_version | TEXT | | 'claude-4-opus-20260501' |
| safety_result | TEXT | | 'pass' \| 'filtered' \| 'flagged' |
| created_at | TIMESTAMP | | Wall clock (ULID encodes this too) |
| metadata | MAP<TEXT,TEXT> | | Extensible: stop_reason, latency_ms, etc. |
| Column | Type | Key | Notes |
|---|---|---|---|
| conversation_id | UUID | PK | Matches Cassandra partition key |
| user_id | UUID | INDEX | Owner |
| title | TEXT | | Auto-generated or user-set |
| model | TEXT | | Primary model used |
| message_count | INT | | Denormalized counter |
| last_message_at | TIMESTAMP | INDEX | For "recent conversations" query |
| created_at | TIMESTAMP | | |
| is_archived | BOOLEAN | | Soft delete |
SELECT * FROM conversations WHERE user_id = ? ORDER BY last_message_at DESC LIMIT 20
Why ULIDs, not UUIDs or timestamps:
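To make the "time-ordered, lexicographic sort" property concrete, here is a minimal ULID sketch. It assumes the standard ULID layout (a 48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters). The names `new_ulid` and `ulid_timestamp_ms` are illustrative and not part of the design above.

```python
import os
import time

# Crockford base32 alphabet (no I, L, O, U). Its characters are in ascending
# ASCII order, so comparing ULID strings compares the underlying integers.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(ts_ms=None):
    """Return a 26-character ULID: 48-bit ms timestamp + 80 random bits."""
    if ts_ms is None:
        ts_ms = int(time.time() * 1000)
    value = (ts_ms << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):  # 26 chars x 5 bits = 130 bits; the top 2 bits are zero
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


def ulid_timestamp_ms(ulid):
    """Decode the millisecond timestamp from the first 10 characters."""
    ts = 0
    for ch in ulid[:10]:
        ts = ts * 32 + CROCKFORD.index(ch)
    return ts
```

Two IDs minted in different milliseconds sort in creation order as plain strings, which is what a clustering key needs. Within the same millisecond this sketch orders IDs randomly; a monotonic variant would be needed if strict per-millisecond ordering matters.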
IC7 signal: "This is one of the trickiest UX problems in streaming chat — and the answer is different depending on whether we're using SSE or WebSocket. I chose SSE specifically because it handles this more gracefully."
id: {msg_id}:{token_offset}, data: {token}
stream:{msg_id}).
Last-Event-ID: {msg_id}:47
Key design elements:
stream:{message_id} with 5-minute TTL. This is cheap — a typical response is ~500 tokens × 5 bytes = 2.5 KB per stream.
IC7 signal: "In a 1:1 chat with an LLM, ordering is simpler than in a multi-party system — but there are still real edge cases I want to address."
Why ordering is (mostly) simpler here: In a Claude 1:1 chat, the conversation is strictly turn-based: user sends → Claude responds → user sends. There's natural serialization because Claude can't respond until the user message arrives, and the user typically waits for Claude's response. This removes most of the ordering problems that a multi-party chat has to solve.
Where ordering still matters:
Solution: ULIDs + optimistic concurrency on conversation version:
version counter in PostgreSQL. When the Chat Service processes a user message, it does: UPDATE conversations SET version = version + 1 WHERE conversation_id = ? AND version = ?. If the CAS fails (concurrent write from another device), retry with conflict resolution.
IC7 signal: "This is the wide partition problem, and it's one of the first things I think about when I choose conversation_id as my partition key. Let me walk through why I'm not worried for our use case, but also what I'd do if I were."
The math:
Why this is unlikely but addressable:
Bucketed partitioning strategy:
| Column | Type | Key | Notes |
|---|---|---|---|
| conversation_id | UUID | Partition Key (1/2) | Conversation identifier |
| bucket | INT | Partition Key (2/2) | bucket = message_sequence / 1000 |
| message_id | ULID | Clustering Key ↓ | Time-ordered within bucket |
| ...same columns as before... | | | |
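The bucket rule in the table (`bucket = message_sequence / 1000`) can be sketched as integer division. The helper names below are hypothetical, and the sketch assumes `message_sequence` is a 0-based, per-conversation counter. The second helper shows which buckets a "latest N messages" read would touch.

```python
BUCKET_SIZE = 1000  # messages per bucket, taken from the table above


def bucket_for(message_sequence: int) -> int:
    """Bucket number for a 0-based message sequence (integer division)."""
    return message_sequence // BUCKET_SIZE


def buckets_for_latest(message_count: int, limit: int) -> list[int]:
    """Buckets to read, newest first, to fetch the latest `limit` messages."""
    if message_count <= 0 or limit <= 0:
        return []
    newest = bucket_for(message_count - 1)
    oldest = bucket_for(max(message_count - limit, 0))
    return list(range(newest, oldest - 1, -1))
```

For a conversation with 2,050 messages, the latest 50 live entirely in bucket 2, while the latest 100 span buckets 2 and 1, so most reads still hit a single partition.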
Query pattern changes:
conversation_buckets counter in PostgreSQL tracks the current bucket number, so we always know the latest bucket without scanning.
But here's my IC7 judgment call: I would NOT implement bucketing on day one. I'd ship with simple conversation_id partitioning, set a monitoring alert at partition size > 50 MB, and add bucketing when we see real data hit that threshold. Premature optimization of a partition scheme that affects every query is worse than a migration later. The 99.9th percentile conversation will never hit this limit.
IC7 signal: "Search is fundamentally a different access pattern from chat retrieval, and I would never bolt it onto Cassandra. This is a separate indexing pipeline."
Elasticsearch index schema:
messages-2026-05
user_id — all of one user's messages land on the same shard, so search is a single-shard query
user_id (keyword), conversation_id (keyword), role (keyword), content (text, analyzed), created_at (date), model (keyword)
{"bool": {"filter": [{"term": {"user_id": "..."}}], "must": [{"match": {"content": "recipe"}}]}}
Search latency target: P99 < 200ms. This is achievable with user_id routing (single-shard) and warm caches for active users.
Why not Cassandra SASI or SAI? Jake may probe this directly given his DataStax background. SAI (Storage-Attached Indexing) in Cassandra 5.0+ supports basic text matching, but its indexes are local to each node's data: with conversation_id as the partition key, searching across all of a user's conversations becomes a scatter-gather query across the cluster. Elasticsearch gives us cross-partition full-text search with relevance scoring, which is what users actually need.
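As a rough illustration of the user-scoped search described in the index schema, the request could be assembled as a plain dictionary. This is a sketch under assumptions: `build_search_request` is a hypothetical helper, the monthly index name follows the `messages-2026-05` pattern, and the bool clause is wrapped in the `query` key that an Elasticsearch search body expects. No client library is called.

```python
from datetime import date


def build_search_request(user_id: str, text: str, month: date) -> dict:
    """Build index name, routing value, and query body for one user's search."""
    return {
        # One index per month, e.g. messages-2026-05
        "index": f"messages-{month:%Y-%m}",
        # Routing by user_id keeps the search on a single shard
        "routing": user_id,
        "body": {
            "query": {
                "bool": {
                    # filter: exact match, not scored, cacheable
                    "filter": [{"term": {"user_id": user_id}}],
                    # must: analyzed full-text match, contributes to relevance
                    "must": [{"match": {"content": text}}],
                }
            }
        },
    }
```

Keeping `user_id` in the `filter` clause rather than `must` means it restricts results without affecting the relevance score of the text match.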
safety_result: "filtered" and excluded from search results by default. This prevents users from searching for and retrieving content that was blocked by the safety pipeline.
IC7 signal: "Safety is the reason Anthropic exists. I'm going to design this as a synchronous, mandatory pipeline — not an optional middleware. Every message, every token, every time."
Gate 1 — Input Classification (pre-inference):
Gate 2 — Output Classification (during + post-inference):
IC7 signal: "Billing accuracy is a trust issue, not just an accounting issue. If we overcharge, we lose customers. If we undercharge, we lose money on GPU inference — the most expensive line item in the company. I'll design for exact counting, not estimation."
Token counting architecture:
Edge cases that matter for billing:
IC7 signal: "I'll be precise about what consistency level I'm choosing at each layer and why — because 'strong consistency' is not a useful answer without specifying the scope."
Consistency guarantees by layer:
| Operation | Consistency | Implementation | Why |
|---|---|---|---|
| Write message | LOCAL_QUORUM | Cassandra: RF=3, acknowledged by 2 of 3 replicas | Durability without cross-DC latency |
| Read own messages (same user) | Read-after-write | LOCAL_QUORUM reads; or route to same node (sticky sessions) | User must see their own message immediately |
| Read on other device | Eventual (~2s) | LOCAL_ONE read + device sync via Kafka consumer | Acceptable lag for multi-device; consistent within seconds |
| Conversation list | Read-after-write | PostgreSQL with synchronous_commit | User creates conversation, must see it in list immediately |
| Search index | Eventual (~5s) | Kafka → Elasticsearch indexer consumer lag | Users accept search being slightly behind. No business-critical real-time search. |
| Token count / billing | Exact, eventually consistent | Kafka → ClickHouse aggregation, hourly flush | Must be accurate but not real-time. Billing is hourly. |
The critical insight: In a 1:1 chat with an LLM, the strongest consistency requirement is read-your-own-writes for the active user on the active device. Everything else can be eventual. This is because the user is the only writer to their conversation (Claude's responses are system-generated through the same pipeline), and the most important UX guarantee is "I see what I just typed."
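The read-your-own-writes guarantee in the table follows from quorum overlap: a read is guaranteed to see the latest acknowledged write when the number of write acknowledgements plus the number of replicas read exceeds the replication factor. A small sketch (the function names are illustrative):

```python
def quorum(rf: int) -> int:
    """Replicas needed for a quorum at replication factor `rf`."""
    return rf // 2 + 1


def read_sees_latest_write(rf: int, write_acks: int, read_replicas: int) -> bool:
    """True when every read set must intersect every write set (R + W > RF)."""
    return write_acks + read_replicas > rf
```

With RF=3, LOCAL_QUORUM writes and LOCAL_QUORUM reads give 2 + 2 > 3, so the active device always sees its own message. A LOCAL_ONE read gives 2 + 1 = 3, which does not overlap, and that is why the other-device row is listed as eventual.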
Cassandra consistency levels:
IC7 signal: "Multi-device sync in a chat system is often over-engineered. For a 1:1 LLM chat, the sync model is simpler than WhatsApp or iMessage because there's only one conversation partner (Claude), and responses are generated server-side."
How sync works:
Last-Event-ID. The Connection Manager replays missed events from Kafka (or a Redis catchup buffer if within 5 min, or queries Cassandra directly for older messages).
Conflict resolution (user types on both devices):
Interviewers: Anton Paquin & Bharat Mediratta (Google Distinguished Engineer / GWS, ex-CTO Dropbox, Coinbase Fellow)
Friday May 22, 2:00 – 2:55 PM PT
IC7 opening (first 30 seconds): "I'll design the public-facing API that serves Claude to external developers. This is Anthropic's revenue engine — every dollar flows through this API. I'll scope to: multi-model serving (Haiku, Sonnet, Opus), synchronous + streaming responses, multi-tenant isolation with per-org rate limiting, and a target of P99 time-to-first-token under 800ms for Sonnet. I'll explicitly call out where safety constraints shape the architecture differently from a generic serving system."
Requirements I'm driving:
Back-of-envelope:
Key architectural decisions:
claude-4-sonnet-20260501), a weight file URI, a configuration hash, and a safety classifier version. Rollbacks are instant: just update the routing table to point to the previous version.
IC7 signal: "I'll design the external API with the same rigor as a public specification — versioned, idempotent, with explicit error contracts. Bharat built GWS; he knows what a production API contract looks like."
API versioning strategy:
Anthropic-Version: 2026-05-01), not URL-based. This allows us to make backward-compatible changes without forcing URL migrations. Breaking changes require a new version date.
claude-4-sonnet-20260501. The alias claude-4-sonnet (no date) resolves to the latest — but we log which exact version was used in the response, so clients can reproduce results.
Idempotency:
X-Request-ID is the idempotency key. If we receive a duplicate within 24 hours, we return the cached response without re-running inference. The dedup store is Redis with 24h TTL, keyed by {org_id}:{request_id}.
IC7 signal: "This is the hardest problem in a multi-tenant serving system, and it's one Bharat lived at Google. The naive approach — per-org rate limits at the gateway — doesn't work because the bottleneck isn't HTTP requests, it's GPU-seconds. I need to solve fairness at the GPU scheduling layer."
Three layers of isolation:
Layer 1 — Admission Control (API Gateway):
429 Too Many Requests with retry_after header. This is the coarse filter — keeps volume within contracted limits.
Layer 2 — Weighted Fair Queuing (Tenant Scheduler):
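A minimal sketch of weighted fair queuing for the tenant scheduler, using the self-clocked variant (virtual time advances to the finish tag of the request being served). Everything here is an assumption made for illustration: the class name, the per-tenant weights, and the use of estimated GPU-seconds as the request cost. It is not a description of a production scheduler.

```python
import heapq
import itertools


class WeightedFairQueue:
    """Self-clocked fair queue: serve the request with the smallest finish tag."""

    def __init__(self, weights):
        self.weights = weights      # tenant -> relative share (hypothetical values)
        self.virtual_time = 0.0     # finish tag of the last dequeued request
        self.last_finish = {}       # tenant -> finish tag of its newest request
        self.heap = []
        self.seq = itertools.count()  # tie-breaker keeps FIFO order on equal tags

    def enqueue(self, tenant, request, cost):
        """`cost` is the estimated GPU-seconds for this request (an assumption)."""
        start = max(self.virtual_time, self.last_finish.get(tenant, 0.0))
        finish = start + cost / self.weights[tenant]
        self.last_finish[tenant] = finish
        heapq.heappush(self.heap, (finish, next(self.seq), tenant, request))

    def dequeue(self):
        finish, _, tenant, request = heapq.heappop(self.heap)
        self.virtual_time = finish
        return tenant, request
```

With weights `{"a": 1, "b": 3}` and equal-cost requests queued by both tenants, tenant `b` receives about three of every four scheduling slots, while tenant `a` is never starved because its finish tags keep advancing.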
Layer 3 — Worker-Level Isolation:
IC7 signal: "I'll use Little's Law to derive GPU count from SLO requirements, not just throughput. The constraint isn't 'can we serve all requests' — it's 'can we serve them within our latency SLO while maintaining headroom for bursts.'"
Little's Law: L = λ × W
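A worked sketch of how Little's Law turns an SLO into a GPU count. Here L is the average number of in-flight requests, λ the arrival rate, and W the average time a request spends in the system. All numbers in the usage note are hypothetical placeholders, as are the function names and the per-GPU concurrency and utilization parameters.

```python
import math


def in_flight_requests(arrival_rate_rps: float, avg_latency_s: float) -> float:
    """Little's Law: L = lambda * W."""
    return arrival_rate_rps * avg_latency_s


def gpus_needed(arrival_rate_rps: float, avg_latency_s: float,
                concurrent_per_gpu: float, target_utilization: float) -> int:
    """GPUs required to hold L in-flight requests with burst headroom.

    concurrent_per_gpu: requests one GPU can serve concurrently within the SLO.
    target_utilization: fraction of capacity used at steady state (e.g. 0.7).
    """
    raw = in_flight_requests(arrival_rate_rps, avg_latency_s) / concurrent_per_gpu
    return math.ceil(raw / target_utilization)
```

For example, at 1,000 requests/s with a 5 s average end-to-end latency, L = 5,000 in-flight requests. If one GPU sustains 50 concurrent requests within the latency SLO, that is 100 GPUs at full utilization, or 143 when planning for 70% utilization to leave room for bursts.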
Capacity planning methodology (what Bharat will want to hear):
IC7 signal: "GPU OOM is the most common inference failure mode, and it's fundamentally different from CPU OOM because GPU memory is shared across batched requests. A single OOM can kill an entire batch."
Prevention mechanisms (most OOMs should never happen):
KV_size = 2 × num_layers × hidden_dim × (input_tokens + max_tokens) × sizeof(float16). If estimated total exceeds 85% of GPU memory, reject admission to this batch.
IC7 signal: "Model rollouts are fundamentally harder than code rollouts because regressions are subtle — a model might generate slightly worse responses without any errors or latency changes. I need both automated and human-in-the-loop validation."
model_alias → previous_version_id. Takes effect in < 10 seconds (next routing table refresh). No GPU restart needed — both model versions are loaded in parallel during rollout. Old version workers are kept warm for 72 hours post-completion.
The hard part — detecting quality regression:
IC7 signal: "Rate limiting at an inference API is multi-dimensional. I can't just count requests; I need to count tokens, because a 100K-token Opus request consumes on the order of 1000x more resources than a 100-token Haiku request by token count alone, before accounting for the larger model."
ratelimit:{org_id}:{model}:{dimension} → {tokens_remaining, window_start}
Implementation — Sliding Window + Token Bucket hybrid:
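A minimal in-memory sketch of the token-bucket half of this scheme, keyed the same way as the `ratelimit:{org_id}:{model}:{dimension}` entry. It is an illustration under assumptions: the class name, the `input_tokens` dimension label, and the capacity and refill values are hypothetical, the sliding-window part is omitted, and a real deployment would need the check-and-decrement to be atomic in the shared store.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    tokens_remaining: float
    window_start: float  # timestamp (seconds) of the last refill


class TokenBucketLimiter:
    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.buckets = {}

    def allow(self, org_id: str, model: str, dimension: str,
              cost: float, now: float) -> bool:
        """Charge `cost` tokens if available; refill lazily based on elapsed time."""
        key = f"ratelimit:{org_id}:{model}:{dimension}"
        bucket = self.buckets.setdefault(key, Bucket(self.capacity, now))
        elapsed = max(0.0, now - bucket.window_start)
        bucket.tokens_remaining = min(
            self.capacity, bucket.tokens_remaining + elapsed * self.refill_per_sec
        )
        bucket.window_start = now
        if cost <= bucket.tokens_remaining:
            bucket.tokens_remaining -= cost
            return True
        return False
```

Passing the clock in as `now` keeps the sketch deterministic. Charging by token cost rather than by request is what lets one limiter treat a large request and a small one proportionally.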
Rate limit headers in every response:
IC7 signal: "Continuous batching is the single biggest throughput optimization in LLM serving, and understanding the prefill/decode asymmetry is essential to capacity planning. Let me walk through the GPU timeline."
Prefill vs. Decode — fundamentally different GPU workloads:
| Property | Prefill | Decode |
|---|---|---|
| What it does | Process all input tokens in parallel. Build KV cache. | Generate one output token per step. Read KV cache. |
| GPU bottleneck | Compute-bound (matrix multiplies on all input tokens) | Memory-bandwidth-bound (read model weights + KV cache per token) |
| Latency | ~100-500ms for 2K tokens (depends on model) | ~6-20ms per token |
| Batching benefit | High — amortizes weight loading across inputs | Very high — many decode requests share one weight read |
| Memory usage | Spikes during attention computation | Grows linearly with generated tokens (KV cache) |
Key insight for continuous batching: Since decode is memory-bandwidth-bound, adding more requests to a decode batch costs little additional time: the GPU reads the model weights once per step regardless of batch size, and only the per-request KV cache reads grow. This holds until the batch becomes compute-bound or the KV cache fills GPU memory. It is why continuous batching can achieve 10-30x better throughput than static batching: the GPU is never waiting for a full batch to form, and new requests slot into the decode batch immediately after their prefill.
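The batching argument can be checked with a back-of-envelope model in which one decode step takes roughly the time needed to read the weights plus every sequence's KV cache from GPU memory. This is a simplified sketch that ignores compute and scheduling overhead; the function names and every number in the usage note are hypothetical.

```python
def decode_step_seconds(weight_bytes: float, kv_bytes_per_seq: float,
                        batch_size: int, mem_bandwidth_bps: float) -> float:
    """Time for one decode step if it is purely memory-bandwidth-bound."""
    bytes_read = weight_bytes + batch_size * kv_bytes_per_seq
    return bytes_read / mem_bandwidth_bps


def decode_tokens_per_second(weight_bytes: float, kv_bytes_per_seq: float,
                             batch_size: int, mem_bandwidth_bps: float) -> float:
    """Each step emits one token per sequence in the batch."""
    step = decode_step_seconds(weight_bytes, kv_bytes_per_seq,
                               batch_size, mem_bandwidth_bps)
    return batch_size / step
```

With 140 GB of weights, 0.5 GB of KV cache per sequence, and 3 TB/s of memory bandwidth, a batch of 1 takes about 47 ms per step and a batch of 32 about 52 ms, so throughput rises from roughly 21 to roughly 615 tokens/s. The step time grows by about 11% while throughput grows by about 29x, which is the asymmetry the paragraph above describes.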
Scheduling optimization — chunked prefill:
IC7 signal: "I'll walk through a systematic debugging framework, not just guess. The right approach is: scope the problem, check the dashboards, narrow to a component, and identify root cause. This is how I'd lead an incident."
Step 1 — Scope:
Step 2 — Component-by-component latency breakdown:
Step 3 — Most likely causes (in order of probability):
IC7 signal: "GPU cost is the dominant line item — likely 60-70% of COGS for an inference company. I'll walk through the optimization hierarchy, from biggest impact to smallest, with the SLO trade-off for each."
Cost optimization hierarchy (ordered by impact):
| Optimization | Savings | SLO Impact | Complexity |
|---|---|---|---|
| 1. KV cache reuse (prompt caching) | 30-50% GPU time for repeat prefixes | Improves TTFT (skip prefill) | Medium |
| 2. Continuous batching | 10-30x throughput vs. static | Neutral (improves utilization) | High (already assumed) |
| 3. Speculative decoding | 2-3x decode speed | Improves per-token latency | High |
| 4. Model distillation / tiering | 10x cost difference (Haiku vs Opus) | Quality trade-off (user choice) | Low (product decision) |
| 5. Spot/preemptible GPUs for batch | 60-70% on batch workloads | No SLO (batch is async) | Medium |
| 6. Quantization (INT8/FP8) | ~2x memory efficiency | Minimal quality loss (<0.5%) | Medium |
| 7. Right-sizing max_tokens | 5-10% KV cache savings | Neutral | Low |
The biggest lever — KV cache reuse:
Speculative decoding:
What to say in the first 30 seconds of each question to signal staff-level judgment.
"The goal is not to answer every question perfectly.
The goal is to show you'd be the person they trust to own the design."
Good luck tomorrow. You know this material.