Utkrusht Backend Capacity · Decision Record

Can we run 350 assessments at the same time?

A capacity analysis of two very different concurrent workloads — proctored tests and Fargate-analysed tasks — mapping where the load lands, what saturates first, and the choices we made about each.

16 Jun 2026 | MCQ + Text + Audio, with proctoring | uprod-be-1 · FastAPI + Flask + TUS (shared) | T1–T3 ready · T6 deferred

Verdict — Tests

Reachable on one server after 3 fixes — two code, one config.

Verdict — Tasks

Conducting 350 is free; finishing them drains over hours, capped by Fargate.

Biggest surprise

Proctoring video — not answers — is what saturates FastAPI first.

4

FastAPI workers (async event loops)

24→80

Background LLM threads (proposed)

~117/s

TUS post-finish S3 copies at peak (blocking)

1

TUS instance, sharing the box (SPOF)

12/16

Fargate analyses (now / ceiling)

01 · Two workloads, two physics

The platform mixes two concurrency models that share almost nothing. We analysed them separately because their bottlenecks, limits, and fixes don't overlap.

Tests · synchronous, in-browser

Driven live through FastAPI + TUS

A candidate answers up to 10 questions while their webcam streams to TUS the whole time. Load is real-time and lands on the shared box.

Answers → FastAPI /v2/…/answers (direct)
Webcam chunks → TUS → S3 → /tusd/hooks back to FastAPI
Lifecycle → Next.js proxy → Flask
No polling, no heartbeat — status is event-driven

Tasks · async, offloaded

Analysed after submit, on Fargate

The candidate works in a sandbox (cheap for us). The heavy analysis runs later in an ephemeral container, off our servers entirely.

Conducting 350 tasks → negligible backend load
Finishing → Airflow queues, Fargate runs 12 at a time
Gemini MAP/REDUCE per screenshare segment
Proctored tasks add a 2nd TUS stream (screenshare)

02 · Where the load actually lands

One proctored test candidate opens three concurrent paths. FastAPI, Flask, and TUS all live on the same box and compete for its CPU and RAM — the crowding is the story.

Diagram legend: FastAPI (9000) · Flask / Next.js proxy · TUS + S3 video · Async analysis (Airflow / Fargate) · Blocking hot-spot

03 · What one proctored candidate costs

Answers are cheap and bursty. The webcam is the relentless background load — every 3 seconds, for the whole test.

~20 /min

webcam chunks (3s timeslice), each its own TUS upload

675 KB

per chunk · VP8 1.8 Mbps + Opus 96 kbps

~0.9 Mbps

sustained uplink → ~315 Mbps across 350

~25

total HTTP requests per ~30-min test (excl. video)
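The headline rates follow directly from the per-candidate figures above. A quick sketch of that arithmetic, using only numbers stated in this record (the variable names are illustrative):

```python
CANDIDATES = 350      # concurrent proctored candidates
TIMESLICE_S = 3       # webcam chunk length, in seconds
UPLINK_MBPS = 0.9     # sustained uplink per candidate

chunks_per_min = 60 / TIMESLICE_S              # per candidate
uploads_per_s = CANDIDATES / TIMESLICE_S       # TUS uploads across all candidates
total_uplink_mbps = CANDIDATES * UPLINK_MBPS   # aggregate inbound video

print(f"{chunks_per_min:.0f} chunks/min per candidate")
print(f"{uploads_per_s:.0f} TUS uploads/s at {CANDIDATES} candidates")
print(f"{total_uplink_mbps:.0f} Mbps aggregate uplink")
```

The uploads-per-second figure is the same ~117/s that reaches `/tusd/hooks` at peak, since each finished chunk fires one hook.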

Question type	How it's graded	LLM work	Cost to concurrency
MCQ	Synchronous, inline: index vs. answer snapshot	None	Minimal: one DB read+write
Text	Background thread via `schedule_once()`	4 sequential gpt-5.1 calls	High: holds a thread 40–100s
Audio	Background thread: STT first, then same path	STT (3–15s) + the same 4-call analysis	Highest: STT on top of Text

get_analysis() (shared/ai/ai_util.py:337) is shared by Text and Audio — only the input string differs (typed answer vs. transcript). MCQ never calls it.

04 · The bottleneck ledger

Six pressure points for tests, plus the one by-design limit for tasks. Severity is about what breaks first at 350, not how hard it is to fix.

Severity: Critical (saturates first) · High (delays results) · Medium (headroom risk) · By design

#	Where	Finding	Severity	Fix
B1	Tests · FastAPI	Sync Supabase calls inside the `async def` answer handler freeze the event loop	Critical	T1
B2	Tests · TUS hook	`/tusd/hooks` runs a blocking `s3.copy_object` in `async def`, ~117×/s at peak	Critical	T2
B3	Tests · bg pool	24-thread pool caps Text/Audio analysis throughput	High	T3
B4	Tests · Flask	4 sync workers saturate in the test-end completion burst	Medium	T4
B5	Tests/Tasks · TUS	Single TUS instance shares uprod-be-1 RAM/CPU; SPOF	Medium	T5
B6	Tests · Airflow	Post-test proctor-video ffmpeg runs on the Airflow box, no `max_active_runs`	Medium	T5
–	Tasks · Fargate	Capped at 12 concurrent analyses (ceiling 16 = 64 vCPU ÷ 4); Gemini RPM/TPM limits surface as 429s	By design	levers

05 · Spotlight — the hook that saturates the loop

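B2 is the finding that reframed the whole analysis. Each 3-second webcam chunk triggers a blocking S3 copy on the very workers that serve answers. At 350 candidates that is ~117 copies per second, enough to consume most or all of the event-loop capacity before any answer traffic arrives.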

FastAPI event-loop demand vs. capacity — worker-seconds of blocking work per wall-second

4 workers = 4.0 worker-seconds of capacity per second (the green line). Anything past it queues.

Even on the optimistic 30 ms-per-copy estimate, post-finish hooks eat ~88% of all four event loops before a single answer is processed. At 50 ms they're oversubscribed. A fresh boto3.client() is also built on every call — no connection reuse. Source: fastapi_service/routers/tus.py:96-107.
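The utilisation figures above come from a simple model: blocking calls per second times seconds per call, divided by the number of workers. The sketch below assumes each copy blocks its event loop for the full duration and that hooks spread evenly across the 4 workers. `loop_utilisation` is a hypothetical helper, not code from the repo.

```python
def loop_utilisation(calls_per_s: float, block_s: float, workers: int) -> float:
    """Fraction of total event-loop capacity consumed by blocking calls."""
    demand = calls_per_s * block_s   # worker-seconds of blocking per wall-second
    return demand / workers          # > 1.0 means requests queue

HOOKS_PER_S = 350 / 3   # one hook per 3 s chunk per candidate, ~117/s
WORKERS = 4             # FastAPI workers, one event loop each

for copy_ms in (30, 50):
    u = loop_utilisation(HOOKS_PER_S, copy_ms / 1000, WORKERS)
    status = "oversubscribed" if u > 1 else "within capacity"
    print(f"{copy_ms} ms per copy: {u:.2f} of capacity ({status})")
```

Under this model, demand crosses 1.0 at roughly 34 ms per copy (4 ÷ 117), so the outcome depends on S3 latency the application does not control.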

06 · Analysis throughput — the thread-pool dial

Text/Audio answers queue behind 24 background threads. The answer returns "queued" instantly, so candidates don't feel it — but recruiter visibility lags. Widening the pool is a one-line config change.

Time to clear 1,750 Text/Audio answers (350 students × ~5 each)

~60s average per answer · all submitted near-simultaneously (worst case)

Threads are >95% blocked on network IO (the GIL releases during socket calls), so their waits overlap and they run effectively in parallel; the cost is ~8 MB RAM each. BACKGROUND_MAX_WORKERS=20 gives 4 workers × 20 threads = 80 slots. Watch OpenAI TPM (gpt-5.1 reasoning can spike past 400k tokens per minute); RPM (~320, from 80 slots × 4 calls per ~60 s answer) stays within Tier-2's 500.
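A back-of-envelope model for the drain time and request rate, assuming every slot stays busy and each answer takes the ~60 s average. The helper names are illustrative. The results land near the 70 and ~22 minute figures used in this record.

```python
def drain_minutes(answers: int, secs_per_answer: float, slots: int) -> float:
    """Worst-case minutes to clear a backlog that arrives all at once."""
    return answers * secs_per_answer / slots / 60

def llm_rpm(slots: int, calls_per_answer: int, secs_per_answer: float) -> float:
    """Steady-state LLM requests per minute when every slot is busy."""
    return slots * calls_per_answer * 60 / secs_per_answer

ANSWERS = 350 * 5        # ~5 Text/Audio answers per candidate
SECS_PER_ANSWER = 60     # average analysis time per answer
CALLS_PER_ANSWER = 4     # sequential gpt-5.1 calls in get_analysis()

for slots in (24, 80):
    minutes = drain_minutes(ANSWERS, SECS_PER_ANSWER, slots)
    rpm = llm_rpm(slots, CALLS_PER_ANSWER, SECS_PER_ANSWER)
    print(f"{slots} slots: ~{minutes:.0f} min to drain, ~{rpm:.0f} RPM")
```

The same model shows why RPM grows with the pool: more slots mean more in-flight calls per minute, so the Tier-2 limit should be rechecked before widening the pool again.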

07 · Tasks — conducting is free, finishing is a queue

Running 350 tasks at once barely touches our servers. The constraint is the analysis queue, deliberately capped so we never exceed the AWS Fargate vCPU quota.

Fargate analysis drain at 350 submissions

12 concurrent containers · ~35 min each
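The drain can be approximated in waves: 12 containers at ~35 min each means 350 submissions need ceil(350 / 12) = 30 waves. This is a rough sketch that assumes containers start and finish in lockstep and ignores queueing and container start-up time. `fargate_drain_hours` is a hypothetical helper.

```python
import math

def fargate_drain_hours(submissions: int, concurrency: int, minutes_each: float) -> float:
    """Hours to clear the analysis queue, modelled as full waves of containers."""
    waves = math.ceil(submissions / concurrency)
    return waves * minutes_each / 60

for concurrency in (12, 16, 32):
    hours = fargate_drain_hours(350, concurrency, 35)
    print(f"{concurrency} concurrent: ~{hours:.1f} h to drain 350 submissions")
```

At the current cap of 12 this gives 17.5 hours, and about 12.8 hours at the ceiling of 16.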

Raising task throughput

Three levers — move together

Bumping one without the others just trips the next ceiling.

1 · DAG cap	`MAX_CONCURRENT_ANALYSES` in the Fargate DAG (code + deploy)
2 · AWS	Fargate vCPU 64 → 128 ⇒ ceiling 16 → 32
3 · Gemini	Raise API quota so higher concurrency doesn't 429
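The three levers act like a chain of limits: effective concurrency is the smallest of them. A minimal sketch, assuming 4 vCPU per analysis (from the 64 ÷ 4 ceiling above). The `gemini_cap` parameter is hypothetical, since this record gives no number for how many concurrent analyses the Gemini quota sustains.

```python
from typing import Optional

VCPU_PER_ANALYSIS = 4  # from the record: ceiling 16 = 64 vCPU / 4

def effective_concurrency(dag_cap: int, vcpu_quota: int,
                          gemini_cap: Optional[int] = None) -> int:
    """Concurrent analyses actually achievable: the tightest of the three limits."""
    limits = [dag_cap, vcpu_quota // VCPU_PER_ANALYSIS]
    if gemini_cap is not None:   # hypothetical: analyses the API quota can sustain
        limits.append(gemini_cap)
    return min(limits)

print(effective_concurrency(dag_cap=12, vcpu_quota=64))    # today's settings
print(effective_concurrency(dag_cap=32, vcpu_quota=64))    # DAG raised alone
print(effective_concurrency(dag_cap=32, vcpu_quota=128))   # DAG and AWS raised together
```

Raising only the DAG cap to 32 still yields 16, because the vCPU quota becomes the binding limit.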

⚠ Failure mode at scale

When Gemini returns 429s, every segment fails, the container writes the "All segments failed analysis" sentinel, and alert_on_analysis_failure pings Discord. This is already wired, and it is the signal to widen quota.

08 · The decisions — what we'll do, and what waits

Three classes of work: ship-now code fixes (T1–T2), config dials (T3–T4), and headroom / structural choices (T5). The deepest lever (T6) is gated behind an eval harness.

ID	Decision	Type	Effect
T1	Wrap MCQ/Text handlers in `asyncio.to_thread(…)` so sync Supabase calls leave the event loop	Code · now	Unblocks B1 — MCQ ack drops to <200 ms under load
T2	Kill the `/tusd/hooks` blocking S3 copy — set object metadata at upload time, or offload + reuse one boto3 client	Code · now	Removes B2 — ~117 blocking calls/s off the loops
T3	`BACKGROUND_MAX_WORKERS = 20` (24 → 80 analysis slots)	Config	Backlog 70 → ~22 min
T4	`NUM_OF_FLASK_WORKERS = 8`	Config	Absorbs the test-end completion burst
T5	Give TUS ≥4 GB headroom + alert now; cap the proctor-video DAG; later split TUS off the box & offload its ffmpeg to Fargate	Headroom	Removes B5/B6 risk; the structural path past 500
T6	Parallelise / use a cheaper model inside the 4-call `get_analysis()`	Deferred	2–6× per-answer speedup — only after evals
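T1 and T2 rest on the same mechanism: a synchronous call inside `async def` stalls every request on that event loop, while `asyncio.to_thread` moves it to a worker thread. The sketch below demonstrates the difference with `time.sleep` standing in for the Supabase query or `s3.copy_object`. All function names are illustrative, and it is not the service's actual handler code.

```python
import asyncio
import time

def blocking_call(seconds: float) -> str:
    """Stand-in for a sync Supabase query or S3 copy (assumption)."""
    time.sleep(seconds)
    return "done"

async def handler_blocking() -> str:
    # Current shape: the sync call runs on the event loop and stalls it.
    return blocking_call(0.2)

async def handler_offloaded() -> str:
    # T1/T2 shape: the sync call runs in a worker thread; the loop stays free.
    return await asyncio.to_thread(blocking_call, 0.2)

async def max_tick_gap(handler) -> float:
    """Run handler while a 10 ms ticker records the longest event-loop stall."""
    worst = 0.0
    stop = False

    async def ticker() -> None:
        nonlocal worst
        last = time.perf_counter()
        while not stop:
            await asyncio.sleep(0.01)
            now = time.perf_counter()
            worst = max(worst, now - last)
            last = now

    task = asyncio.create_task(ticker())
    await asyncio.sleep(0.02)   # let the ticker start
    await handler()
    stop = True
    await task
    return worst

async def main() -> tuple:
    return (await max_tick_gap(handler_blocking),
            await max_tick_gap(handler_offloaded))

blocked, offloaded = asyncio.run(main())
print(f"longest stall, blocking handler:  {blocked * 1000:.0f} ms")
print(f"longest stall, offloaded handler: {offloaded * 1000:.0f} ms")
```

The blocking handler stalls the loop for roughly the full 200 ms, while the offloaded one leaves it ticking at close to its 10 ms interval. `asyncio.to_thread` requires Python 3.9 or later and uses the loop's default thread pool, so that pool's size becomes the new limit to watch.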

🔒 Why T6 waits — the gate we set

Changing the model or call structure risks degrading scoring quality in ways we can't see without ground truth. T6 is blocked until: (1) an eval harness exists with representative MCQ / Text / Audio answers across proficiency levels, (2) baseline scores are captured against today's gpt-5.1 pipeline, and (3) any change is A/B-tested on that set with an acceptable quality delta.

09 · When one server stops being enough

350 is reachable on uprod-be-1 after T1–T3. These are the tripwires that say "add capacity" — and what to split off first.

> 500 concurrent

Beyond the headroom T1–T5 buy → add a stateless FastAPI replica behind Coolify's LB

Universal proctoring

Every candidate streams webcam → split TUS off the box first — it contends soonest

Zero analysis lag

If "eventually consistent" recruiter views aren't acceptable → more workers / replicas

> 70% baseline

uprod-be-1 CPU or RAM above 70% before peak (watch in BetterStack)

Scaling paths differ by workload: FastAPI is stateless → horizontal replica. Tasks → coordinated quota raise (DAG + AWS vCPU + Gemini), not an app server. Post-test proctor video → offload ffmpeg to Fargate, mirroring the task pipeline that already proves the pattern.