Utkrusht Backend Capacity · Decision Record

Can we run 350 assessments at the same time?

A capacity analysis of two very different concurrent workloads — proctored tests and Fargate-analysed tasks — mapping where the load lands, what saturates first, and the choices we made about each.

16 Jun 2026 | MCQ + Text + Audio, with proctoring | uprod-be-1 · FastAPI + Flask + TUS (shared) | T1–T3 ready · T6 deferred
Verdict — Tests
Reachable on one server after 3 fixes — two code, one config.
Verdict — Tasks
Conducting 350 is free; finishing them drains over hours, capped by Fargate.
Biggest surprise
Proctoring video — not answers — is what saturates FastAPI first.
  • 4: FastAPI workers (async event loops)
  • 24→80: background LLM threads (proposed)
  • ~117/s: TUS post-finish S3 copies at peak, blocking
  • 1: TUS instance, sharing the box (SPOF)
  • 12/16: Fargate analyses, now / ceiling
01 · Two workloads, two physics

The platform mixes two concurrency models that share almost nothing. We analysed them separately because their bottlenecks, limits, and fixes don't overlap.

Tests · synchronous, in-browser

Driven live through FastAPI + TUS

A candidate answers up to 10 questions while their webcam streams to TUS the whole time. Load is real-time and lands on the shared box.

  • Answers → FastAPI /v2/…/answers (direct)
  • Webcam chunks → TUS → S3 → /tusd/hooks back to FastAPI
  • Lifecycle → Next.js proxy → Flask
  • No polling, no heartbeat — status is event-driven
Tasks · async, offloaded

Analysed after submit, on Fargate

The candidate works in a sandbox (cheap for us). The heavy analysis runs later in an ephemeral container, off our servers entirely.

  • Conducting 350 tasks → negligible backend load
  • Finishing → Airflow queues, Fargate runs 12 at a time
  • Gemini MAP/REDUCE per screenshare segment
  • Proctored tasks add a 2nd TUS stream (screenshare)
02 · Where the load actually lands

One proctored test candidate opens three concurrent paths. FastAPI, Flask, and TUS all live on the same box and compete for its CPU and RAM — the crowding is the story.

[Diagram: load paths on the shared host. Legend: FastAPI (9000); Flask / Next.js proxy; TUS + S3 video; async analysis (Airflow / Fargate); blocking hot-spot. The candidate browser opens 3 live paths (answers, webcam at ~20 chunks/min, lifecycle) into the shared host uprod-be-1: FastAPI with 4 workers (async event loops) plus 24→80 background LLM threads; TUS, 1 instance, streaming 5 MB parts to S3; Flask with 4 sync workers, 1 request each. The S3 video post-finish hook runs a blocking s3.copy at ~117/s. Off-box and async, on completion: Airflow → ffmpeg + Gemini (tests); Fargate ×12 → Gemini (tasks).]
03 · What one proctored candidate costs

Answers are cheap and bursty. The webcam is the relentless background load — every 3 seconds, for the whole test.

  • ~20/min: webcam chunks (3s timeslice), each its own TUS upload
  • 675 KB: per chunk · VP8 1.8 Mbps + Opus 96 kbps
  • ~0.9 Mbps: sustained uplink → ~315 Mbps across 350
  • ~25: total HTTP requests per ~30-min test (excl. video)
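The fleet-level figures follow directly from the per-candidate ones. A minimal sketch of that arithmetic, using only numbers quoted in this record (the function and constant names are illustrative, not taken from the codebase):

```python
# Figures taken from the record; names are illustrative.
CANDIDATES = 350
CHUNKS_PER_MIN = 20   # one webcam chunk per 3 s timeslice
UPLINK_MBPS = 0.9     # sustained per-candidate uplink


def fleet_chunk_rate(candidates: int, chunks_per_min: float) -> float:
    """TUS uploads (and therefore post-finish hooks) per second, fleet-wide."""
    return candidates * chunks_per_min / 60


def fleet_uplink_mbps(candidates: int, mbps_each: float) -> float:
    """Aggregate sustained uplink arriving at the shared host."""
    return candidates * mbps_each


if __name__ == "__main__":
    print(f"{fleet_chunk_rate(CANDIDATES, CHUNKS_PER_MIN):.1f} uploads/s")
    print(f"{fleet_uplink_mbps(CANDIDATES, UPLINK_MBPS):.0f} Mbps")
```

At 350 candidates this gives roughly 116.7 uploads per second, which is the ~117/s hook rate quoted for the TUS post-finish path.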
Question type | How it's graded | LLM work | Cost to concurrency
MCQ | Synchronous, inline: index vs. answer snapshot | None | Minimal: one DB read+write
Text | Background thread via schedule_once() | 4 sequential gpt-5.1 calls | High: holds a thread 40–100s
Audio | Background thread: STT first, then same path | STT (3–15s) + the same 4-call analysis | Highest: STT on top of Text

get_analysis() (shared/ai/ai_util.py:337) is shared by Text and Audio — only the input string differs (typed answer vs. transcript). MCQ never calls it.
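The three grading paths can be sketched as below. This is a shape-only illustration, not the real code: the bodies of get_analysis() and transcribe(), their signatures, and the pool standing in for schedule_once() are all assumptions.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Stand-in for the background pool behind schedule_once(); size is illustrative.
_pool = ThreadPoolExecutor(max_workers=4)


def transcribe(audio: bytes) -> str:
    """Hypothetical STT step (3-15 s in production); here it only decodes bytes."""
    return audio.decode("utf-8")


def get_analysis(answer_text: str) -> dict:
    """Hypothetical stand-in for the shared 4-call LLM analysis."""
    return {"input": answer_text, "llm_calls": 4}


def grade_mcq(selected_index: int, correct_index: int) -> dict:
    """MCQ is graded inline: no LLM call, no background thread."""
    return {"correct": selected_index == correct_index, "llm_calls": 0}


def grade_text(answer: str) -> Future:
    """Text goes straight to the shared analysis on a background thread."""
    return _pool.submit(get_analysis, answer)


def grade_audio(audio: bytes) -> Future:
    """Audio runs STT first, then the same analysis path on the transcript."""
    return _pool.submit(lambda: get_analysis(transcribe(audio)))
```

Because Text and Audio converge on one function, any change to get_analysis() shifts both question types at once, which is part of why T6 is gated on evals.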

04 · The bottleneck ledger

Six pressure points for tests, plus the one by-design limit for tasks. Severity is about what breaks first at 350, not how hard it is to fix.

Severity legend: Critical = saturates first · High = delays results · Medium = headroom risk · By design

# | Where | Finding | Severity | Fix
B1 | Tests · FastAPI | Sync Supabase calls inside the async def answer handler freeze the event loop | Critical | T1
B2 | Tests · TUS hook | /tusd/hooks runs a blocking s3.copy_object in async def, ~117×/s at peak | Critical | T2
B3 | Tests · bg pool | 24-thread pool caps Text/Audio analysis throughput | High | T3
B4 | Tests · Flask | 4 sync workers saturate in the test-end completion burst | Medium | T4
B5 | Tests/Tasks · TUS | Single TUS instance shares uprod-be-1 RAM/CPU; SPOF | Medium | T5
B6 | Tests · Airflow | Post-test proctor-video ffmpeg runs on the Airflow box, no max_active_runs | Medium | T5
- | Tasks · Fargate | Capped at 12 (ceiling 16 = 64 vCPU ÷ 4); Gemini RPM/TPM limits surface as 429s | By design | levers

05 · Spotlight — the hook that saturates the loop

B2 is the finding that reframed the whole analysis. Each 3-second webcam chunk triggers a blocking S3 copy on the very workers that serve answers. The demand outruns capacity on its own.

FastAPI event-loop demand vs. capacity — worker-seconds of blocking work per wall-second
4 workers = 4.0 worker-seconds of capacity per second (the green line). Anything past it queues.
[Chart: bars on a 0–6.0 worker-seconds axis. Capacity line = 4.0 (4 workers). B2 alone: ~5.85 (117/s × 50 ms). B2, optimistic: ~3.5 (117/s × 30 ms), 88% of capacity. B1 answer handling compounds on the same workers.]

Even on the optimistic 30 ms-per-copy estimate, post-finish hooks eat ~88% of all four event loops before a single answer is processed. At 50 ms they're oversubscribed. A fresh boto3.client() is also built on every call — no connection reuse. Source: fastapi_service/routers/tus.py:96-107.
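The demand figures are a rate multiplied by a service time. A small sketch of the calculation, using the 117/s hook rate and the 30 ms and 50 ms per-copy estimates from this section (function names are illustrative):

```python
WORKERS = 4          # FastAPI workers, one event loop each
HOOKS_PER_SEC = 117  # post-finish hooks at peak, from the record


def blocking_demand(calls_per_sec: float, seconds_per_call: float) -> float:
    """Worker-seconds of blocking work generated per wall-clock second."""
    return calls_per_sec * seconds_per_call


def utilisation(demand: float, workers: int) -> float:
    """Share of total event-loop capacity consumed; above 1.0 means queueing."""
    return demand / workers


if __name__ == "__main__":
    for ms in (30, 50):
        demand = blocking_demand(HOOKS_PER_SEC, ms / 1000)
        share = utilisation(demand, WORKERS)
        print(f"{ms} ms per copy: demand {demand:.2f}, utilisation {share:.0%}")
```

The model treats every millisecond of the copy as time the loop cannot serve anything else, which is what a blocking call inside an async def does.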

06 · Analysis throughput — the thread-pool dial

Text/Audio answers queue behind 24 background threads. The answer returns "queued" instantly, so candidates don't feel it — but recruiter visibility lags. Widening the pool is a one-line config change.

Time to clear 1,750 Text/Audio answers (350 students × ~5 each)
~60s average per answer · all submitted near-simultaneously (worst case)
[Chart: bars on a 0–60 min axis. Today, 24 threads: ~70+ min. Proposed (T3), 80 threads: ~22 min.]

Threads are >95% blocked on network IO, and the GIL is released during socket calls, so they wait in parallel instead of contending for the interpreter; the cost is ~8 MB RAM each. BACKGROUND_MAX_WORKERS=20 → 4×20 = 80 slots. Watch OpenAI TPM (gpt-5.1 reasoning can spike past 400k); RPM (~320) stays within Tier-2's 500.
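Both the backlog times and the RPM figure can be reproduced from the numbers in this section. A rough sketch, assuming every thread stays busy and each answer takes the ~60 s average with 4 LLM calls (function names are illustrative):

```python
def drain_minutes(answers: int, avg_seconds: float, threads: int) -> float:
    """Time to clear a backlog when all threads stay busy the whole time."""
    return answers * avg_seconds / threads / 60


def steady_rpm(threads: int, calls_per_answer: int, avg_seconds: float) -> float:
    """LLM requests per minute when every thread is mid-analysis."""
    return threads * calls_per_answer / avg_seconds * 60


if __name__ == "__main__":
    answers = 350 * 5  # 350 students x ~5 Text/Audio answers each
    for threads in (24, 80):
        print(f"{threads} threads: {drain_minutes(answers, 60, threads):.1f} min")
    print(f"RPM at 80 threads: {steady_rpm(80, 4, 60):.0f}")
```

This is a worst-case model: it assumes all 1,750 answers arrive at once, so real backlogs should clear somewhat faster.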

07 · Tasks — conducting is free, finishing is a queue

Running 350 tasks at once barely touches our servers. The constraint is the analysis queue, deliberately capped so we never exceed the AWS Fargate vCPU quota.

Fargate analysis drain at 350 submissions
12 concurrent containers · ~35 min each
[Chart: 12 containers running, 4 more of headroom to the ceiling of 16.]
350 ÷ 12 × ~35 min ≈ 17 h to fully drain · results land as each finishes
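The drain estimate can be checked two ways: as a continuous division, which gives the ~17 h figure, or in whole waves of containers, which is slightly more pessimistic. A sketch assuming a fixed ~35 min per analysis (function names are illustrative):

```python
import math


def drain_hours_continuous(submissions: int, concurrent: int, minutes_each: float) -> float:
    """Lower bound: treats container slots as perfectly packed."""
    return submissions / concurrent * minutes_each / 60


def drain_hours_waves(submissions: int, concurrent: int, minutes_each: float) -> float:
    """Whole waves: the last, partly filled wave still takes a full run."""
    waves = math.ceil(submissions / concurrent)
    return waves * minutes_each / 60


if __name__ == "__main__":
    for cap in (12, 16):
        low = drain_hours_continuous(350, cap, 35)
        high = drain_hours_waves(350, cap, 35)
        print(f"cap {cap}: {low:.1f} to {high:.1f} h")
```

Raising the cap from 12 to the current ceiling of 16 shortens the drain but still leaves it at roughly half a day for 350 submissions.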
Raising task throughput

Three levers — move together

Bumping one without the others just trips the next ceiling.

1 · DAG cap: MAX_CONCURRENT_ANALYSES in the Fargate DAG (code + deploy)
2 · AWS: Fargate vCPU 64 → 128 ⇒ ceiling 16 → 32
3 · Gemini: raise API quota so higher concurrency doesn't 429
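Why the levers move together can be seen by treating effective concurrency as the minimum of the three limits. A sketch under stated assumptions: the 4 vCPU per container comes from the 64 ÷ 4 ceiling in this record, while expressing the Gemini quota as a "safe concurrency" number is a hypothetical simplification.

```python
def vcpu_ceiling(vcpu_quota: int, vcpu_per_task: int = 4) -> int:
    """Containers that fit inside the Fargate vCPU quota."""
    return vcpu_quota // vcpu_per_task


def effective_concurrency(dag_cap: int, vcpu_quota: int, gemini_safe_concurrency: int) -> int:
    """The tightest of the three limits decides real throughput.

    gemini_safe_concurrency is a hypothetical figure: the number of
    containers the Gemini quota can feed without returning 429s.
    """
    return min(dag_cap, vcpu_ceiling(vcpu_quota), gemini_safe_concurrency)


if __name__ == "__main__":
    # Today: DAG cap 12 under a 64 vCPU quota.
    print(effective_concurrency(12, 64, 100))
    # Raising only the DAG cap stops at the vCPU ceiling.
    print(effective_concurrency(32, 64, 100))
    # Raising DAG cap and vCPU but not Gemini stops at the API quota.
    print(effective_concurrency(32, 128, 20))
```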
⚠ Failure mode at scale

On Gemini 429, every segment fails, the container writes the "All segments failed analysis" sentinel, and alert_on_analysis_failure pings Discord. Already wired — it's the signal to widen quota.


08 · The decisions — what we'll do, and what waits

Three classes of work: ship-now code fixes (T1–T2), config dials (T3–T4), and headroom / structural choices (T5). The deepest lever (T6) is gated behind an eval harness.

ID | Decision | Type | Effect
T1 | Run the sync Supabase calls in the MCQ/Text handlers through asyncio.to_thread(…) so they leave the event loop | Code · now | Unblocks B1: MCQ ack drops to <200 ms under load
T2 | Kill the /tusd/hooks blocking S3 copy: set object metadata at upload time, or offload the copy and reuse one boto3 client | Code · now | Removes B2: ~117 blocking calls/s off the loops
T3 | BACKGROUND_MAX_WORKERS = 20 (24 → 80 analysis slots) | Config | Backlog ~70 → ~22 min
T4 | NUM_OF_FLASK_WORKERS = 8 | Config | Absorbs the test-end completion burst
T5 | Give TUS ≥4 GB headroom + alert now; cap the proctor-video DAG; later split TUS off the box and offload proctor-video ffmpeg to Fargate | Headroom | Removes B5/B6 risk; the structural path past 500
T6 | Parallelise / use a cheaper model inside the 4-call get_analysis() | Deferred | 2–6× per-answer speedup, only after evals
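The offloading idea behind T1, and the offload variant of T2, can be shown with the standard library alone. In this sketch blocking_call() is a stand-in that sleeps; in the real handlers it would be a sync Supabase query or an s3.copy_object call, neither of which is reproduced here.

```python
import asyncio
import time


def blocking_call(delay: float = 0.1) -> str:
    """Stand-in for a sync Supabase query or S3 copy; it just sleeps."""
    time.sleep(delay)
    return "ok"


async def handler_blocking() -> str:
    # Runs on the event loop: nothing else on this worker progresses meanwhile.
    return blocking_call()


async def handler_offloaded() -> str:
    # Runs in the default thread pool; the loop stays free for other requests.
    return await asyncio.to_thread(blocking_call)


async def timed(handler, n: int = 4) -> float:
    """Wall-clock seconds to serve n concurrent requests with one loop."""
    start = time.perf_counter()
    await asyncio.gather(*(handler() for _ in range(n)))
    return time.perf_counter() - start


def compare() -> tuple[float, float]:
    return asyncio.run(timed(handler_blocking)), asyncio.run(timed(handler_offloaded))


if __name__ == "__main__":
    blocked, offloaded = compare()
    print(f"blocking: {blocked:.2f}s, offloaded: {offloaded:.2f}s")
```

With the blocking handler, four concurrent requests are served one after another; with asyncio.to_thread (Python 3.9+) their waits overlap. The offload still consumes a pool thread per call, so for T2 removing the copy entirely is the stronger option.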
🔒 Why T6 waits — the gate we set

Changing the model or call structure risks degrading scoring quality in ways we can't see without ground truth. T6 is blocked until: (1) an eval harness exists with representative MCQ / Text / Audio answers across proficiency levels, (2) baseline scores are captured against today's gpt-5.1 pipeline, and (3) any change is A/B-tested on that set with an acceptable quality delta.

09 · When one server stops being enough

350 is reachable on uprod-be-1 after T1–T3. These are the tripwires that say "add capacity" — and what to split off first.

  • > 500 concurrent: beyond the headroom T1–T5 buy → add a stateless FastAPI replica behind Coolify's LB
  • Universal proctoring: every candidate streams webcam → split TUS off the box first, since it contends soonest
  • Zero analysis lag: if "eventually consistent" recruiter views aren't acceptable → more workers / replicas
  • > 70% baseline: uprod-be-1 CPU or RAM above 70% before peak (watch in BetterStack)

Scaling paths differ by workload: FastAPI is stateless → horizontal replica. Tasks → coordinated quota raise (DAG + AWS vCPU + Gemini), not an app server. Post-test proctor video → offload ffmpeg to Fargate, mirroring the task pipeline that already proves the pattern.