A capacity analysis of two very different concurrent workloads — proctored tests and Fargate-analysed tasks — mapping where the load lands, what saturates first, and the choices we made about each.
The platform mixes two concurrency models that share almost nothing. We analysed them separately because their bottlenecks, limits, and fixes don't overlap.
A candidate answers up to 10 questions while their webcam streams to TUS the whole time. Load is real-time and lands on the shared box.
[Diagram: /v2/…/answers (direct); /tusd/hooks back to FastAPI]
The candidate works in a sandbox (cheap for us). The heavy analysis runs later in an ephemeral container, off our servers entirely.
One proctored test candidate opens three concurrent paths. FastAPI, Flask, and TUS all live on the same box and compete for its CPU and RAM — the crowding is the story.
Answers are cheap and bursty. The webcam is the relentless background load — every 3 seconds, for the whole test.
| Question type | How it's graded | LLM work | Cost to concurrency |
|---|---|---|---|
| MCQ | Synchronous, inline: index vs. answer snapshot | None | Minimal: one DB read+write |
| Text | Background thread via schedule_once() | 4 sequential gpt-5.1 calls | High: holds a thread for 40–100 s |
| Audio | Background thread: STT first, then same path | STT (3–15 s) + the same 4-call analysis | Highest: STT on top of Text |
get_analysis() (shared/ai/ai_util.py:337) is shared by Text and Audio — only the input string differs (typed answer vs. transcript). MCQ never calls it.
Six pressure points for tests, plus the one by-design limit for tasks. Severity reflects what breaks first at 350 concurrent candidates, not how hard it is to fix.
| # | Where | Finding | Severity | Fix |
|---|---|---|---|---|
| B1 | Tests · FastAPI | Sync Supabase calls inside the async def answer handler freeze the event loop | Critical | T1 |
| B2 | Tests · TUS hook | /tusd/hooks runs a blocking s3.copy_object in async def, ~117×/s at peak | Critical | T2 |
| B3 | Tests · bg pool | 24-thread pool caps Text/Audio analysis throughput | High | T3 |
| B4 | Tests · Flask | 4 sync workers saturate in the test-end completion burst | Medium | T4 |
| B5 | Tests/Tasks · TUS | Single TUS instance shares uprod-be-1 RAM/CPU; SPOF | Medium | T5 |
| B6 | Tests · Airflow | Post-test proctor-video ffmpeg runs on the Airflow box, no max_active_runs | Medium | T5 |
| — | Tasks · Fargate | Capped at 12 (ceiling 16 = 64 vCPU ÷ 4); Gemini RPM/TPM on 429 | By design | levers |
B2 is the finding that reframed the whole analysis. Each 3-second webcam chunk triggers a blocking S3 copy on the very workers that serve answers. The demand outruns capacity on its own.
Even on the optimistic 30 ms-per-copy estimate, post-finish hooks eat ~88% of all four event loops before a single answer is processed. At 50 ms they're oversubscribed. A fresh boto3.client() is also built on every call — no connection reuse. Source: fastapi_service/routers/tus.py:96-107.
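The utilisation figures above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes 350 concurrent candidates, one post-finish hook per 3-second chunk, and that each copy blocks one of the four event loops for its full duration. The function name is hypothetical and not part of the codebase.

```python
def hook_loop_utilisation(candidates: int, chunk_interval_s: float,
                          copy_latency_s: float, event_loops: int) -> float:
    """Fraction of total event-loop time spent blocked in S3 copies."""
    # One hook per webcam chunk per candidate (assumption stated above).
    hooks_per_second = candidates / chunk_interval_s
    # Each hook blocks a loop for the whole copy, so blocked time adds up.
    blocked_seconds_per_second = hooks_per_second * copy_latency_s
    # Spread the blocked time across the available event loops.
    return blocked_seconds_per_second / event_loops


hooks_per_second = 350 / 3.0
optimistic = hook_loop_utilisation(350, 3.0, 0.030, 4)   # 30 ms per copy
pessimistic = hook_loop_utilisation(350, 3.0, 0.050, 4)  # 50 ms per copy

print(f"hooks/s at peak: {hooks_per_second:.1f}")
print(f"30 ms per copy: {optimistic:.1%} of loop time")
print(f"50 ms per copy: {pessimistic:.1%} of loop time")
```

Any value above 100% means the hooks alone demand more loop time than exists, before answer traffic is counted.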
Text/Audio answers queue behind 24 background threads. The answer returns "queued" instantly, so candidates don't feel it — but recruiter visibility lags. Widening the pool is a one-line config change.
Threads are >95% blocked on network IO (the GIL releases during socket calls), so they run in true parallel; the cost is ~8 MB RAM each. BACKGROUND_MAX_WORKERS=20 → 4×20 = 80 slots. Watch OpenAI TPM (gpt-5.1 reasoning can spike 400k+); RPM (~320) stays within Tier-2's 500.
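To illustrate why IO-blocked threads overlap, here is a minimal sketch using only the standard library. `fake_llm_call` is a hypothetical stand-in that sleeps instead of calling OpenAI (`time.sleep` releases the GIL, much like a blocking socket read), and the 8 MB per-thread figure and the 4×20 multiplier are taken from the text above rather than measured here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_llm_call(delay_s: float) -> float:
    """Hypothetical stand-in for a network-bound analysis call."""
    time.sleep(delay_s)  # GIL is released while the thread waits
    return delay_s


def run_batch(n_jobs: int, workers: int, delay_s: float) -> float:
    """Run n_jobs waits on a pool and return the wall-clock time taken."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_llm_call, [delay_s] * n_jobs))
    return time.perf_counter() - start


serial_estimate = 8 * 0.1            # what 8 calls would cost back to back
elapsed = run_batch(8, workers=8, delay_s=0.1)

# Slot and memory arithmetic using the figures quoted in the text.
POOL_MULTIPLIER = 4                  # the "4x" in 4x20 = 80
total_slots = POOL_MULTIPLIER * 20   # BACKGROUND_MAX_WORKERS = 20
approx_ram_mb = total_slots * 8      # ~8 MB per thread

print(f"serial estimate {serial_estimate:.2f}s, pooled {elapsed:.2f}s")
print(f"{total_slots} slots, roughly {approx_ram_mb} MB of thread overhead")
```

The pooled run should finish in about the time of a single call, which is the behaviour the wider pool relies on. It does nothing for CPU-bound work.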
Running 350 tasks at once barely touches our servers. The constraint is the analysis queue, deliberately capped so we never exceed the AWS Fargate vCPU quota.
Raising task concurrency takes three coordinated levers. Bumping one without the others just trips the next ceiling.
| Lever | Change |
|---|---|
| 1 · DAG cap | MAX_CONCURRENT_ANALYSES in the Fargate DAG (code + deploy) |
| 2 · AWS | Fargate vCPU 64 → 128 ⇒ ceiling 16 → 32 |
| 3 · Gemini | Raise API quota so higher concurrency doesn't 429 |
On Gemini 429, every segment fails, the container writes the "All segments failed analysis" sentinel, and alert_on_analysis_failure pings Discord. Already wired — it's the signal to widen quota.
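The interaction between the three ceilings can be sketched as a simple minimum. This is an illustration, not the DAG's actual logic: the 4 vCPU per task comes from the "64 vCPU ÷ 4" figure, while `gemini_safe_concurrency` is a hypothetical number standing in for however many parallel analyses the Gemini quota tolerates before returning 429s.

```python
VCPU_PER_TASK = 4  # implied by "ceiling 16 = 64 vCPU / 4"


def fargate_ceiling(vcpu_quota: int) -> int:
    """Maximum simultaneous Fargate tasks the vCPU quota allows."""
    return vcpu_quota // VCPU_PER_TASK


def effective_concurrency(dag_cap: int, vcpu_quota: int,
                          gemini_safe_concurrency: int) -> int:
    """The lowest of the three ceilings is the one that binds."""
    return min(dag_cap, fargate_ceiling(vcpu_quota), gemini_safe_concurrency)


# Today: DAG cap 12, quota 64 vCPU. The Gemini figure (20) is a placeholder.
today = effective_concurrency(12, 64, 20)

# Raising only the DAG cap hits the Fargate ceiling next.
dag_only = effective_concurrency(40, 64, 20)

# Raising the DAG cap and the vCPU quota leaves Gemini as the binding limit.
dag_and_aws = effective_concurrency(40, 128, 20)

print(today, dag_only, dag_and_aws)
```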
Three classes of work: ship-now code fixes (T1–T2), config dials (T3–T4), and headroom / structural choices (T5). The deepest lever (T6) is gated behind an eval harness.
| ID | Decision | Type | Effect |
|---|---|---|---|
| T1 | Wrap MCQ/Text handlers in asyncio.to_thread(…) so sync Supabase calls leave the event loop | Code · now | Unblocks B1: MCQ ack drops to <200 ms under load |
| T2 | Kill the /tusd/hooks blocking S3 copy: set object metadata at upload time, or offload + reuse one boto3 client | Code · now | Removes B2: ~117 blocking calls/s off the loops |
| T3 | BACKGROUND_MAX_WORKERS = 20 (24 → 80 analysis slots) | Config | Backlog 70 → ~22 min |
| T4 | NUM_OF_FLASK_WORKERS = 8 | Config | Absorbs the test-end completion burst |
| T5 | Give TUS ≥4 GB headroom + alert now; cap the proctor-video DAG; later split TUS off the box & offload its ffmpeg to Fargate | Headroom | Removes B5/B6 risk; the structural path past 500 |
| T6 | Parallelise / use a cheaper model inside the 4-call get_analysis() | Deferred | 2–6× per-answer speedup, only after evals |
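The offload pattern behind T1 (and the "offload" variant of T2) can be sketched with the standard library alone. Everything here is a stand-in: `save_answer_sync` simulates a synchronous Supabase call with `time.sleep`, and `answer_handler` is a hypothetical coroutine rather than the real FastAPI route. `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import time


def save_answer_sync(answer_id: int) -> str:
    """Hypothetical blocking call, standing in for a sync Supabase write."""
    time.sleep(0.2)
    return f"saved:{answer_id}"


async def answer_handler(answer_id: int) -> str:
    # Run the blocking call on a worker thread so the event loop stays free.
    return await asyncio.to_thread(save_answer_sync, answer_id)


async def heartbeat(ticks: list, stop: asyncio.Event) -> None:
    """Record a tick every 10 ms; it only advances if the loop is not blocked."""
    while not stop.is_set():
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.01)


async def demo() -> tuple:
    ticks: list = []
    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(ticks, stop))
    result = await answer_handler(1)
    stop.set()
    await hb
    return result, len(ticks)


result, tick_count = asyncio.run(demo())
print(result, "heartbeat ticks while saving:", tick_count)
```

If `save_answer_sync` were called directly inside the coroutine, the heartbeat would record almost no ticks during the save. That is the freeze described in B1.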
Changing the model or call structure risks degrading scoring quality in ways we can't see without ground truth. T6 is blocked until: (1) an eval harness exists with representative MCQ / Text / Audio answers across proficiency levels, (2) baseline scores are captured against today's gpt-5.1 pipeline, and (3) any change is A/B-tested on that set with an acceptable quality delta.
350 is reachable on uprod-be-1 after T1–T3. These are the tripwires that say "add capacity" — and what to split off first.
Scaling paths differ by workload: FastAPI is stateless → horizontal replica. Tasks → coordinated quota raise (DAG + AWS vCPU + Gemini), not an app server. Post-test proctor video → offload ffmpeg to Fargate, mirroring the task pipeline that already proves the pattern.