the audit
Six tics made every document look generated. Here is what replaces them.
The old system had good bones — archetypes, self-contained rendering, real diagrams. But its mandates for a “memorable signature,” dramatic scale, and atmospheric backgrounds pushed every page toward the same recognizable AI look.
01 What the audit found
| Before — the tic | After — the rule |
|---|---|
| Gradient-filled headline text (background-clip:text) | Ink at scale. Size is the emphasis, never a fill effect. |
| Pill badges on everything (border-radius:999px, filled) | Pills only for status, outlined, square-cornered. |
| Every block in a rounded, drop-shadowed card | Hairline rules and whitespace. A card must earn its border. |
| Centered hero: eyebrow + huge headline + subtitle + emoji | Left-aligned title block; one eyebrow per document, at the top. |
| Balanced five-color palettes; status hues used decoratively | Ink does the work. One accent, under 10% of the surface. |
| The same warm radial-gradient wash behind every document | Flat paper. Variety comes from type and accent, not texture. |
02 Before / after
Before: Supercharging Our Infrastructure
After: Infrastructure held at 4× traffic. Two systems didn’t.
What scaled, what buckled, and the three fixes we’re funding next quarter.
30 Jun 2026 · platform team · internal
03 The ten rules
- Ink does the work. Color annotates; it never carries the hierarchy.
- One accent per document, used on under 10% of the surface: eyebrow, section numbers, links, one rule.
- Hairlines over shadows. Corner radius 4px or less. Shadows only on genuinely interactive elements.
- A card must earn its border. The default container is whitespace and a horizontal rule.
- Left-align. No centered heroes. Titles state a finding, not a theme.
- Hierarchy comes from the type scale, not from boxes, backgrounds, or weight-800 sans.
- Pills are for status only — outlined, uppercase, 11px. Everything else is plain text.
- No gradient text, no emoji as icons, no decorative gradients anywhere.
- Diagrams are monochrome plus the accent. Every box and arrow labeled; a legend only past two colors.
- Status colors mean status. Green/amber/red appear only when something is actually good, at risk, or broken.
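
Several of these rules are mechanical enough to check automatically. The following is a minimal sketch of such a check, not part of the design system itself: the pattern set and the function names `find_tics` and `radius_violations` are assumptions drawn from the audit table and the rules above, and a shadow or gradient hit is only a prompt for review, since shadows remain allowed on interactive elements.

```python
import re

# Hypothetical patterns, derived from the audit table and rules above.
# They flag candidates for review; they are not an official rule set.
TIC_PATTERNS = {
    "gradient text": re.compile(r"background-clip\s*:\s*text", re.I),
    "pill radius": re.compile(r"border-radius\s*:\s*999px", re.I),
    "decorative gradient": re.compile(r"(?:linear|radial|conic)-gradient\s*\(", re.I),
    "drop shadow": re.compile(r"box-shadow\s*:", re.I),
}


def find_tics(css: str) -> list[str]:
    """Return the name of every tic whose pattern appears in the CSS text."""
    return [name for name, pattern in TIC_PATTERNS.items() if pattern.search(css)]


def radius_violations(css: str, max_px: int = 4) -> list[int]:
    """Return every pixel border-radius value larger than max_px."""
    radii = [int(m) for m in re.findall(r"border-radius\s*:\s*(\d+)px", css, re.I)]
    return [r for r in radii if r > max_px]
```

Running `find_tics` over a document's single `<style>` block before publishing would catch the most recognizable tics; judgment calls such as whether a card earns its border still need a human reviewer.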
04 Tokens
Neutrals — shared by every document
Accents — one per document, never together
Each tab in this showcase uses a different accent — that is the entire per-document theming budget. Same paper, same ink, same hairlines everywhere.
Type scale — Georgia display over system sans
RFC-041 · platform
Take image processing off the request path
Uploads block on resize and transcode today. This RFC proposes returning 202 immediately and processing asynchronously through a queue.
Upload p95 is 3.1s because resize/transcode runs inline. Move processing to a queue and worker pool, return 202 Accepted with a status URL, and target p95 under 400ms. Cost: one new managed component (SQS). Rollout is dual-write behind a flag, two weeks.
01 Problem
Every image upload synchronously resizes to four variants and transcodes to WebP before responding. At current volume (~140k uploads/day) this holds a request thread for 2–4 seconds, dominates API p95, and couples upload availability to ImageMagick failures — 31% of our 5xx responses in June traced to processing errors, not upload errors.
02 Goals and non-goals
Goals
- Upload p95 under 400ms
- Processing failures never fail the upload
- Client can poll or subscribe for readiness
Non-goals
- Changing the variant set or formats
- On-the-fly transformation API
- Migrating existing stored images
03 Proposed design
The API stores the original and enqueues a job. Workers pull, process, write variants, and flip the asset’s status. The client gets a status_url in the 202 response.
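
The flow described above can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the RFC's implementation: `queue.Queue` stands in for SQS, a dict stands in for the asset store, and the status URL shape, the status values, and the four variant names are hypothetical.

```python
import queue
import uuid

jobs = queue.Queue()   # stand-in for SQS (assumption)
assets = {}            # stand-in for the asset store (assumption)


def handle_upload(image_bytes: bytes):
    """Store the original, enqueue a job, and answer 202 without processing."""
    asset_id = uuid.uuid4().hex
    assets[asset_id] = {"original": image_bytes, "status": "processing", "variants": {}}
    jobs.put({"asset_id": asset_id})
    # The URL shape is hypothetical; the RFC only says the 202 carries a status_url.
    return 202, {"status_url": f"/assets/{asset_id}/status"}


def work_once() -> None:
    """Pull one job, write placeholder variants, and flip the asset's status."""
    job = jobs.get_nowait()
    asset = assets[job["asset_id"]]
    # Real resize and transcode would run here; the variant names are made up.
    asset["variants"] = {name: b"" for name in ("thumb", "small", "medium", "large")}
    asset["status"] = "ready"


def get_status(asset_id: str) -> dict:
    """What a client polling the status URL would see."""
    return {"status": assets[asset_id]["status"]}
```

The point of the shape is that `handle_upload` never touches image processing, so a processing failure cannot fail the upload.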
04 Alternatives considered
| Option | Upload p95 | New infra | Verdict |
|---|---|---|---|
| Async queue + workers | ~350ms | SQS + worker deploy | Proposed |
| Bigger API instances | ~1.9s | none | Cost scales with the wrong axis; failures still coupled |
| Process in Lambda on S3 event | ~350ms | Lambda + IAM surface | Viable; weaker retry semantics, cold-start tail |
05 Rollout
- Week 0 — queue + worker deployed, shadow traffic only
- Week 1 — dual-write behind async_images flag, 5% → 50%
- Week 2 — 100%, remove inline path, delete ImageMagick from API image
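
One common way to ramp a flag such as async_images from 5% to 50% to 100% is deterministic hash bucketing, so anyone enabled at 5% stays enabled at every later step. The sketch below assumes that approach; the RFC does not say how the flag is evaluated, and `in_rollout` and the subject identifier are hypothetical.

```python
import hashlib


def in_rollout(flag: str, subject_id: str, percent: int) -> bool:
    """Deterministically place subject_id in one of 100 buckets for this flag."""
    digest = hashlib.sha256(f"{flag}:{subject_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    # Buckets 0..percent-1 are enabled, so raising percent only ever adds subjects.
    return bucket < percent
```

Including the flag name in the hash keeps the 5% cohort for one flag independent of the 5% cohort for any other.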
06 Risks
| Risk | Mitigation |
|---|---|
| Clients that assume variants exist at 200 | Audit found 3 call sites; all migrate to status_url polling this sprint |
| Queue backlog during traffic spikes | Worker autoscale on queue depth; alert at 5-minute lag |
| Lost jobs | SQS at-least-once + idempotent workers; DLQ with replay runbook |
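
The "Lost jobs" mitigation relies on two properties: a duplicate delivery must be harmless, and a job that keeps failing must be parked rather than retried forever. The following sketch simulates both in memory. It is an illustration only: `run_worker`, `max_receives`, and the list-based dead-letter queue are hypothetical stand-ins, not SQS API calls.

```python
from collections import Counter


def run_worker(deliveries, handler, max_receives=3):
    """Process at-least-once deliveries; return (done, dead_letter)."""
    done, dead_letter = set(), []
    receives = Counter()
    pending = list(deliveries)
    while pending:
        job_id = pending.pop(0)
        if job_id in done:
            continue  # duplicate delivery of a finished job: skip it
        receives[job_id] += 1
        try:
            handler(job_id)
            done.add(job_id)
        except Exception:
            if receives[job_id] >= max_receives:
                dead_letter.append(job_id)  # parked for manual replay
            else:
                pending.append(job_id)      # redelivered later
    return done, dead_letter
```

In a real deployment the `done` record would need to be durable and shared across workers, for example a status column on the asset row, since an in-process set is lost on restart.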
platform health · week 26
Steady week. Ingest error rate is the one thing to watch.
01 p95 latency, last 14 days
02 Error rate by service
Ingest is elevated because retries against the legacy image path double-count during the RFC-041 rollout. Expected to normalize when the inline path is removed in week 28.
03 Active alerts
| Alert | Service | Since | State |
|---|---|---|---|
| Queue lag over 5 min | ingest | Jun 27 | Investigating |
| Cert expires in 20 days | edge | Jun 29 | Scheduled |
| Elevated 429s, one tenant | api | Jun 28 | Resolved |
decision · infrastructure
Queue backend for the image pipeline
Three candidates evaluated against six criteria for RFC-041. Weighted toward operational burden — this team runs no 24/7 on-call for infra.
SQS. It concedes throughput ceiling and strict ordering — neither of which this workload needs — and wins everywhere we actually feel pain: zero ops, native DLQ, per-message visibility timeouts. Confidence: high.
01 Scoring matrix
| Criterion | SQS · recommended | Redis Streams | Kafka |
|---|---|---|---|
| Ops burden (×2 weight) | ●●●●● | ●●●○○ | ●○○○○ |
| Retry / DLQ semantics | ●●●●● | ●●○○○ | ●●●○○ |
| Throughput ceiling | ●●●○○ | ●●●●○ | ●●●●● |
| Ordering guarantees | ●●○○○ | ●●●●○ | ●●●●● |
| Team familiarity | ●●●●○ | ●●●●○ | ●●○○○ |
| Cost at our scale | ●●●●○ ~$40/mo | ●●●○○ ~$120/mo | ●○○○○ ~$900/mo |
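
The matrix can be collapsed into one number per candidate. The sketch below assumes the simplest reading: each filled dot is one point, ops burden counts double as the ×2 label states, every other criterion has weight 1, and the total is a plain sum. The document does not state how, or whether, the scores were totaled, so the aggregation is an assumption.

```python
WEIGHTS = {"ops": 2}  # every other criterion defaults to weight 1 (assumption)

# Dot counts transcribed from the matrix above.
SCORES = {
    "SQS":           {"ops": 5, "retry": 5, "throughput": 3, "ordering": 2, "familiarity": 4, "cost": 4},
    "Redis Streams": {"ops": 3, "retry": 2, "throughput": 4, "ordering": 4, "familiarity": 4, "cost": 3},
    "Kafka":         {"ops": 1, "retry": 3, "throughput": 5, "ordering": 5, "familiarity": 2, "cost": 1},
}


def weighted_total(scores: dict) -> int:
    """Sum each criterion's dots multiplied by its weight."""
    return sum(value * WEIGHTS.get(criterion, 1) for criterion, value in scores.items())


totals = {name: weighted_total(s) for name, s in SCORES.items()}
# totals == {"SQS": 28, "Redis Streams": 23, "Kafka": 18}
```

Under these assumptions SQS leads even without the double weight on ops burden (23 against 20 and 17), so the recommendation does not hinge on the weighting.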
02 Where each one wins — and what we give up
SQS
Recommended
Zero servers, IAM-native, DLQ and visibility timeouts out of the box. Trade-off we accept: ~3k msg/s soft ceiling per queue and best-effort ordering. The pipeline is embarrassingly parallel and peaks at 40 msg/s, roughly 75× below that ceiling.
Redis Streams
Fastest option and we already run Redis for caching. But consumer-group DLQ behavior is hand-rolled, and it couples job durability to a cache instance we currently treat as disposable.
Kafka
The right answer at 100× our volume or with multiple consumers replaying history. Today it is three brokers of ops burden for guarantees nothing downstream consumes.
roadmap · search relaunch
Search relaunch, H2 2026
01 Phases
Foundations
Done · May 5 – Jun 13
- Cluster provisioned; embedding model selected (bge-small, self-hosted)
- Golden query set — 400 queries with judged relevance
- Offline eval harness reporting nDCG@10 per experiment
Indexing pipeline
In progress · Jun 16 – Aug 15
- Change-data-capture from Postgres into the index queue
- Shadow-index 10% → 100% of live writes ← current
- Backfill full corpus (28M docs) with checkpointed batch job
Query & ranking
Upcoming · Sep 1 – Oct 24
- Hybrid retrieval (BM25 + ANN) behind the existing search API
- Interleaving experiment against production ranking
- Ship if nDCG@10 improves ≥ 8% with p95 under 250ms
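
The ship criterion can be made concrete with a small calculation. This sketch makes three labeled assumptions: gain is linear in the judged relevance grade, the ideal ranking is taken from the same result list rather than from all judged documents, and "improves ≥ 8%" means a relative improvement over the production baseline. The function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top k results (linear gain)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the same grades in ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def should_ship(baseline_ndcg, candidate_ndcg, p95_ms, min_gain=0.08, max_p95_ms=250):
    """Apply the phase-3 gate: relative nDCG gain and a latency ceiling."""
    gain = (candidate_ndcg - baseline_ndcg) / baseline_ndcg
    return gain >= min_gain and p95_ms < max_p95_ms
```

For the 400-query golden set, the per-experiment figure would be the mean of `ndcg_at_k` across queries; if the team intends an absolute 8-point gain instead, the `gain` line is the only one that changes.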
GA rollout
Upcoming · Nov 2 – Nov 27
- 10% → 100% traffic; ES6 read path removed
- Decommission legacy cluster (saves ~$3.1k/mo)
02 What’s next, and what could slip it
| Immediate (2 weeks) | Decisions needed | Risks |
|---|---|---|
| Shadow indexing to 100%; backfill dry-run on 1M docs | Reranker: ship with hybrid only, or add cross-encoder in phase 3? owner: mira, by Jul 11 | Backfill contends with nightly analytics load on the Postgres replica — may need a dedicated replica (+1 week) |
runbook · publishing
Publishing runbook
How to publish, update, and debug documents on html-docs. Answers are under 150 words; commands are copy-ready.
Getting started
How do I publish a page?
One POST, no account needed:
curl -sS -X POST https://www.html-docs.com/api/v1/docs -H 'Content-Type: text/html' --data-binary @page.html
The response contains url (share this) and token (keep it — it authorizes every later edit).
Can I choose my own URL?
Add an X-Slug: my-name header and the page is served at /site/my-name. Slugs are first-come, first-served; omit the header for an auto-generated one.
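
For scripts that publish from Python rather than the shell, the same request can be built with the standard library. This is a sketch under stated assumptions: the endpoint, the Content-Type, and the X-Slug header come from the answers above, while the helper name `build_publish_request` is hypothetical and the response is assumed, not confirmed, to be JSON.

```python
import urllib.request

API = "https://www.html-docs.com/api/v1/docs"


def build_publish_request(html: str, slug=None) -> urllib.request.Request:
    """Build the POST described above; add X-Slug only when a slug is given."""
    headers = {"Content-Type": "text/html"}
    if slug is not None:
        headers["X-Slug"] = slug
    return urllib.request.Request(
        API, data=html.encode("utf-8"), headers=headers, method="POST"
    )


# To send it (assuming a JSON body containing url and token):
#   import json
#   with urllib.request.urlopen(build_publish_request(page_html)) as resp:
#       result = json.load(resp)
```

Building the request separately from sending it makes the headers easy to inspect before anything is published.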
Updating
How do I update a published page?
PUT replaces the whole document:
curl -sS -X PUT https://www.html-docs.com/api/v1/docs/<id> -H 'x-doc-token: <token>' -H 'Content-Type: text/html' --data-binary @new.html
If the doc has comments, prefer PATCHing a single region instead — a full PUT orphans comment anchors.
I lost the token. Can I still edit?
Only if the doc belongs to your account: an API key (Authorization: Bearer hdk_…) works on all docs you own. Anonymous docs with a lost token are immutable — republish and share the new URL.
Troubleshooting
My page renders blank or unstyled
Almost always an external dependency. The viewer renders inside a sandboxed shadow DOM: CDN scripts, external stylesheets, and remote fonts are stripped. Inline all CSS in one <style> block and hand-author charts as inline SVG.
My chart library doesn’t draw anything
Same cause — Chart.js, D3, and Mermaid load from CDNs and are blocked. Bars are <rect>s, lines are <polyline>s. It is less code than the library setup for most doc-sized charts.
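
As an illustration of how little code a doc-sized chart needs, here is a minimal sketch that emits an inline SVG bar chart as a string, ready to paste into the page. The function name, the dimensions, and the default fill are assumptions, and the values in the usage note are made up.

```python
from xml.sax.saxutils import escape


def bar_chart_svg(data, width=320, height=120, gap=8, fill="#222"):
    """Render (label, value) pairs as an inline SVG bar chart string."""
    peak = max(value for _, value in data) or 1   # avoid dividing by zero
    bar_w = (width - gap * (len(data) + 1)) / len(data)
    label_h = 16                                  # space reserved for labels
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" '
             f'viewBox="0 0 {width} {height}" role="img">']
    for i, (label, value) in enumerate(data):
        bar_h = (height - label_h) * value / peak
        x = gap + i * (bar_w + gap)
        y = height - label_h - bar_h
        parts.append(f'<rect x="{x:.1f}" y="{y:.1f}" width="{bar_w:.1f}" '
                     f'height="{bar_h:.1f}" fill="{fill}"/>')
        parts.append(f'<text x="{x + bar_w / 2:.1f}" y="{height - 4}" '
                     f'font-size="11" text-anchor="middle">{escape(label)}</text>')
    parts.append("</svg>")
    return "".join(parts)
```

Calling `bar_chart_svg([("api", 2), ("edge", 1), ("ingest", 4)])` returns one self-contained `<svg>` element with three `<rect>` bars and three labels, with no script or external stylesheet for the sandbox to strip.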
Check the live API contract at GET /api/v1, or ask in #html-docs. Include the doc id and the exact curl command you ran.
weekly debrief · platform
Week 26: the async pipeline shipped to shadow, and we chose our queue
- Image pipeline — RFC-041 approved by 2 of 3 reviewers; queue + workers live in shadow mode, processing 100% of uploads with zero user-facing writes.
- Search relaunch — CDC indexing turned on for 10% of writes; eval harness caught a tokenizer bug that would have silently hurt recall.
- Ops — one warn-level incident (queue lag, 22 min, no data loss); June cost review came in 4% under budget.
01 Image pipeline
What happened: the async processing path from RFC-041 went from design to running code. Workers process every upload in shadow mode; output variants are diffed against the inline path nightly — 0 mismatches across 610k images so far.
Why it matters: this is the last de-risking step before the flag flips next week. Upload p95 on the shadow path is 341ms against a 400ms target.
SQS over Redis Streams and Kafka for the job queue — ops burden was the deciding criterion. Full matrix in the comparison doc. Revisit if sustained volume exceeds 1k msg/s.
02 Search relaunch
What happened: change-data-capture indexing is live on 10% of writes. The offline eval harness flagged that the new analyzer dropped hyphenated tokens (t-shirt → shirt), cutting recall on 3% of the golden set — fixed before any user saw it.
Why it matters: the harness paid for itself in week one. Every ranking change now gets a scored nDCG@10 report in CI instead of a vibes check.
Backfill gets a dedicated Postgres replica. Costs ~$400 for six weeks; removes the contention risk with nightly analytics that threatened the Aug 15 milestone.
03 Action items
| Item | Owner | Due |
|---|---|---|
| Flip async_images to 5% and watch DLQ for 48h | priya | Jul 2 |
| Migrate 3 call sites that assume variants at 200 | sam | Jul 4 |
| Reranker recommendation memo (phase-3 scope) | mira | Jul 11 |
| Provision backfill replica + teardown date on calendar | dev | Jul 8 |