Technical Specification for html-docs.com — v1.0, May 25 2026
Goal: Near 1:1 visual reproduction of PDF documents as editable HTML. Academic papers, reports, slide decks, and scanned documents should look like they came from the same printer, but in a live, editable web format.
Current state: PDF import exists but extracts raw text with pdfjs, then asks an LLM to "redesign" it. This loses all visual fidelity (layout, typography, spacing, columns) and times out on files over ~5 MB due to the synchronous server-action pipeline running inside Vercel's function timeout.
pdfjs-dist supports server-side rendering via its page.render() API, but requires a Canvas implementation. @napi-rs/canvas provides prebuilt native binaries for linux-x64-gnu (Vercel's serverless runtime) without needing system-level dependencies like Cairo.
Fallback: If canvas rendering fails (e.g., exotic fonts, corrupt page), the page falls back to the existing text+image extraction pipeline. The vision API still gets the extracted content, just without the visual layout reference.
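// lib/pdf/render.ts
import { createCanvas } from '@napi-rs/canvas'
import type { PDFPageProxy } from 'pdfjs-dist'

// Render one PDF page to a PNG buffer at the given scale.
async function renderPage(page: PDFPageProxy, scale: number = 2.0): Promise<Buffer> {
  const viewport = page.getViewport({ scale })
  // Viewport sizes can be fractional; round up to whole pixels for the canvas.
  const canvas = createCanvas(Math.ceil(viewport.width), Math.ceil(viewport.height))
  const context = canvas.getContext('2d')
  // The cast bridges the @napi-rs/canvas context type to the DOM type pdfjs declares.
  await page.render({ canvasContext: context as unknown as CanvasRenderingContext2D, viewport }).promise
  return canvas.toBuffer('image/png')
}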
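The fallback described above could be wrapped around the render step roughly as follows. This is a minimal sketch, not the existing implementation: `PageInput`, `preparePage`, and the two injected callbacks are hypothetical names standing in for `renderPage()` and the existing extractor.

```typescript
// Hypothetical shape of what is handed to the vision step for one page.
interface PageInput {
  pageNumber: number
  image: Uint8Array | null // null when canvas rendering failed
  extractedText: string
}

// Try to render the page; on any render failure, continue with text only.
async function preparePage(
  pageNumber: number,
  renderToPng: () => Promise<Uint8Array>, // assumed wrapper around renderPage()
  extractText: () => Promise<string>, // assumed wrapper around the existing extractor
): Promise<PageInput> {
  const extractedText = await extractText()
  try {
    const image = await renderToPng()
    return { pageNumber, image, extractedText }
  } catch {
    // Exotic fonts or a corrupt page: fall back to the extracted content alone.
    return { pageNumber, image: null, extractedText }
  }
}
```

Injecting the two callbacks keeps the fallback logic independent of pdfjs and the canvas binding, which makes it easy to exercise without a real PDF.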
Scale factor: 2.0x gives good visual fidelity for vision model input without exceeding GPT-4o's image token budget (1224 x 1584 px for US Letter, estimated at ~1200 tokens per page image at "high" detail).
| Format | Points (W x H) | Rendered PNG @ 2x | CSS Equivalent |
|---|---|---|---|
| US Letter | 612 x 792 | 1224 x 1584 | 8.5in x 11in |
| A4 | 595 x 842 | 1190 x 1684 | 210mm x 297mm |
| Legal | 612 x 1008 | 1224 x 2016 | 8.5in x 14in |
| Custom | Arbitrary | Capped at 2000px wide | Computed from ratio |
Page dimensions are read from page.getViewport() and stored per-page. The final HTML preserves these dimensions via CSS @page rules and max-width on the content container, so the document looks correct at all screen sizes and prints accurately.
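The arithmetic behind the table can be sketched as a small helper. The function name `renderSize` and its return shape are assumptions for illustration; the 2.0 scale and the 2000px width cap come from the table above.

```typescript
interface RenderSize {
  widthPx: number
  heightPx: number
  scale: number // the scale actually used after applying the width cap
}

// Convert a page size in PDF points to render dimensions in pixels.
// At scale 1.0 one point maps to one pixel, so 612pt at 2x is 1224px.
function renderSize(
  widthPt: number,
  heightPt: number,
  scale = 2.0,
  maxWidthPx = 2000, // cap from the "Custom" row of the table
): RenderSize {
  // Lower the scale if the requested width would exceed the cap.
  const effectiveScale = Math.min(scale, maxWidthPx / widthPt)
  return {
    widthPx: Math.round(widthPt * effectiveScale),
    heightPx: Math.round(heightPt * effectiveScale),
    scale: effectiveScale,
  }
}
```

For example, `renderSize(612, 792)` yields 1224 x 1584, while an oversized 1500 x 1000pt page is scaled down so that its width stays at 2000px.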
Each page image is sent to GPT-4o (or gpt-4.1 when available) with this instruction pattern:
System: You reproduce PDF pages as faithful, semantic HTML+CSS.
Your output must visually match the source image as closely as possible.
Rules:
- Use semantic HTML: h1-h6, p, table, figure, blockquote, ul/ol, etc.
- Reproduce the exact visual layout: column count, spacing, alignment
- Preserve all text verbatim -- do not summarize, rewrite, or omit
- For math/equations: use inline SVG or HTML entity approximations
- For figures: output a <figure> placeholder with [FIGURE_N] marker
  and the caption text. Figures will be injected in post-processing.
- Tables must reproduce the exact cell structure and content
- Use CSS Grid or Flexbox for multi-column layouts
- All styles must be inline or in a single <style> block
- Typography: match the source font style (serif/sans), size ratios,
  weight, and color as closely as possible
- No external dependencies. No JavaScript. No remote fonts.
User: [page image attached]
Page {N} of {total}. Page dimensions: {width}pt x {height}pt.
{extracted_text_for_reference}
The extracted text from pdfjs is included as a reference to help with OCR accuracy -- the vision model can cross-check its reading against the programmatic text extraction. This catches cases where the image is slightly blurry but the text layer is clean.
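Filling in the user-message template might look like the sketch below. `PagePromptInput` and `buildUserPrompt` are hypothetical names; the wording of the message follows the template above.

```typescript
interface PagePromptInput {
  pageNumber: number
  totalPages: number
  widthPt: number
  heightPt: number
  extractedText: string
}

// Build the text part of the user message that accompanies the page image.
function buildUserPrompt(input: PagePromptInput): string {
  const header =
    `Page ${input.pageNumber} of ${input.totalPages}. ` +
    `Page dimensions: ${input.widthPt}pt x ${input.heightPt}pt.`
  const text = input.extractedText.trim()
  // Pages without a text layer have nothing to cross-check; send the header alone.
  return text.length > 0 ? `${header}\n${text}` : header
}
```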
Academic papers commonly use two-column layouts. The vision model is instructed to detect and reproduce these using CSS Grid:
.two-column {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 1.5rem;
  max-width: 900px;
}
@media (max-width: 720px) {
  .two-column { grid-template-columns: 1fr; }
}
The stitching pass verifies column detection is consistent across pages and normalizes the grid definitions.
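One way the consistency check could work is a simple majority vote over the column count detected on each page. This is only a sketch of that idea: the spec does not say how the stitching pass decides, and `checkColumnConsistency` is a hypothetical helper.

```typescript
// Given the column count detected on each page, return the most common
// count and the 1-based page numbers that disagree with it.
function checkColumnConsistency(columnsPerPage: number[]): {
  dominant: number
  outliers: number[]
} {
  const tally = new Map<number, number>()
  for (const count of columnsPerPage) {
    tally.set(count, (tally.get(count) ?? 0) + 1)
  }
  let dominant = 1
  let best = 0
  tally.forEach((pages, count) => {
    // Ties keep the count that was seen first.
    if (pages > best) {
      dominant = count
      best = pages
    }
  })
  const outliers: number[] = []
  columnsPerPage.forEach((count, i) => {
    if (count !== dominant) outliers.push(i + 1)
  })
  return { dominant, outliers }
}
```

Outliers are reported rather than rewritten, since a single-column title page or full-width figure page can be legitimate in an otherwise two-column paper.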
Tables are reproduced by the vision model directly from the page image. The extracted text serves as a validation layer -- if the vision model's table content doesn't match the text extraction, the system flags it for a second pass. Complex tables (merged cells, nested headers) are handled by the vision model's spatial understanding of the image rather than programmatic detection.
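The validation step could be approximated by checking that each table cell's text occurs somewhere in the extracted page text. The helper names and the 10% tolerance below are assumptions for illustration, not values from the spec.

```typescript
// Collapse whitespace and case so layout differences do not cause mismatches.
function normalize(text: string): string {
  return text.replace(/\s+/g, ' ').trim().toLowerCase()
}

// Return the non-empty cells whose text cannot be found in the extracted text.
function findUnmatchedCells(cells: string[], extractedText: string): string[] {
  const haystack = normalize(extractedText)
  return cells.filter((cell) => {
    const needle = normalize(cell)
    return needle.length > 0 && haystack.indexOf(needle) === -1
  })
}

// Flag the table when more than `tolerance` of its non-empty cells are unmatched.
function needsSecondPass(cells: string[], extractedText: string, tolerance = 0.1): boolean {
  const nonEmpty = cells.filter((c) => normalize(c).length > 0)
  if (nonEmpty.length === 0) return false
  return findUnmatchedCells(cells, extractedText).length / nonEmpty.length > tolerance
}
```

Substring matching is deliberately loose: it tolerates reordered cells, but a short value such as "93" would also match inside "1934", so a production check would likely need token-level comparison.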
Three-tier approach:
<img> with appropriate alt text
Future: Add MathML or KaTeX rendering as a post-processing option when the user opts in.
The current pipeline extracts raster images from the PDF's internal stream using pdfjs's operator list. This captures figures, charts, and photos but misses vector graphics and complex composited images.
New approach (hybrid extraction):
[FIGURE_N] placeholder the vision model outputs, identify the bounding box on the rendered page image and crop that region. This captures vector graphics, charts, and composited content that raster extraction misses.
All images are uploaded to Supabase Storage and referenced by public URL in the final HTML, same as the current pipeline.
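Once the figure images are uploaded, the post-processing injection could be a plain marker substitution. The sketch below assumes a hypothetical `injectFigures` helper and a map from figure number to public URL; how that map is built is outside its scope.

```typescript
// Replace each [FIGURE_N] marker with an <img> pointing at the uploaded image.
// Markers without a known URL are left in place so they can be retried later.
function injectFigures(html: string, urls: Record<number, string>): string {
  return html.replace(/\[FIGURE_(\d+)\]/g, (marker: string, n: string) => {
    const url = urls[Number(n)]
    return url ? `<img src="${url}" alt="Figure ${n}">` : marker
  })
}
```

The URLs are inserted unescaped here, which is acceptable only because they are generated by the pipeline itself rather than supplied by users.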
| Step | Action | Duration (est.) | Status Label |
|---|---|---|---|
| 1 | Upload PDF to Supabase Storage | 1-3s | "Uploading PDF" |
| 2 | Parse PDF metadata (page count, title, dimensions) | 1-2s | "Reading PDF" |
| 3 | Render pages to PNG (parallel, batches of 8) | 2-10s | "Rendering page N of M" |
| 4 | Extract text + raster images (existing pipeline) | 2-5s | "Extracting content" |
| 5 | Vision API calls (parallel, batches of 4) | 15-90s | "Converting page N of M" |
| 6 | Upload extracted images to Storage | 2-5s | "Processing figures" |
| 7 | Stitch pages + normalize CSS | 5-15s | "Assembling document" |
| 8 | Create document + editable regions | 1-2s | "Finalizing" |
Total estimated time: 30-120 seconds for a typical 10-page academic paper. The client sees real progress at every step.
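The "parallel, batches of N" steps in the table can be sketched with a small batching helper. `chunk` and `runInBatches` are hypothetical names; the batch sizes (8 for rendering, 4 for vision calls) would be passed in by the caller.

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error('batch size must be at least 1')
  const batches: T[][] = []
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size))
  }
  return batches
}

// Run `worker` over all items one batch at a time; items inside a batch run
// in parallel. `onProgress` fires after each batch completes.
async function runInBatches<T, R>(
  items: T[],
  size: number,
  worker: (item: T) => Promise<R>,
  onProgress?: (done: number, total: number) => void,
): Promise<R[]> {
  const results: R[] = []
  for (const batch of chunk(items, size)) {
    const batchResults = await Promise.all(batch.map((item) => worker(item)))
    results.push(...batchResults)
    onProgress?.(results.length, items.length)
  }
  return results
}
```

The `onProgress` callback is where a "Converting page N of M" label could be written, at batch granularity rather than per page.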
On Vercel, the cleanest approach is a dedicated API route with maxDuration: 300 that acts as the processing worker. The flow:
importPdfDocument server action, which uploads the PDF to Storage and creates a job record in a pdf_import_jobs table (Supabase), then returns immediately.
fetch() to the internal processing route (/api/pdf/process) with the job ID. This is fire-and-forget -- the server action doesn't await the response.
/api/pdf/status?jobId=X every 2 seconds.
// Supabase table: pdf_import_jobs
{
  id: uuid (PK),
  user_id: uuid (FK -> auth.users),
  document_id: uuid | null,      // set when complete
  storage_path: text,            // path in Supabase Storage
  status: text,                  // 'pending' | 'rendering' | 'converting' | 'stitching' | 'complete' | 'failed'
  current_step: text,            // human-readable status
  pages_total: int,
  pages_processed: int,
  error: text | null,
  created_at: timestamptz,
  updated_at: timestamptz
}
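In TypeScript the job record might be typed as below, together with two small helpers: one that shapes a row for the polling client and one that applies the stale-job rule (older than 10 minutes and not complete). The response shape and both function names are assumptions; only the columns come from the table definition above.

```typescript
type JobStatus =
  | 'pending' | 'rendering' | 'converting' | 'stitching' | 'complete' | 'failed'

interface PdfImportJob {
  id: string
  user_id: string
  document_id: string | null
  storage_path: string
  status: JobStatus
  current_step: string
  pages_total: number
  pages_processed: number
  error: string | null
  created_at: string // ISO 8601 timestamp
  updated_at: string
}

// Hypothetical payload a status endpoint could return to the polling client.
function toStatusResponse(job: PdfImportJob) {
  const percent = job.pages_total > 0
    ? Math.round((job.pages_processed / job.pages_total) * 100)
    : 0
  return {
    status: job.status,
    label: job.current_step,
    percent,
    done: job.status === 'complete' || job.status === 'failed',
  }
}

// Stale rule: past the age limit and still in a non-terminal state.
// Already-failed jobs are skipped because there is nothing left to mark.
function isStale(job: PdfImportJob, nowMs: number, maxAgeMs = 10 * 60 * 1000): boolean {
  const finished = job.status === 'complete' || job.status === 'failed'
  return !finished && nowMs - Date.parse(job.created_at) > maxAgeMs
}
```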
The current pipeline rejects PDFs with no selectable text. The vision-based approach inherently handles scanned PDFs because GPT-4o can read text from the rendered page image.
Detection: If extractPdf() returns pages with no text paragraphs but the rendered PNGs clearly contain text (determined by the vision model's response), the PDF is treated as scanned.
Quality note: OCR accuracy depends on scan quality. Low-resolution scans may produce imperfect text. The system will flag these with a notice: "This PDF appears to be a scan. Text accuracy may vary."
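The detection rule can be expressed as a small predicate. The `PageSignals` shape and the 20-character threshold are assumptions made for this sketch; the notice text is the one quoted above.

```typescript
interface PageSignals {
  extractedParagraphs: number // paragraphs found by the text extractor
  visionTextLength: number // characters of text in the vision model's HTML
}

const SCAN_NOTICE = 'This PDF appears to be a scan. Text accuracy may vary.'

// A PDF is treated as scanned when no page has a text layer but the vision
// model still found text on at least one page.
function isScannedPdf(pages: PageSignals[], minVisionChars = 20): boolean {
  if (pages.length === 0) return false
  const noTextLayer = pages.every((p) => p.extractedParagraphs === 0)
  const visionSawText = pages.some((p) => p.visionTextLength >= minVisionChars)
  return noTextLayer && visionSawText
}

// Return the notice to display, or null when the PDF has a usable text layer.
function scanNotice(pages: PageSignals[]): string | null {
  return isScannedPdf(pages) ? SCAN_NOTICE : null
}
```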
Render any html-doc back to PDF for download or print. Two approaches:
window.print() with @page CSS rules. The published site view already renders clean HTML -- adding print-optimized CSS gets us 80% of the way with zero server cost.
@vercel/og's Satori engine or Puppeteer (via @sparticuz/chromium) to render HTML to PDF. Handles more complex layouts and embedded fonts. Requires a dedicated serverless function with higher memory. Note that Satori emits SVG (which @vercel/og rasterizes to PNG) rather than PDF, so of the two only Puppeteer produces a PDF directly.
Page size configuration: When exporting, the user selects a target page size (letter, A4, legal). The CSS @page rule is set accordingly, and the content is reflowed to fit. Documents imported from PDFs retain their original page size as the default.
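Generating the @page rule for the selected size could look like the sketch below. The names are hypothetical, and the final fallback to US Letter when neither a selection nor an original size exists is an assumption, not something the spec states.

```typescript
type PageSize = 'letter' | 'a4' | 'legal'

// CSS dimensions for each supported export size.
const PAGE_DIMENSIONS: Record<PageSize, string> = {
  letter: '8.5in 11in',
  a4: '210mm 297mm',
  legal: '8.5in 14in',
}

// Prefer the user's selection, then the size of the imported PDF.
// Falling back to 'letter' when both are missing is an assumed default.
function pickPageSize(selected?: PageSize, original?: PageSize): PageSize {
  return selected ?? original ?? 'letter'
}

// Build the @page rule for the chosen export size.
function pageRule(size: PageSize): string {
  return `@page { size: ${PAGE_DIMENSIONS[size]}; }`
}
```

For instance, `pageRule(pickPageSize(undefined, 'a4'))` produces `@page { size: 210mm 297mm; }` for a document that was imported from an A4 PDF.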
maxDuration: 300 (5 minutes). If it times out, the job record is left in its last-known state. A cleanup cron marks stale jobs (>10 min old, not complete) as failed.
maxDuration: 300 to the server action's route segment config
@napi-rs/canvas for server-side page rendering
lib/pdf/render.ts -- render PDF pages to PNG
lib/pdf/vision-convert.ts -- GPT-4o vision page-to-HTML
lib/pdf/stitch.ts -- merge page HTMLs into cohesive document
importPdfDocument to use new pipeline with text-extraction fallback
pdf_import_jobs table to Supabase
/api/pdf/process and /api/pdf/status
| Component | Cost per Page | 10-Page Paper | 50-Page Report |
|---|---|---|---|
| GPT-4o vision input (image) | ~$0.003 | $0.03 | $0.15 |
| GPT-4o vision input (text ref) | ~$0.001 | $0.01 | $0.05 |
| GPT-4o output (HTML) | ~$0.015 | $0.15 | $0.75 |
| Stitching pass | -- | $0.05 | $0.15 |
| Supabase Storage | negligible | ~$0.001 | ~$0.005 |
| Total | ~$0.02/page | ~$0.24 | ~$1.10 |
Roughly 2-5x the cost of the current text-only redesign, but the quality improvement is dramatic. For a product that positions itself as "the way agents share documents," this is table-stakes investment.
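The totals in the table above can be reproduced with a short calculation. The per-page rates are the table's own estimates; the storage rate of $0.0001 per page is derived from the ~$0.001 figure for 10 pages, and `estimateCostUsd` is a hypothetical helper.

```typescript
// Per-page rates in USD, taken from the cost table.
const RATES = {
  imageInput: 0.003,
  textInput: 0.001,
  htmlOutput: 0.015,
  storage: 0.0001, // derived: ~$0.001 per 10 pages
}

// The stitching pass is priced per document in the table, so it is passed in
// directly ($0.05 for the 10-page paper, $0.15 for the 50-page report).
function estimateCostUsd(pages: number, stitchingUsd: number): number {
  const perPage =
    RATES.imageInput + RATES.textInput + RATES.htmlOutput + RATES.storage
  return pages * perPage + stitchingUsd
}
```

With these inputs, a 10-page paper comes to about $0.24 and a 50-page report to about $1.10, matching the Total row.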
lib/pdf/
  extract.ts            # existing -- text + image extraction (keep as fallback)
  render.ts             # NEW -- render PDF pages to PNG via pdfjs + @napi-rs/canvas
  vision-convert.ts     # NEW -- send page images to GPT-4o, get back HTML
  stitch.ts             # NEW -- merge per-page HTMLs into cohesive document
  types.ts              # NEW -- shared types for the PDF pipeline
app/api/pdf/
  process/route.ts      # NEW (Phase 3) -- background processing worker
  status/route.ts       # NEW (Phase 3) -- job status polling endpoint
scripts/sql/
  pdf_import_jobs.sql   # NEW (Phase 3) -- job tracking table
Spec authored by Amika for html-docs.com. May 25, 2026. Implementation begins with the immediate crash fix, followed by the Phase 2 vision pipeline.