PDF Support Layer

Technical Specification for html-docs.com — v1.0, May 25 2026

Goal: Near 1:1 visual reproduction of PDF documents as editable HTML. Academic papers, reports, slide decks, and scanned documents should look like they came from the same printer, but in a live, editable web format.

Current state: PDF import exists but extracts raw text with pdfjs, then asks an LLM to "redesign" it. This loses all visual fidelity (layout, typography, spacing, columns) and times out on files over ~5 MB because the synchronous server-action pipeline exceeds Vercel's function timeout.

01 Architecture Overview

Client                  Upload                     Supabase Storage         Background Worker    Document
|                       |                          |                        |
|-- POST FormData ----->|                          |                        |
|   (raw PDF bytes)     |-- store PDF blob ------->|                        |
|                       |                          |                        |
|<- 200 { jobId, docId} |                          |                        |
|                       |                          |                        |
|                       |-- trigger processing --->|                        |
|                       |                          |-- render pages ------->|
|                       |                          |   (pdfjs -> PNG)       |
|                       |                          |                        |
|-- poll GET /status -->|                          |-- vision API --------->|
|<- { step, progress }  |                          |   (page image -> HTML) |
|                       |                          |                        |
|                       |                          |-- stitch & import ---->|
|<- { complete, docId } |                          |                        |

Key Design Decisions

Upload-first, process-later. The client uploads the raw PDF to Supabase Storage via a lightweight API route, gets back a job ID and placeholder document ID immediately. Processing happens in a separate long-running function (Vercel maxDuration: 300s).
Page-by-page vision. Each PDF page is rendered to a high-DPI PNG (2x scale, ~1224x1584px for letter-size), then sent to GPT-4o vision with a prompt that asks for faithful HTML reproduction of that single page. Pages are processed in parallel (batches of 4) to stay within rate limits while maximizing throughput.
Stitching pass. Individual page HTMLs are merged into a single cohesive document with consistent styling, cross-references, and a unified stylesheet. A final LLM pass normalizes the CSS and resolves any inter-page continuity issues (split paragraphs, table continuations).
Progress polling. Client polls a status endpoint every 2 seconds. Status includes: current step name, pages processed / total pages, estimated time remaining. No fake timers.
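
The batched parallelism in the page-by-page decision can be sketched roughly as follows. This is a minimal illustration rather than the project's code: toBatches and mapInBatches are hypothetical helper names, and the worker callback stands in for the real vision call.

```typescript
// Hypothetical helper: split a list into consecutive batches of at most `size`.
function toBatches<T>(items: readonly T[], size: number): T[][] {
  if (size < 1 || Math.floor(size) !== size) {
    throw new Error('batch size must be a positive integer')
  }
  const batches: T[][] = []
  for (let start = 0; start < items.length; start += size) {
    batches.push(items.slice(start, start + size))
  }
  return batches
}

// Hypothetical helper: items inside a batch run concurrently, batches run
// one after another, and results keep the input order.
async function mapInBatches<T, R>(
  items: readonly T[],
  size: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = []
  for (const batch of toBatches(items, size)) {
    const batchResults = await Promise.all(batch.map((item) => worker(item)))
    results.push(...batchResults)
  }
  return results
}
```

With a batch size of 4, a 10-page document would be converted in three rounds (4, 4, and 2 pages), so at most four vision requests are in flight at any time.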

02 PDF Page Rendering

Approach: pdfjs-dist + @napi-rs/canvas

pdfjs-dist supports server-side rendering via its page.render() API, but requires a Canvas implementation. @napi-rs/canvas provides prebuilt native binaries for linux-x64-gnu (Vercel's serverless runtime) without needing system-level dependencies like Cairo.

Fallback: If canvas rendering fails (e.g., exotic fonts, corrupt page), the page falls back to the existing text+image extraction pipeline. The model still receives the extracted content for that page, just without the rendered image as a visual layout reference.

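// lib/pdf/render.ts
import { createCanvas } from '@napi-rs/canvas'
import type { PDFPageProxy } from 'pdfjs-dist'

// Render one PDF page to a PNG buffer. A scale of 2.0 means two pixels per PDF point.
export async function renderPage(page: PDFPageProxy, scale: number = 2.0): Promise<Buffer> {
  const viewport = page.getViewport({ scale })
  // Viewport sizes can be fractional; canvas dimensions must be whole pixels.
  const canvas = createCanvas(Math.ceil(viewport.width), Math.ceil(viewport.height))
  const context = canvas.getContext('2d')

  // The @napi-rs/canvas context mirrors the DOM 2D context API but is typed
  // differently, so it is cast to the type pdfjs expects.
  await page.render({
    canvasContext: context as unknown as CanvasRenderingContext2D,
    viewport,
  }).promise

  return canvas.toBuffer('image/png')
}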

Scale factor: 2.0x gives good visual fidelity for vision model input without exceeding GPT-4o's image token budget (1224x1584px for letter-size, estimated at ~1200 tokens per page image at "high" detail).

Page Size Detection

Format	Points (W x H)	Rendered PNG @ 2x	CSS Equivalent
US Letter	612 x 792	1224 x 1584	8.5in x 11in
A4	595 x 842	1190 x 1684	210mm x 297mm
Legal	612 x 1008	1224 x 2016	8.5in x 14in
Custom	Arbitrary	Capped at 2000px wide	Computed from ratio

Page dimensions are read from page.getViewport() and stored per-page. The final HTML preserves these dimensions via CSS @page rules and max-width on the content container, so the document looks correct at all screen sizes and prints accurately.
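
The arithmetic behind the table can be sketched as below. renderSize and MAX_WIDTH_PX are hypothetical names, and applying the 2000px cap by lowering the scale (so the aspect ratio is preserved) is an assumption about how the cap would be enforced.

```typescript
interface RenderSize {
  width: number  // PNG width in pixels
  height: number // PNG height in pixels
  scale: number  // scale actually used after applying the width cap
}

// Cap taken from the "Custom" row of the table above.
const MAX_WIDTH_PX = 2000

// Convert PDF point dimensions to rendered pixel dimensions.
function renderSize(widthPt: number, heightPt: number, scale: number = 2.0): RenderSize {
  // Lower the scale when the scaled width would exceed the cap.
  const effectiveScale = Math.min(scale, MAX_WIDTH_PX / widthPt)
  return {
    width: Math.round(widthPt * effectiveScale),
    height: Math.round(heightPt * effectiveScale),
    scale: effectiveScale,
  }
}
```

For example, renderSize(612, 792) gives 1224 x 1584 (US Letter), while a 1500pt-wide custom page is rendered at 2000px wide instead of 3000px.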

03 Vision-Based HTML Reproduction

The Core Prompt

Each page image is sent to GPT-4o (or gpt-4.1 when available) with this instruction pattern:

System: You reproduce PDF pages as faithful, semantic HTML+CSS.
Your output must visually match the source image as closely as possible.

Rules:
- Use semantic HTML: h1-h6, p, table, figure, blockquote, ul/ol, etc.
- Reproduce the exact visual layout: column count, spacing, alignment
- Preserve all text verbatim -- do not summarize, rewrite, or omit
- For math/equations: use inline SVG or HTML entity approximations
- For figures: output a <figure> placeholder with [FIGURE_N] marker
  and the caption text. Figures will be injected in post-processing.
- Tables must reproduce the exact cell structure and content
- Use CSS Grid or Flexbox for multi-column layouts
- All styles must be inline or in a single <style> block
- Typography: match the source font style (serif/sans), size ratios,
  weight, and color as closely as possible
- No external dependencies. No JavaScript. No remote fonts.

User: [page image attached]
Page {N} of {total}. Page dimensions: {width}pt x {height}pt.
{extracted_text_for_reference}

The extracted text from pdfjs is included as a reference to help with OCR accuracy -- the vision model can cross-check its reading against the programmatic text extraction. This catches cases where the image is slightly blurry but the text layer is clean.
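
As a rough sketch, the prompt pattern above could be assembled into a request payload like this. The message shape follows OpenAI's Chat Completions format for image inputs; PageInput, buildUserText, and buildMessages are hypothetical names, and passing the page image as a base64 data URL is an assumption.

```typescript
interface PageInput {
  pageNumber: number
  totalPages: number
  widthPt: number
  heightPt: number
  extractedText: string // pdfjs text layer, used as a cross-check
  pngBase64: string     // rendered page image, already base64-encoded
}

// Fill in the "User:" part of the instruction pattern for one page.
function buildUserText(p: PageInput): string {
  return (
    `Page ${p.pageNumber} of ${p.totalPages}. ` +
    `Page dimensions: ${p.widthPt}pt x ${p.heightPt}pt.\n` +
    p.extractedText
  )
}

// Build the messages array: system rules, then text plus image for the page.
function buildMessages(systemPrompt: string, p: PageInput) {
  return [
    { role: 'system' as const, content: systemPrompt },
    {
      role: 'user' as const,
      content: [
        { type: 'text' as const, text: buildUserText(p) },
        {
          type: 'image_url' as const,
          image_url: {
            url: `data:image/png;base64,${p.pngBase64}`,
            detail: 'high' as const,
          },
        },
      ],
    },
  ]
}
```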

Multi-Column Layout Strategy

Academic papers commonly use two-column layouts. The vision model is instructed to detect and reproduce these using CSS Grid:

.two-column {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 1.5rem;
  max-width: 900px;
}
@media (max-width: 720px) {
  .two-column { grid-template-columns: 1fr; }
}

The stitching pass verifies column detection is consistent across pages and normalizes the grid definitions.

Table Extraction

Tables are reproduced by the vision model directly from the page image. The extracted text serves as a validation layer -- if the vision model's table content doesn't match the text extraction, the system flags it for a second pass. Complex tables (merged cells, nested headers) are handled by the vision model's spatial understanding of the image rather than programmatic detection.
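
One way the validation layer might work is sketched below. It is only an illustration: the function names, the regex-based cell extraction, and the 0.9 match threshold are all assumptions, and HTML entities are not decoded.

```typescript
// Collapse whitespace and lowercase so layout differences do not matter.
function normalizeText(s: string): string {
  return s.replace(/\s+/g, ' ').trim().toLowerCase()
}

// Pull the visible text out of every <td> and <th> in a table fragment.
function extractCellTexts(tableHtml: string): string[] {
  const cellPattern = /<t[dh]\b[^>]*>([\s\S]*?)<\/t[dh]>/gi
  const cells: string[] = []
  let match: RegExpExecArray | null
  while ((match = cellPattern.exec(tableHtml)) !== null) {
    const text = normalizeText(match[1].replace(/<[^>]+>/g, ' '))
    if (text.length > 0) cells.push(text)
  }
  return cells
}

// True when too few cells can be found in the programmatic text extraction.
function tableNeedsSecondPass(
  tableHtml: string,
  extractedText: string,
  minMatchRatio: number = 0.9, // assumed threshold
): boolean {
  const cells = extractCellTexts(tableHtml)
  // A table with no readable cells is treated as suspect.
  if (cells.length === 0) return true
  const reference = normalizeText(extractedText)
  const matched = cells.filter((cell) => reference.indexOf(cell) !== -1).length
  return matched / cells.length < minMatchRatio
}
```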

Math and Equations

Three-tier approach:

Simple inline math: HTML entities and Unicode (superscripts, subscripts, Greek letters, operators)
Display equations: Inline SVG generated by the vision model, reproducing the visual appearance
Complex notation: The page image section containing the equation is cropped and embedded as an <img> with appropriate alt text

Future: Add MathML or KaTeX rendering as a post-processing option when the user opts in.

04 Figure and Image Handling

The current pipeline extracts raster images from the PDF's internal stream using pdfjs's operator list. This captures figures, charts, and photos but misses vector graphics and complex composited images.

New approach — hybrid extraction:

Raster extraction (existing): Pull embedded PNGs/JPEGs from the PDF stream. These are the source-of-truth for photographic content.
Rendered crop (new): For each [FIGURE_N] placeholder the vision model outputs, identify the bounding box on the rendered page image and crop that region. This captures vector graphics, charts, and composited content that raster extraction misses.
Priority: Use extracted raster when available (higher quality, no rendering artifacts). Fall back to rendered crop for everything else.
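
The priority rule above, combined with the [FIGURE_N] markers from the prompt, could look roughly like this in post-processing. FigureSources, pickFigureUrl, and injectFigures are hypothetical names, and leaving unresolved markers in place is an assumed behavior.

```typescript
interface FigureSources {
  rasterUrl?: string // image pulled from the PDF stream, if one was found
  cropUrl?: string   // region cropped from the rendered page image
  alt: string
}

// Prefer the extracted raster; fall back to the rendered crop.
function pickFigureUrl(figure: FigureSources): string | undefined {
  return figure.rasterUrl !== undefined ? figure.rasterUrl : figure.cropUrl
}

// Escape a value for use inside a double-quoted HTML attribute.
function escapeAttr(value: string): string {
  return value.replace(/&/g, '&amp;').replace(/"/g, '&quot;').replace(/</g, '&lt;')
}

// Replace each [FIGURE_N] marker with an <img>; markers without a source stay as-is.
function injectFigures(html: string, figures: Record<number, FigureSources>): string {
  return html.replace(/\[FIGURE_(\d+)\]/g, (marker: string, n: string) => {
    const figure = figures[Number(n)]
    const url = figure ? pickFigureUrl(figure) : undefined
    if (!figure || url === undefined) return marker
    return `<img src="${escapeAttr(url)}" alt="${escapeAttr(figure.alt)}">`
  })
}
```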

All images are uploaded to Supabase Storage and referenced by public URL in the final HTML, same as current pipeline.

05 Processing Pipeline (Detailed)

Step	Action	Duration (est.)	Status Label
1	Upload PDF to Supabase Storage	1-3s	"Uploading PDF"
2	Parse PDF metadata (page count, title, dimensions)	1-2s	"Reading PDF"
3	Render pages to PNG (parallel, batches of 8)	2-10s	"Rendering page N of M"
4	Extract text + raster images (existing pipeline)	2-5s	"Extracting content"
5	Vision API calls (parallel, batches of 4)	15-90s	"Converting page N of M"
6	Upload extracted images to Storage	2-5s	"Processing figures"
7	Stitch pages + normalize CSS	5-15s	"Assembling document"
8	Create document + editable regions	1-2s	"Finalizing"

Total estimated time: roughly 30-130 seconds for a typical 10-page academic paper (the sum of the step estimates above). The client sees real progress at every step.

Background Processing Architecture

On Vercel, the cleanest approach is a dedicated API route with maxDuration: 300 that acts as the processing worker. The flow:

Client calls importPdfDocument server action, which uploads the PDF to Storage and creates a job record in a pdf_import_jobs table (Supabase), then returns immediately.
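Server action fires a fetch() to the internal processing route (/api/pdf/process) with the job ID. This is fire-and-forget: the server action doesn't await the response. Because a serverless function can be frozen as soon as its response is sent, an un-awaited request is not guaranteed to leave the process; wrapping the call in waitUntil() from @vercel/functions is one way to make sure it is dispatched.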
Processing route does the heavy work (render, vision, stitch, import). Updates the job record at each step.
Client polls /api/pdf/status?jobId=X every 2 seconds.
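
The client side of this flow might be sketched as below. The JobStatus response shape, the pollJob name, and the maxPolls cap are assumptions; the status fetcher is injected so the loop can be exercised without a network.

```typescript
interface JobStatus {
  status: 'pending' | 'rendering' | 'converting' | 'stitching' | 'complete' | 'failed'
  step: string
  pagesProcessed: number
  pagesTotal: number
  docId?: string
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Poll until the job reaches a terminal state, reporting every status on the way.
async function pollJob(
  fetchStatus: () => Promise<JobStatus>,
  onProgress: (status: JobStatus) => void,
  intervalMs: number = 2000,
  maxPolls: number = 300, // assumed cap (about 10 minutes at 2s); not from the spec
): Promise<JobStatus> {
  for (let i = 0; i < maxPolls; i++) {
    const status = await fetchStatus()
    onProgress(status)
    if (status.status === 'complete' || status.status === 'failed') return status
    await sleep(intervalMs)
  }
  throw new Error('gave up polling: job did not finish in time')
}
```

In the browser, fetchStatus could be something like () => fetch('/api/pdf/status?jobId=' + jobId).then((r) => r.json()), assuming the status route returns the shape above.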

// Supabase table: pdf_import_jobs
{
  id: uuid (PK),
  user_id: uuid (FK -> auth.users),
  document_id: uuid | null,     // set when complete
  storage_path: text,           // path in Supabase Storage
  status: text,                 // 'pending' | 'rendering' | 'converting' | 'stitching' | 'complete' | 'failed'
  current_step: text,           // human-readable status
  pages_total: int,
  pages_processed: int,
  error: text | null,
  created_at: timestamptz,
  updated_at: timestamptz
}
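
For reference, the same record expressed as TypeScript types, with a sketch of how the status endpoint might derive its response from a row. StatusResponse and toStatusResponse are hypothetical, and timestamps are assumed to arrive as ISO strings.

```typescript
type JobState = 'pending' | 'rendering' | 'converting' | 'stitching' | 'complete' | 'failed'

// Mirrors the pdf_import_jobs columns listed above.
interface PdfImportJob {
  id: string
  user_id: string
  document_id: string | null
  storage_path: string
  status: JobState
  current_step: string
  pages_total: number
  pages_processed: number
  error: string | null
  created_at: string
  updated_at: string
}

interface StatusResponse {
  status: JobState
  step: string
  progress: number // 0..1, based on pages processed
  docId: string | null
  error: string | null
}

// Map a job row to the payload the client polls for.
function toStatusResponse(job: PdfImportJob): StatusResponse {
  const ratio = job.pages_total > 0 ? job.pages_processed / job.pages_total : 0
  return {
    status: job.status,
    step: job.current_step,
    progress: Math.max(0, Math.min(1, ratio)),
    docId: job.status === 'complete' ? job.document_id : null,
    error: job.status === 'failed' ? job.error : null,
  }
}
```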

06 Scanned PDF / OCR Fallback

The current pipeline rejects PDFs with no selectable text. The vision-based approach inherently handles scanned PDFs because GPT-4o can read text from the rendered page image.

Detection: If extractPdf() returns pages with no text paragraphs but the rendered PNGs clearly contain text (determined by the vision model's response), the PDF is treated as scanned.
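
A minimal sketch of that detection rule, assuming the worker has both the extracted text and the visible text of the vision output for each page. PageSignals, isLikelyScanned, and the 40-character threshold are hypothetical.

```typescript
interface PageSignals {
  extractedText: string // text layer from the existing extraction step
  visionText: string    // visible text of the vision model's HTML, tags removed
}

// Scanned = no page has a text layer, yet the vision model read text somewhere.
function isLikelyScanned(pages: PageSignals[], minVisionChars: number = 40): boolean {
  if (pages.length === 0) return false
  const noTextLayer = pages.every((p) => p.extractedText.trim().length === 0)
  const visionSawText = pages.some((p) => p.visionText.trim().length >= minVisionChars)
  return noTextLayer && visionSawText
}
```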

Quality note: OCR accuracy depends on scan quality. Low-resolution scans may produce imperfect text. The system will flag these with a notice: "This PDF appears to be a scan. Text accuracy may vary."

07 PDF Export (HTML to PDF)

Render any html-doc back to PDF for download or print. Two approaches:

Browser-side (preferred for V1): Use window.print() with @page CSS rules. The published site view already renders clean HTML -- adding print-optimized CSS gets us 80% of the way with zero server cost.
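Server-side (V2): Use Puppeteer (via @sparticuz/chromium) to render HTML to PDF. Handles more complex layouts and embedded fonts. Requires a dedicated serverless function with higher memory. (@vercel/og's Satori engine is not a direct alternative: it converts HTML/CSS to SVG for image generation and supports only a subset of CSS, so it would need a separate SVG-to-PDF step.)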

Page size configuration: When exporting, the user selects a target page size (letter, A4, legal). The CSS @page rule is set accordingly, and the content is reflowed to fit. Documents imported from PDFs retain their original page size as the default.
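
The @page rule for a chosen export size could be generated along these lines. pageRule is a hypothetical helper and the default margin is an assumption; letter, A4, and legal are standard CSS page-size keywords.

```typescript
type NamedSize = 'letter' | 'a4' | 'legal'

// Original dimensions of an imported PDF, in points.
interface CustomSize {
  widthPt: number
  heightPt: number
}

const CSS_SIZE_KEYWORD: Record<NamedSize, string> = {
  letter: 'letter',
  a4: 'A4',
  legal: 'legal',
}

// Build the @page rule for a named size or for an imported document's own size.
function pageRule(size: NamedSize | CustomSize, margin: string = '0.75in'): string {
  const value =
    typeof size === 'string'
      ? CSS_SIZE_KEYWORD[size]
      : `${size.widthPt}pt ${size.heightPt}pt`
  return `@page { size: ${value}; margin: ${margin}; }`
}
```

A document imported from a legal-size PDF would default to pageRule({ widthPt: 612, heightPt: 1008 }) until the user picks another size.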

08 Error Handling and Resilience

Function timeout: Processing route has maxDuration: 300 (5 minutes). If it times out, the job record is left in its last-known state. A cleanup cron marks stale jobs (>10 min old, not complete) as failed.
Vision API failures: Individual page failures are retried once. If a page fails twice, it falls back to the text-extraction redesign for that page only. Other pages still use vision output.
Corrupt PDFs: pdfjs parse failure returns a clear error. Partially corrupt PDFs (some pages readable) process the good pages and mark the rest as "[Page could not be processed]".
Memory pressure: Pages are rendered and converted one batch at a time, and each batch's PNG buffers are released before the next batch starts, so the full set of page images is never held in memory at once.
Rate limiting: GPT-4o vision has per-minute rate limits. The batch-of-4 parallelism with backoff handles this. Very long documents (50+ pages) may hit rate limits and take longer.
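
Two of the rules above, retry-once-then-fall-back and stale-job detection, can be sketched as follows. All names are hypothetical, backoff between attempts is omitted for brevity, and measuring job age from created_at is an assumption.

```typescript
interface PageResult {
  html: string
  source: 'vision' | 'text-fallback'
}

// Try the vision conversion twice (first attempt plus one retry),
// then fall back to the text-extraction path for this page only.
async function convertPageWithFallback(
  visionConvert: () => Promise<string>,
  textFallback: () => Promise<string>,
): Promise<PageResult> {
  const maxAttempts = 2
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { html: await visionConvert(), source: 'vision' }
    } catch {
      // A real worker would log the error and back off before retrying.
    }
  }
  return { html: await textFallback(), source: 'text-fallback' }
}

const STALE_AFTER_MS = 10 * 60 * 1000 // the 10-minute threshold from the text

// A job is stale when it is older than the threshold and not in a terminal state.
function isStaleJob(createdAt: Date, status: string, now: Date): boolean {
  const unfinished = status !== 'complete' && status !== 'failed'
  return unfinished && now.getTime() - createdAt.getTime() > STALE_AFTER_MS
}
```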

09 Migration Path

Phase 1: Immediate Fix (today)

Add maxDuration: 300 to the server action's route segment config
Increase the OpenAI timeout to handle larger documents
Add better error messages for timeout failures
This fixes the crash on the 7.3 MB PDF without changing the architecture

Phase 2: Vision Pipeline (this week)

Add @napi-rs/canvas for server-side page rendering
New lib/pdf/render.ts -- render PDF pages to PNG
New lib/pdf/vision-convert.ts -- GPT-4o vision page-to-HTML
New lib/pdf/stitch.ts -- merge page HTMLs into cohesive document
Update importPdfDocument to use new pipeline with text-extraction fallback
Keep processing synchronous in the server action for now (maxDuration: 300 gives enough room)

Phase 3: Async Processing (next week)

Add pdf_import_jobs table to Supabase
New API routes: /api/pdf/process and /api/pdf/status
Update client to upload-then-poll pattern
Real progress tracking in the UI

Phase 4: Polish (ongoing)

PDF export (HTML to PDF)
Scanned PDF OCR quality improvements
Math/equation rendering with KaTeX
Batch import (multiple PDFs)
Page size selection in export dialog

10 Cost Analysis

Component	Cost per Page	10-Page Paper	50-Page Report
GPT-4o vision input (image)	~$0.003	$0.03	$0.15
GPT-4o vision input (text ref)	~$0.001	$0.01	$0.05
GPT-4o output (HTML)	~$0.015	$0.15	$0.75
Stitching pass	--	$0.05	$0.15
Supabase Storage	negligible	~$0.001	~$0.005
Total	~$0.02/page	~$0.24	~$1.10

Roughly 2-5x the cost of the current text-only redesign, but the quality improvement is dramatic. For a product that positions itself as "the way agents share documents," this is a table-stakes investment.
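
The totals in the table can be reproduced with a small calculation. The per-page figures are copied from the table (storage is taken as ~$0.0001 per page, which matches both columns); estimateImportCostUsd is a hypothetical helper, and the stitching cost is passed in because the table does not give it per page.

```typescript
// Per-page estimates in USD, taken from the cost table.
const PER_PAGE_USD = {
  visionImageInput: 0.003,
  textReferenceInput: 0.001,
  htmlOutput: 0.015,
  storage: 0.0001,
}

// Estimated cost of importing one document.
function estimateImportCostUsd(pages: number, stitchingUsd: number): number {
  const perPage =
    PER_PAGE_USD.visionImageInput +
    PER_PAGE_USD.textReferenceInput +
    PER_PAGE_USD.htmlOutput +
    PER_PAGE_USD.storage
  return pages * perPage + stitchingUsd
}
```

estimateImportCostUsd(10, 0.05) comes to about $0.24 and estimateImportCostUsd(50, 0.15) to about $1.10, in line with the Total row.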

11 File Structure

lib/pdf/
  extract.ts        # existing -- text + image extraction (keep as fallback)
  render.ts         # NEW -- render PDF pages to PNG via pdfjs + @napi-rs/canvas
  vision-convert.ts # NEW -- send page images to GPT-4o, get back HTML
  stitch.ts         # NEW -- merge per-page HTMLs into cohesive document
  types.ts          # NEW -- shared types for the PDF pipeline

app/api/pdf/
  process/route.ts  # NEW (Phase 3) -- background processing worker
  status/route.ts   # NEW (Phase 3) -- job status polling endpoint

scripts/sql/
  pdf_import_jobs.sql # NEW (Phase 3) -- job tracking table

Spec authored by Amika for html-docs.com. May 25, 2026. Implementation begins with the immediate crash fix, followed by Phase 2 vision pipeline.