PDF Support Layer

Technical Specification for html-docs.com, v1.0, May 25, 2026

Goal: Near 1:1 visual reproduction of PDF documents as editable HTML. Academic papers, reports, slide decks, and scanned documents should look like they came from the same printer, but in a live, editable web format.

Current state: PDF import exists but extracts raw text with pdfjs, then asks an LLM to "redesign" it. This loses all visual fidelity (layout, typography, spacing, columns) and times out on files over ~5 MB due to the synchronous server-action pipeline running inside Vercel's function timeout.

01 Architecture Overview

Client      Upload      Supabase Storage      Background Worker      Document

|                       |                          |                        |
|-- POST FormData ----->|                          |                        |
|   (raw PDF bytes)     |-- store PDF blob ------->|                        |
|                       |                          |                        |
|<- 200 { jobId, docId} |                          |                        |
|                       |                          |                        |
|                       |-- trigger processing --->|                        |
|                       |                          |-- render pages ------->|
|                       |                          |   (pdfjs -> PNG)       |
|                       |                          |                        |
|-- poll GET /status -->|                          |-- vision API --------->|
|<- { step, progress }  |                          |   (page image -> HTML) |
|                       |                          |                        |
|                       |                          |-- stitch & import ---->|
|<- { complete, docId } |                          |                        |

Key Design Decisions

02 PDF Page Rendering

Approach: pdfjs-dist + @napi-rs/canvas

pdfjs-dist supports server-side rendering via its page.render() API, but requires a Canvas implementation. @napi-rs/canvas provides prebuilt native binaries for linux-x64-gnu (Vercel's serverless runtime) without needing system-level dependencies like Cairo.

Fallback: If canvas rendering fails (e.g., exotic fonts, corrupt page), the page falls back to the existing text+image extraction pipeline. The vision API still gets the extracted content, just without the visual layout reference.

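// lib/pdf/render.ts
import { createCanvas } from '@napi-rs/canvas'
import type { PDFPageProxy } from 'pdfjs-dist'

// Render one PDF page to a PNG buffer at the given scale (2.0 = 144 DPI).
export async function renderPage(page: PDFPageProxy, scale: number = 2.0): Promise<Buffer> {
  const viewport = page.getViewport({ scale })
  // Canvas dimensions must be whole pixels; A4 and custom sizes are fractional.
  const canvas = createCanvas(Math.ceil(viewport.width), Math.ceil(viewport.height))
  const context = canvas.getContext('2d')

  // pdfjs types expect a DOM context; the @napi-rs/canvas context is API-compatible.
  await page.render({
    canvasContext: context as unknown as CanvasRenderingContext2D,
    viewport,
  }).promise

  return canvas.toBuffer('image/png')
}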

Scale factor: 2.0x gives good visual fidelity for vision model input without exceeding GPT-4o's image token budget. A letter-size page renders to 1224 x 1584 px at this scale, budgeted at ~1200 tokens per page image at "high" detail.
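The token budget above can be sanity-checked with a small estimator. This is a sketch that assumes OpenAI's published tile-based accounting for "high" detail images (fit within 2048 x 2048, shrink the shortest side to 768 px, then 85 base tokens plus 170 per 512 px tile). The function name and constants are illustrative, and the accounting may differ for other models.

```typescript
// Assumed "high" detail accounting; all constants are assumptions taken
// from OpenAI's published guidance, not from this spec.
const MAX_SIDE_PX = 2048
const SHORT_SIDE_TARGET_PX = 768
const TILE_PX = 512
const BASE_TOKENS = 85
const TOKENS_PER_TILE = 170

function estimateImageTokens(widthPx: number, heightPx: number): number {
  // Step 1: fit inside a 2048 x 2048 square (downscale only).
  const fit = Math.min(1, MAX_SIDE_PX / Math.max(widthPx, heightPx))
  let w = widthPx * fit
  let h = heightPx * fit

  // Step 2: shrink so the shortest side is at most 768 px.
  const shrink = Math.min(1, SHORT_SIDE_TARGET_PX / Math.min(w, h))
  w *= shrink
  h *= shrink

  // Step 3: count 512 px tiles and add the fixed base cost.
  const tiles = Math.ceil(w / TILE_PX) * Math.ceil(h / TILE_PX)
  return BASE_TOKENS + TOKENS_PER_TILE * tiles
}
```

Under these assumptions a 1224 x 1584 letter render comes to 765 tokens, and A4 or legal renders come to 1105, so all three standard sizes stay within the ~1200-token budget.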

Page Size Detection

Format      Points (W x H)   Rendered PNG @ 2x       CSS Equivalent
US Letter   612 x 792        1224 x 1584             8.5in x 11in
A4          595 x 842        1190 x 1684             210mm x 297mm
Legal       612 x 1008       1224 x 2016             8.5in x 14in
Custom      Arbitrary        Capped at 2000px wide   Computed from ratio

Page dimensions are read from page.getViewport() and stored per-page. The final HTML preserves these dimensions via CSS @page rules and max-width on the content container, so the document looks correct at all screen sizes and prints accurately.
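A minimal sketch of the size detection described above, assuming the page width and height in points are already known (for example from a viewport at scale 1). The names describePage and KNOWN_FORMATS and the 2 pt matching tolerance are hypothetical; the 2000 px cap and the format list come from the table.

```typescript
type PageFormat = 'US Letter' | 'A4' | 'Legal' | 'Custom'

interface KnownFormat { format: PageFormat; widthPt: number; heightPt: number; css: string }

interface PageDescription {
  format: PageFormat
  scale: number      // render scale actually used after the width cap
  widthPx: number
  heightPx: number
  css: string        // CSS equivalent of the page size
}

const KNOWN_FORMATS: KnownFormat[] = [
  { format: 'US Letter', widthPt: 612, heightPt: 792, css: '8.5in x 11in' },
  { format: 'A4', widthPt: 595, heightPt: 842, css: '210mm x 297mm' },
  { format: 'Legal', widthPt: 612, heightPt: 1008, css: '8.5in x 14in' },
]

const MAX_RENDER_WIDTH_PX = 2000 // cap from the table above
const TOLERANCE_PT = 2           // assumption: absorbs rounding such as 595.28 pt for A4

function describePage(widthPt: number, heightPt: number, scale: number = 2.0): PageDescription {
  // Lower the scale when the rendered width would exceed the cap.
  const cappedScale = Math.min(scale, MAX_RENDER_WIDTH_PX / widthPt)
  const known = KNOWN_FORMATS.find(
    (f) => Math.abs(f.widthPt - widthPt) <= TOLERANCE_PT && Math.abs(f.heightPt - heightPt) <= TOLERANCE_PT,
  )
  return {
    format: known ? known.format : 'Custom',
    scale: cappedScale,
    widthPx: Math.round(widthPt * cappedScale),
    heightPx: Math.round(heightPt * cappedScale),
    // Custom sizes are converted from points to inches (72 pt = 1 in).
    css: known ? known.css : `${(widthPt / 72).toFixed(2)}in x ${(heightPt / 72).toFixed(2)}in`,
  }
}
```

This sketch only matches portrait orientation; a landscape page falls through to Custom, which still yields correct dimensions.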

03 Vision-Based HTML Reproduction

The Core Prompt

Each page image is sent to GPT-4o (or gpt-4.1 when available) with this instruction pattern:

System: You reproduce PDF pages as faithful, semantic HTML+CSS.
Your output must visually match the source image as closely as possible.

Rules:
- Use semantic HTML: h1-h6, p, table, figure, blockquote, ul/ol, etc.
- Reproduce the exact visual layout: column count, spacing, alignment
- Preserve all text verbatim -- do not summarize, rewrite, or omit
- For math/equations: use inline SVG or HTML entity approximations
- For figures: output a <figure> placeholder with [FIGURE_N] marker
  and the caption text. Figures will be injected in post-processing.
- Tables must reproduce the exact cell structure and content
- Use CSS Grid or Flexbox for multi-column layouts
- All styles must be inline or in a single <style> block
- Typography: match the source font style (serif/sans), size ratios,
  weight, and color as closely as possible
- No external dependencies. No JavaScript. No remote fonts.

User: [page image attached]
Page {N} of {total}. Page dimensions: {width}pt x {height}pt.
{extracted_text_for_reference}

The extracted text from pdfjs is included as a reference to help with OCR accuracy -- the vision model can cross-check its reading against the programmatic text extraction. This catches cases where the image is slightly blurry but the text layer is clean.
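The request for one page could be assembled roughly as follows. This is a sketch that assumes the OpenAI chat-completions message format with an image passed as a base64 data URL; buildVisionMessages and the PageInput shape are hypothetical names, and the actual API call is left out.

```typescript
interface PageInput {
  pageNumber: number
  totalPages: number
  widthPt: number
  heightPt: number
  pngBase64: string     // rendered page image, base64-encoded
  extractedText: string // pdfjs text layer; may be empty for scans
}

type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string; detail: 'high' | 'low' | 'auto' } }

interface ChatMessage {
  role: 'system' | 'user'
  content: string | ContentPart[]
}

function buildVisionMessages(systemPrompt: string, page: PageInput): ChatMessage[] {
  // Page header line followed by the reference text, skipping empty parts.
  const userText = [
    `Page ${page.pageNumber} of ${page.totalPages}. Page dimensions: ${page.widthPt}pt x ${page.heightPt}pt.`,
    page.extractedText,
  ]
    .filter((part) => part.length > 0)
    .join('\n')

  return [
    { role: 'system', content: systemPrompt },
    {
      role: 'user',
      content: [
        { type: 'image_url', image_url: { url: `data:image/png;base64,${page.pngBase64}`, detail: 'high' } },
        { type: 'text', text: userText },
      ],
    },
  ]
}
```

Keeping the builder free of network calls makes it easy to unit-test the prompt layout separately from the vision request itself.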

Multi-Column Layout Strategy

Academic papers commonly use two-column layouts. The vision model is instructed to detect and reproduce these using CSS Grid:

.two-column {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 1.5rem;
  max-width: 900px;
}
@media (max-width: 720px) {
  .two-column { grid-template-columns: 1fr; }
}

The stitching pass verifies column detection is consistent across pages and normalizes the grid definitions.
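One way the consistency check could work is a simple majority vote over per-page column counts. This is a sketch only: checkColumnConsistency is a hypothetical helper, and it assumes the column count for each page has already been derived from that page's HTML.

```typescript
interface ColumnCheck {
  dominant: number       // most common column count across pages
  outlierPages: number[] // 1-based page numbers that disagree with it
}

function checkColumnConsistency(columnsPerPage: number[]): ColumnCheck {
  // Tally how many pages use each column count.
  const frequency = new Map<number, number>()
  for (const count of columnsPerPage) {
    frequency.set(count, (frequency.get(count) ?? 0) + 1)
  }

  // Pick the most frequent count; on a tie the count seen first wins.
  let dominant = 1
  let best = 0
  frequency.forEach((pages, count) => {
    if (pages > best) {
      best = pages
      dominant = count
    }
  })

  // Collect pages whose layout differs from the dominant one.
  const outlierPages: number[] = []
  columnsPerPage.forEach((count, index) => {
    if (count !== dominant) outlierPages.push(index + 1)
  })
  return { dominant, outlierPages }
}
```

Outliers are better treated as candidates for review than as errors, since title pages and full-width figures are often legitimately single-column.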

Table Extraction

Tables are reproduced by the vision model directly from the page image. The extracted text serves as a validation layer -- if the vision model's table content doesn't match the text extraction, the system flags it for a second pass. Complex tables (merged cells, nested headers) are handled by the vision model's spatial understanding of the image rather than programmatic detection.
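The validation step could be approximated as below. This is a sketch: it assumes that "doesn't match" means too few table cells can be found in the extracted text, and tableNeedsSecondPass and the 0.9 threshold are hypothetical. The regex-based cell extraction does not decode HTML entities or handle nested tables.

```typescript
// Collapse whitespace and lowercase so minor formatting differences do not count.
function normalizeText(s: string): string {
  return s.replace(/\s+/g, ' ').trim().toLowerCase()
}

// Pull the text content of every <td>/<th> out of an HTML fragment.
function tableCellTexts(html: string): string[] {
  const cells: string[] = []
  const cellPattern = /<t[dh]\b[^>]*>([\s\S]*?)<\/t[dh]>/gi
  let match: RegExpExecArray | null
  while ((match = cellPattern.exec(html)) !== null) {
    // Strip any inline tags inside the cell before normalizing.
    const text = normalizeText((match[1] ?? '').replace(/<[^>]+>/g, ' '))
    if (text.length > 0) cells.push(text)
  }
  return cells
}

// True when too few cells can be found in the pdfjs text extraction.
function tableNeedsSecondPass(tableHtml: string, extractedText: string, minMatchRatio: number = 0.9): boolean {
  const reference = normalizeText(extractedText)
  const cells = tableCellTexts(tableHtml)
  if (cells.length === 0) return false
  const matched = cells.filter((cell) => reference.includes(cell)).length
  return matched / cells.length < minMatchRatio
}
```

For scanned PDFs the extracted text is empty, so this check would have to be skipped rather than treated as a failure.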

Math and Equations

Three-tier approach:

  1. Simple inline math: HTML entities and Unicode (superscripts, subscripts, Greek letters, operators)
  2. Display equations: Inline SVG generated by the vision model, reproducing the visual appearance
  3. Complex notation: The page image section containing the equation is cropped and embedded as an <img> with appropriate alt text

Future: Add MathML or KaTeX rendering as a post-processing option when the user opts in.

04 Figure and Image Handling

The current pipeline extracts raster images from the PDF's internal stream using pdfjs's operator list. This captures figures, charts, and photos but misses vector graphics and complex composited images.

New approach — hybrid extraction:

  1. Raster extraction (existing): Pull embedded PNGs/JPEGs from the PDF stream. These are the source-of-truth for photographic content.
  2. Rendered crop (new): For each [FIGURE_N] placeholder the vision model outputs, identify the bounding box on the rendered page image and crop that region. This captures vector graphics, charts, and composited content that raster extraction misses.
  3. Priority: Use extracted raster when available (higher quality, no rendering artifacts). Fall back to rendered crop for everything else.
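The priority rule can be sketched as a small selection function. It assumes each [FIGURE_N] placeholder has already been paired with an optional extracted raster URL and an optional bounding box in rendered-page pixels. How that pairing is done is not specified here, and all type and function names are hypothetical.

```typescript
interface Rect { x: number; y: number; width: number; height: number }

interface FigureCandidate {
  marker: string     // e.g. "[FIGURE_1]"
  rasterUrl?: string // set when raster extraction found a matching image
  bboxPx?: Rect      // set when a bounding box was located on the rendered page
}

type FigureSource =
  | { kind: 'raster'; url: string }
  | { kind: 'crop'; rect: Rect }
  | { kind: 'missing' }

// Snap the box to whole pixels and keep it inside the rendered page.
function clampToPage(rect: Rect, pageWidthPx: number, pageHeightPx: number): Rect {
  const x = Math.max(0, Math.floor(rect.x))
  const y = Math.max(0, Math.floor(rect.y))
  const right = Math.min(pageWidthPx, Math.ceil(rect.x + rect.width))
  const bottom = Math.min(pageHeightPx, Math.ceil(rect.y + rect.height))
  return { x, y, width: Math.max(0, right - x), height: Math.max(0, bottom - y) }
}

function chooseFigureSource(fig: FigureCandidate, pageWidthPx: number, pageHeightPx: number): FigureSource {
  // Extracted raster wins: higher quality, no rendering artifacts.
  if (fig.rasterUrl) return { kind: 'raster', url: fig.rasterUrl }
  // Otherwise fall back to cropping the rendered page image.
  if (fig.bboxPx) {
    const rect = clampToPage(fig.bboxPx, pageWidthPx, pageHeightPx)
    if (rect.width > 0 && rect.height > 0) return { kind: 'crop', rect }
  }
  return { kind: 'missing' }
}
```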

All images are uploaded to Supabase Storage and referenced by public URL in the final HTML, same as current pipeline.

05 Processing Pipeline (Detailed)

Step  Action                                               Duration (est.)  Status Label
1     Upload PDF to Supabase Storage                       1-3s             "Uploading PDF"
2     Parse PDF metadata (page count, title, dimensions)   1-2s             "Reading PDF"
3     Render pages to PNG (parallel, batches of 8)         2-10s            "Rendering page N of M"
4     Extract text + raster images (existing pipeline)     2-5s             "Extracting content"
5     Vision API calls (parallel, batches of 4)            15-90s           "Converting page N of M"
6     Upload extracted images to Storage                   2-5s             "Processing figures"
7     Stitch pages + normalize CSS                         5-15s            "Assembling document"
8     Create document + editable regions                   1-2s             "Finalizing"

Total estimated time: 30-120 seconds for a typical 10-page academic paper. The client sees real progress at every step.
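The "parallel, batches of N" steps could share one helper along these lines. This is a sketch: mapInBatches is a hypothetical name, and the progress callback stands in for whatever updates the job record.

```typescript
// Run fn over items with at most batchSize calls in flight, preserving order.
async function mapInBatches<T, R>(
  items: readonly T[],
  batchSize: number,
  fn: (item: T, index: number) => Promise<R>,
  onProgress?: (done: number, total: number) => void,
): Promise<R[]> {
  if (batchSize < 1) throw new Error('batchSize must be at least 1')
  const results: R[] = []
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize)
    // Each batch runs in parallel; the next batch starts only when this one finishes.
    const batchResults = await Promise.all(batch.map((item, i) => fn(item, start + i)))
    results.push(...batchResults)
    if (onProgress) onProgress(results.length, items.length)
  }
  return results
}
```

With a batch size of 8 for rendering and 4 for vision calls, the same helper covers steps 3 and 5. A rejected promise fails the whole batch, so per-page fallbacks would need to be handled inside fn.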

Background Processing Architecture

On Vercel, the cleanest approach is a dedicated API route with maxDuration: 300 that acts as the processing worker. The flow:

  1. Client calls importPdfDocument server action, which uploads the PDF to Storage and creates a job record in a pdf_import_jobs table (Supabase), then returns immediately.
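  2. Server action fires a fetch() to the internal processing route (/api/pdf/process) with the job ID. This is fire-and-forget -- the server action doesn't await the response. Because a serverless function can be frozen as soon as it returns, the un-awaited request should be registered with the platform's post-response hook (for example waitUntil from @vercel/functions) so that it is actually sent.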
  3. Processing route does the heavy work (render, vision, stitch, import). Updates the job record at each step.
  4. Client polls /api/pdf/status?jobId=X every 2 seconds.

// Supabase table: pdf_import_jobs
{
  id: uuid (PK),
  user_id: uuid (FK -> auth.users),
  document_id: uuid | null,     // set when complete
  storage_path: text,           // path in Supabase Storage
  status: text,                 // 'pending' | 'rendering' | 'converting' | 'stitching' | 'complete' | 'failed'
  current_step: text,           // human-readable status
  pages_total: int,
  pages_processed: int,
  error: text | null,
  created_at: timestamptz,
  updated_at: timestamptz
}
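The status endpoint could map a job row to the polling payloads shown in the architecture diagram ({ step, progress } and { complete, docId }). This is a sketch that assumes progress is the fraction of pages processed; the TypeScript types mirror the table above, and toStatusResponse and the failed branch are hypothetical.

```typescript
type JobStatus = 'pending' | 'rendering' | 'converting' | 'stitching' | 'complete' | 'failed'

interface PdfImportJob {
  id: string
  user_id: string
  document_id: string | null
  storage_path: string
  status: JobStatus
  current_step: string
  pages_total: number
  pages_processed: number
  error: string | null
  created_at: string
  updated_at: string
}

type StatusResponse =
  | { complete: true; docId: string }
  | { complete: false; failed: true; error: string }
  | { complete: false; failed: false; step: string; progress: number }

function toStatusResponse(job: PdfImportJob): StatusResponse {
  if (job.status === 'complete' && job.document_id) {
    return { complete: true, docId: job.document_id }
  }
  if (job.status === 'failed') {
    return { complete: false, failed: true, error: job.error ?? 'Unknown error' }
  }
  // Progress is a 0..1 fraction; 0 until the page count is known.
  const progress = job.pages_total > 0 ? Math.min(1, job.pages_processed / job.pages_total) : 0
  return { complete: false, failed: false, step: job.current_step, progress }
}
```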

06 Scanned PDF / OCR Fallback

The current pipeline rejects PDFs with no selectable text. The vision-based approach inherently handles scanned PDFs because GPT-4o can read text from the rendered page image.

Detection: If extractPdf() returns pages with no text paragraphs but the rendered PNGs clearly contain text (determined by the vision model's response), the PDF is treated as scanned.

Quality note: OCR accuracy depends on scan quality. Low-resolution scans may produce imperfect text. The system will flag these with a notice: "This PDF appears to be a scan. Text accuracy may vary."
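The detection rule might reduce to a check like the one below. It assumes per-page character counts are available for both the pdfjs text layer and the vision model's output; isLikelyScanned and the 50-character threshold are hypothetical.

```typescript
interface PageTextSignals {
  extractedChars: number // characters found in the pdfjs text layer
  visionChars: number    // characters of text in the vision model's HTML
}

const SCAN_NOTICE = 'This PDF appears to be a scan. Text accuracy may vary.'

// Scanned = no page has a text layer, yet the vision model read text somewhere.
function isLikelyScanned(pages: PageTextSignals[], minVisionChars: number = 50): boolean {
  if (pages.length === 0) return false
  const noTextLayer = pages.every((p) => p.extractedChars === 0)
  const visionFoundText = pages.some((p) => p.visionChars >= minVisionChars)
  return noTextLayer && visionFoundText
}

// Returns the user-facing notice, or null when no notice is needed.
function scanNoticeFor(pages: PageTextSignals[]): string | null {
  return isLikelyScanned(pages) ? SCAN_NOTICE : null
}
```

A PDF that mixes scanned and born-digital pages would not trip this check. A per-page variant is a straightforward extension if that case matters.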

07 PDF Export (HTML to PDF)

Render any html-doc back to PDF for download or print. Two approaches:

  1. Browser-side (preferred for V1): Use window.print() with @page CSS rules. The published site view already renders clean HTML -- adding print-optimized CSS gets us 80% of the way with zero server cost.
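  2. Server-side (V2): Use Puppeteer (via @sparticuz/chromium) to render HTML to PDF with page.pdf(). Handles more complex layouts and embedded fonts. Requires a dedicated serverless function with higher memory. @vercel/og's Satori engine is a lighter alternative, but it emits SVG/PNG from a limited CSS subset rather than PDF, so it would need a separate image-to-PDF step.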

Page size configuration: When exporting, the user selects a target page size (letter, A4, legal). The CSS @page rule is set accordingly, and the content is reflowed to fit. Documents imported from PDFs retain their original page size as the default.
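Generating the @page rule for the selected size could look like this. This is a sketch: pageRule and the default margin are hypothetical, while the letter, A4, and legal keywords and explicit dimensions are standard values for the CSS size descriptor.

```typescript
type ExportSize = 'letter' | 'A4' | 'legal' | { widthPt: number; heightPt: number }

// Build the CSS @page rule for a named size or for explicit dimensions in points.
function pageRule(size: ExportSize, marginIn: number = 0.5): string {
  const sizeValue = typeof size === 'string' ? size : `${size.widthPt}pt ${size.heightPt}pt`
  return `@page { size: ${sizeValue}; margin: ${marginIn}in; }`
}
```

For a document imported from a PDF, passing the stored per-page dimensions (for example { widthPt: 612, heightPt: 792 }) keeps the original page size as the default, while the named keywords cover the user-selectable targets.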

08 Error Handling and Resilience

09 Migration Path

Phase 1: Immediate Fix (today)

Phase 2: Vision Pipeline (this week)

Phase 3: Async Processing (next week)

Phase 4: Polish (ongoing)

10 Cost Analysis

Component                        Cost per Page  10-Page Paper  50-Page Report
GPT-4o vision input (image)      ~$0.003        $0.03          $0.15
GPT-4o vision input (text ref)   ~$0.001        $0.01          $0.05
GPT-4o output (HTML)             ~$0.015        $0.15          $0.75
Stitching pass                   --             $0.05          $0.15
Supabase Storage                 negligible     ~$0.001        ~$0.005
Total                            ~$0.02/page    ~$0.24         ~$1.10
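The totals can be reproduced from the per-page figures with a short calculation. This is a sketch in which the per-page rates are the table's estimates and the stitching and storage amounts are passed in per document; estimateCostUsd is a hypothetical helper.

```typescript
// Per-page estimates from the table above, in USD.
const PER_PAGE_USD = {
  imageInput: 0.003, // GPT-4o vision input (image)
  textInput: 0.001,  // GPT-4o vision input (text ref)
  htmlOutput: 0.015, // GPT-4o output (HTML)
}

// Per-page costs scale with page count; stitching and storage are per document.
function estimateCostUsd(pages: number, stitchingUsd: number, storageUsd: number = 0): number {
  const perPage = PER_PAGE_USD.imageInput + PER_PAGE_USD.textInput + PER_PAGE_USD.htmlOutput
  return pages * perPage + stitchingUsd + storageUsd
}
```

For the 10-page paper this gives 10 x 0.019 + 0.05 + 0.001 = 0.241, and for the 50-page report 50 x 0.019 + 0.15 + 0.005 = 1.105, matching the rounded totals of ~$0.24 and ~$1.10.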

Roughly 2-5x the cost of the current text-only redesign, but the quality improvement is dramatic. For a product that positions itself as "the way agents share documents," this is table-stakes investment.

11 File Structure

lib/pdf/
  extract.ts        # existing -- text + image extraction (keep as fallback)
  render.ts         # NEW -- render PDF pages to PNG via pdfjs + @napi-rs/canvas
  vision-convert.ts # NEW -- send page images to GPT-4o, get back HTML
  stitch.ts         # NEW -- merge per-page HTMLs into cohesive document
  types.ts          # NEW -- shared types for the PDF pipeline

app/api/pdf/
  process/route.ts  # NEW (Phase 3) -- background processing worker
  status/route.ts   # NEW (Phase 3) -- job status polling endpoint

scripts/sql/
  pdf_import_jobs.sql # NEW (Phase 3) -- job tracking table

Spec authored by Amika for html-docs.com. May 25, 2026. Implementation begins with the immediate crash fix, followed by Phase 2 vision pipeline.