Implementation Spec

Rich Paste for Import Dialog

Accept HTML from clipboard when users paste from Google Docs, Notion, Word, or any webpage into the import flow.

2 new files, 1 modified file, ~200 lines

What Changes

The Paste tab currently uses a <textarea>, which only captures text/plain from the clipboard. When a user copies formatted content from Google Docs and pastes, all formatting is lost.

The fix: replace the textarea with a contenteditable div that intercepts the paste event and extracts text/html from the clipboard. A sanitizer cleans the HTML before passing it downstream. The UI stays visually identical -- same border, same placeholder, same sizing.

User-Facing Behavior

Paste from Google Docs, Notion, or Word -- headings, bold, italic, lists, tables, and links are preserved.
Paste raw text or code -- works exactly as before.
A subtle "Rich content detected" indicator appears when HTML is captured, with an option to switch to plain text view.

File 1: Clipboard HTML Sanitizer

lib/clipboard-sanitize.ts

Strips vendor wrappers, dangerous content, and unnecessary styling while preserving semantic structure.

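/**
 * Sanitize HTML pasted from clipboard.
 * Handles Google Docs, Notion, Word, and generic browser clipboard HTML.
 * Preserves: headings, paragraphs, lists, tables, links, code, and any
 *   bold/italic expressed as semantic tags.
 * Strips: scripts, styles, event handlers, vendor wrappers, inline styles,
 *   and class/id/data-* attributes.
 * Note: <img> tags (including tracking pixels) are not stripped; see Edge Cases.
 */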

export function sanitizeClipboardHtml(raw: string): {
  html: string;
  source: 'google-docs' | 'notion' | 'word' | 'generic';
} {
  const source = detectSource(raw);
  let html = raw;

  // 1. Strip vendor wrappers
  html = stripVendorWrappers(html, source);

  // 2. Remove dangerous content
  html = html.replace(/<script[\s\S]*?<\/script>/gi, '');
  html = html.replace(/<style[\s\S]*?<\/style>/gi, '');
  // Event handlers with double-quoted, single-quoted, or unquoted values
  html = html.replace(/\son\w+\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/gi, '');
  html = html.replace(/<meta[^>]*>/gi, '');
  html = html.replace(/<link[^>]*>/gi, '');

  // 3. Strip all inline styles (keep structure, lose presentation)
  html = html.replace(/\sstyle=("[^"]*"|'[^']*')/gi, '');

  // 4. Strip class, id, dir, role, and data-* attributes (vendor noise)
  html = html.replace(
    /\s(class|id|dir|role|data-[\w-]+)=("[^"]*"|'[^']*')/gi,
    ''
  );

  // 5. Unwrap attribute-less spans left behind; repeat until stable so
  //    nested spans are unwrapped too
  let previous: string;
  do {
    previous = html;
    html = html.replace(/<span>([\s\S]*?)<\/span>/gi, '$1');
  } while (html !== previous);

  // 6. Normalize whitespace
  html = html.replace(/\n{3,}/g, '\n\n').trim();

  return { html, source };
}

function detectSource(
  html: string
): 'google-docs' | 'notion' | 'word' | 'generic' {
  if (html.includes('docs-internal-guid') ||
      html.includes('google-sheets-html-origin'))
    return 'google-docs';
  if (html.includes('notion-') ||
      html.includes('data-block-id'))
    return 'notion';
  if (html.includes('urn:schemas-microsoft-com:office') ||
      html.includes('MsoNormal'))
    return 'word';
  return 'generic';
}

function stripVendorWrappers(html: string, source: string): string {
  if (source === 'google-docs') {
    // Google Docs wraps everything in <b id="docs-internal-guid-...">
    html = html.replace(
      /<b[^>]*docs-internal-guid[^>]*>([\s\S]*?)<\/b>/gi,
      '$1'
    );
    // Google Sheets wrapper
    html = html.replace(
      /<google-sheets-html-origin>([\s\S]*?)<\/google-sheets-html-origin>/gi,
      '$1'
    );
  }

  if (source === 'word') {
    // Strip Word's conditional comments and namespaced tags (o:, v:, w:)
    html = html.replace(/<!--\[if[\s\S]*?<!\[endif\]-->/gi, '');
    html = html.replace(/<\/?o:[^>]*>/gi, '');
    html = html.replace(/<\/?v:[^>]*>/gi, '');
    html = html.replace(/<\/?w:[^>]*>/gi, '');
  }

  return html;
}
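
The sanitizer above removes inline on* handlers but leaves href and src values untouched, so a pasted javascript: link would survive. A minimal sketch of an extra pass that could run alongside step 2, assuming a hypothetical helper named stripUnsafeUrls (not part of this spec) and assuming that only the javascript: and vbscript: schemes need to be rejected:

```typescript
// Hypothetical helper, not part of the original spec.
// Drops href/src attributes whose value starts with a script-capable scheme.
const UNSAFE_SCHEME = /^\s*(javascript|vbscript):/i;

export function stripUnsafeUrls(html: string): string {
  return html.replace(
    // Group 3: double-quoted value, group 4: single-quoted, group 5: unquoted
    /\s(href|src)\s*=\s*("([^"]*)"|'([^']*)'|([^\s>]+))/gi,
    (
      match: string,
      _name: string,
      _raw: string,
      dq?: string,
      sq?: string,
      bare?: string
    ) => {
      const value = dq ?? sq ?? bare ?? '';
      // Remove the whole attribute when the scheme is unsafe; keep it otherwise
      return UNSAFE_SCHEME.test(value) ? '' : match;
    }
  );
}
```

Like the rest of the sanitizer, this is regex-based and best-effort. It does not decode HTML entities inside attribute values, so it should not be the only line of defense if the HTML is ever rendered.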

File 2: Rich Paste Component

components/import/rich-paste-area.tsx

Drop-in replacement for the textarea. Looks identical, intercepts paste, extracts HTML when available.

'use client';

import { useRef, useState, useCallback } from 'react';
import { sanitizeClipboardHtml } from '@/lib/clipboard-sanitize';

interface RichPasteAreaProps {
  value: string;
  onChange: (value: string, isHtml: boolean) => void;
  placeholder?: string;
  className?: string;
}

export function RichPasteArea({
  value, onChange, placeholder, className
}: RichPasteAreaProps) {
  const [isRichContent, setIsRichContent] = useState(false);
  const [source, setSource] = useState<string | null>(null);
  const editableRef = useRef<HTMLDivElement>(null);
  // text/plain from the most recent paste, kept for "Use plain text"
  const plainTextRef = useRef('');

  const handlePaste = useCallback((e: React.ClipboardEvent) => {
    e.preventDefault();

    const clipboardHtml = e.clipboardData.getData('text/html');
    const clipboardText = e.clipboardData.getData('text/plain');
    plainTextRef.current = clipboardText;

    if (clipboardHtml && clipboardHtml.trim().length > 0) {
      // Rich content detected -- sanitize and use HTML
      const { html, source: detectedSource } =
        sanitizeClipboardHtml(clipboardHtml);
      setIsRichContent(true);
      setSource(detectedSource);
      onChange(html, true);

      // Show the sanitized HTML source as text (innerText never parses markup)
      if (editableRef.current) {
        editableRef.current.innerText = html;
      }
    } else {
      // Plain text fallback
      setIsRichContent(false);
      setSource(null);
      onChange(clipboardText, false);

      if (editableRef.current) {
        editableRef.current.innerText = clipboardText;
      }
    }
  }, [onChange]);

  const handleInput = useCallback(() => {
    if (editableRef.current) {
      const text = editableRef.current.innerText;
      onChange(text, false);

      // If user types after pasting, revert to plain text mode
      if (isRichContent) {
        setIsRichContent(false);
        setSource(null);
      }
    }
  }, [onChange, isRichContent]);

  const sourceLabel: Record<string, string> = {
    'google-docs': 'Google Docs',
    'notion': 'Notion',
    'word': 'Microsoft Word',
    'generic': 'Rich content',
  };

  return (
    <div className="relative">
      <div
        ref={editableRef}
        contentEditable
        onPaste={handlePaste}
        onInput={handleInput}
        data-placeholder={placeholder}
        className={`
          min-h-[200px] max-h-[400px] overflow-y-auto
          w-full rounded-xl border border-border bg-background
          px-4 py-3 text-sm font-mono
          focus:outline-none focus:ring-2 focus:ring-ring
          empty:before:content-[attr(data-placeholder)]
          empty:before:text-muted-foreground
          ${className ?? ''}
        `}
      />

      {isRichContent && (
        <div className="absolute top-2 right-2 flex items-center
              gap-2 rounded-lg bg-primary/10 px-2.5 py-1 text-xs
              font-medium text-primary">
          <span>
            {sourceLabel[source ?? 'generic']} detected
          </span>
          <button
            type="button"
            onClick={() => {
              if (editableRef.current) {
                // Use the clipboard's text/plain payload; the editable area
                // currently shows HTML source, which is not plain text
                const plain =
                  plainTextRef.current || editableRef.current.innerText;
                editableRef.current.innerText = plain;
                onChange(plain, false);
                setIsRichContent(false);
                setSource(null);
              }
            }}
            className="text-muted-foreground hover:text-foreground
                     underline underline-offset-2"
          >
            Use plain text
          </button>
        </div>
      )}
    </div>
  );
}
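
The branch inside handlePaste can also be expressed as a pure function, which makes the rich/plain decision unit-testable without a DOM. The sketch below is an assumption, not part of the spec: the name decidePaste and the extra 'empty' outcome for whitespace-only clipboards are hypothetical.

```typescript
// Hypothetical helper mirroring the if/else in handlePaste.
// The 'empty' kind is an assumption added for whitespace-only clipboards.
export type PasteDecision =
  | { kind: 'html'; value: string }
  | { kind: 'text'; value: string }
  | { kind: 'empty' };

export function decidePaste(
  clipboardHtml: string,
  clipboardText: string
): PasteDecision {
  // Prefer HTML whenever the clipboard carries a non-blank text/html payload
  if (clipboardHtml.trim().length > 0) {
    return { kind: 'html', value: clipboardHtml };
  }
  // Otherwise fall back to text/plain, unchanged
  if (clipboardText.trim().length > 0) {
    return { kind: 'text', value: clipboardText };
  }
  return { kind: 'empty' };
}
```

The component would still own sanitizing and state updates; this only isolates which payload wins.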

File 3: Integration

Existing import dialog -- Paste tab

Swap the textarea for the new component. The isHtml flag tells the downstream processing whether to treat the input as HTML or plain text.

// Before
<textarea
  value={pasteContent}
  onChange={(e) => setPasteContent(e.target.value)}
  placeholder="Paste your content here..."
  className="min-h-[200px] w-full rounded-xl border ..."
/>

// After
<RichPasteArea
  value={pasteContent}
  onChange={(content, isHtml) => {
    setPasteContent(content);
    setContentIsHtml(isHtml);
  }}
  placeholder="Paste your content here..."
/>

// Then in the submit handler, pass the flag downstream:
const contentType = contentIsHtml ? 'html' : 'text';
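
As a rough illustration of passing the flag downstream, the submit handler could build a small payload object. The payload shape and the name buildImportPayload are assumptions; the spec does not show the real downstream API.

```typescript
// Hypothetical payload shape; the actual import endpoint is not defined here.
export interface ImportPayload {
  content: string;
  contentType: 'html' | 'text';
}

export function buildImportPayload(
  content: string,
  contentIsHtml: boolean
): ImportPayload {
  // Same mapping as the contentType line above, packaged with the content
  return { content, contentType: contentIsHtml ? 'html' : 'text' };
}
```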

What the Sanitizer Handles

Google Docs

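Wraps everything in <b id="docs-internal-guid-...">. Uses <span style="..."> for inline formatting (font-weight for bold, font-style for italic). The sanitizer strips the wrapper, drops inline styles, and relies on the structural tags that Docs also emits (<h1>, <ul>, <a>, etc.). Tables come through as real <table> elements. Because bold and italic exist only as span styles, they are lost when styles are stripped unless they are first converted to semantic tags (see Optional Enhancement).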

Notion

Clean HTML with data-block-id attributes on each block. Headings, lists, and toggles are well-structured. Main cleanup is stripping the data attributes and Notion-specific class names. Notion also puts a text/plain version on the clipboard, so checking for text/html first matters -- otherwise users would get the unformatted version.

Microsoft Word

The messiest source. Includes XML namespace declarations (o:, v:, w:), conditional comments (<!--[if gte mso 9]>), and MsoNormal class names on everything. The sanitizer strips all of this. Formatting comes through as inline styles rather than semantic tags, so after style-stripping, bold/italic may be lost. An enhancement would be to convert font-weight: bold to <strong> before stripping styles.

Generic Web Pages

Wildly variable. The strategy is conservative: strip all styles, classes, IDs, and non-semantic attributes. What survives is the document structure (headings, paragraphs, lists, tables, links, emphasis). This is good enough for most cases and avoids injecting someone else's CSS into the doc.
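
One way to make "what survives is the document structure" explicit is a tag allowlist applied after attribute stripping. This is a sketch under assumptions: the helper name dropNonSemanticTags is hypothetical, and the tag set is inferred from the "Preserves" list in the sanitizer's doc comment rather than specified anywhere.

```typescript
// Hypothetical allowlist; adjust to whatever the downstream importer accepts.
const ALLOWED_TAGS = new Set([
  'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'br',
  'ul', 'ol', 'li',
  'table', 'thead', 'tbody', 'tr', 'th', 'td',
  'a', 'strong', 'b', 'em', 'i', 'code', 'pre',
]);

export function dropNonSemanticTags(html: string): string {
  // Matches opening and closing tags; the tag name is captured in group 1.
  // Tags outside the allowlist are removed, their text content is kept.
  // Caveat: an attribute value containing ">" would confuse this regex.
  return html.replace(
    /<\/?([a-z][a-z0-9-]*)\b[^>]*>/gi,
    (tag: string, name: string) =>
      ALLOWED_TAGS.has(name.toLowerCase()) ? tag : ''
  );
}
```

With this pass, wrappers such as <div>, <span>, and <font> disappear while their text stays in place.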

Edge Cases

User pastes raw HTML source code (e.g., from VS Code) -- the clipboard has text/html, but it is a syntax-highlighted rendering of the code, not the code itself. VS Code puts the original source in text/plain, so when the detected source is generic and the plain text starts with <, prefer plain text. Alternatively, let the "Use plain text" button handle it.
User types after pasting -- the component reverts to plain text mode on any input event after a rich paste. This prevents a confusing state where half the content is HTML and half is plain text.
Multiple pastes -- each paste replaces the previous content (same as the current textarea behavior). If you want append behavior, the handlePaste function would need to merge rather than replace.
Images in clipboard HTML -- pasted HTML may carry images, in some cases as base64 data: URIs. These survive the sanitizer and can be large. For v1, stripping <img> tags is the safest default; image support can be added later.
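
The first and last edge cases above can be sketched as two small helpers. Both names (shouldPreferPlainText, stripImages) are hypothetical, and the "starts with <" check is the heuristic described in the list, not a tested rule.

```typescript
type ClipboardSource = 'google-docs' | 'notion' | 'word' | 'generic';

// Heuristic for pasted source code: a generic source whose text/plain
// payload looks like markup is probably code copied from an editor.
export function shouldPreferPlainText(
  source: ClipboardSource,
  plainText: string
): boolean {
  return source === 'generic' && /^\s*</.test(plainText);
}

// v1 default for images: drop <img> tags entirely, including data: URIs.
export function stripImages(html: string): string {
  return html.replace(/<img\b[^>]*>/gi, '');
}
```

In handlePaste, shouldPreferPlainText would be checked after sanitizing (since that is where the source is detected), and stripImages would fit next to the other removals in step 2 of the sanitizer.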

Optional Enhancement: Style-to-Semantic Conversion

For Word and some Google Docs content, formatting is expressed as inline styles rather than semantic HTML. A conversion pass could turn these into semantic tags before the sanitizer strips styles:

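function inlineStylesToSemantic(html: string): string {
  // Rewrite each styled <span> according to its style attribute.
  // A span that is both bold and italic becomes <strong><em>...</em></strong>.
  return html.replace(
    /<span[^>]*\sstyle="([^"]*)"[^>]*>([\s\S]*?)<\/span>/gi,
    (match: string, style: string, inner: string) => {
      // font-weight: bold or 700-900 counts as bold
      const bold = /font-weight:\s*(bold|[7-9]00)/i.test(style);
      const italic = /font-style:\s*italic/i.test(style);
      if (!bold && !italic) return match; // leave other spans untouched

      let out = inner;
      if (italic) out = `<em>${out}</em>`;
      if (bold) out = `<strong>${out}</strong>`;
      return out;
    }
  );
}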

// Call this BEFORE stripping inline styles in sanitizeClipboardHtml

This ensures bold and italic survive the sanitization pass even from sources that don't use semantic tags.

Testing Checklist

Paste from Google Docs with headings, bold, lists, and a table
Paste from Notion with toggles and callouts
Paste from Word with complex formatting
Paste plain text (should work identically to current behavior)
Paste raw HTML source code from an editor
Click "Use plain text" after a rich paste
Type after pasting (should revert to plain text mode)
Paste content with images (should strip or handle gracefully)
Paste an empty clipboard or whitespace-only content