Implementation Spec

Rich Paste for Import Dialog

Accept HTML from clipboard when users paste from Google Docs, Notion, Word, or any webpage into the import flow.

2 new files, 1 modified file, ~200 lines

What Changes

The Paste tab currently uses a <textarea>, which only captures text/plain from the clipboard. When a user copies formatted content from Google Docs and pastes, all formatting is lost.

The fix: replace the textarea with a contenteditable div that intercepts the paste event and extracts text/html from the clipboard. A sanitizer cleans the HTML before passing it downstream. The UI stays visually identical -- same border, same placeholder, same sizing.
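The choice between the two clipboard flavors can be sketched on its own, outside React. The names `ClipboardReader`, `PastePayload`, `pickClipboardPayload`, and `fakeClipboard` below are hypothetical and not part of this spec. The sketch only assumes an object with a `getData` method, which is the shape `e.clipboardData` has in a paste handler.

```typescript
// Minimal slice of DataTransfer used here; e.clipboardData satisfies it.
interface ClipboardReader {
  getData(format: string): string;
}

interface PastePayload {
  content: string;
  isHtml: boolean;
}

// Hypothetical helper: prefer text/html when present, else fall back to text/plain.
function pickClipboardPayload(clipboard: ClipboardReader): PastePayload {
  const html = clipboard.getData('text/html');
  if (html.trim().length > 0) {
    return { content: html, isHtml: true };
  }
  return { content: clipboard.getData('text/plain'), isHtml: false };
}

// Stand-in clipboard for illustration; getData returns '' for missing formats,
// mirroring how DataTransfer behaves.
function fakeClipboard(data: Record<string, string>): ClipboardReader {
  return { getData: (format: string) => data[format] ?? '' };
}

const richPaste = pickClipboardPayload(
  fakeClipboard({ 'text/html': '<p>Hi</p>', 'text/plain': 'Hi' })
);
const plainPaste = pickClipboardPayload(fakeClipboard({ 'text/plain': 'Hi' }));
```

In this sketch the HTML is returned unsanitized; in the spec the sanitizer runs on it before anything is passed downstream.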

User-Facing Behavior

Paste from Google Docs, Notion, or Word -- headings, bold, italic, lists, tables, and links are preserved. Paste raw text or code -- works exactly as before. A subtle "Rich content detected" indicator appears when HTML is captured, with an option to switch to plain text view.


File 1: Clipboard HTML Sanitizer

lib/clipboard-sanitize.ts

Strips vendor wrappers, dangerous content, and unnecessary styling while preserving semantic structure.

    export type ClipboardSource = 'google-docs' | 'notion' | 'word' | 'generic';

    /**
     * Sanitize HTML pasted from clipboard.
     * Handles Google Docs, Notion, Word, and generic browser clipboard HTML.
     * Preserves: headings, paragraphs, lists, tables, links, bold, italic, code.
     * Strips: scripts, styles, event handlers, vendor wrappers, meta/link tags.
     */
    export function sanitizeClipboardHtml(raw: string): {
      html: string;
      source: ClipboardSource;
    } {
      const source = detectSource(raw);
      let html = raw;

      // 1. Strip vendor wrappers
      html = stripVendorWrappers(html, source);

      // 2. Remove dangerous content
      html = html.replace(/<script[\s\S]*?<\/script>/gi, '');
      html = html.replace(/<style[\s\S]*?<\/style>/gi, '');
      // Event handlers with double-quoted, single-quoted, or unquoted values
      html = html.replace(/\son\w+\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/gi, '');
      html = html.replace(/<meta[^>]*>/gi, '');
      html = html.replace(/<link[^>]*>/gi, '');

      // 3. Strip all inline styles (keep structure, lose presentation)
      html = html.replace(/\sstyle\s*=\s*("[^"]*"|'[^']*')/gi, '');

      // 4. Strip class/id and other vendor-noise attributes
      html = html.replace(
        /\s(class|id|dir|role|data-[\w-]+)\s*=\s*("[^"]*"|'[^']*')/gi,
        ''
      );

      // 5. Unwrap attribute-less spans left behind, innermost first,
      //    repeating so nested spans are unwrapped as well
      let previous: string;
      do {
        previous = html;
        html = html.replace(/<span>((?:(?!<\/?span\b)[\s\S])*?)<\/span>/gi, '$1');
      } while (html !== previous);

      // 6. Normalize whitespace
      html = html.replace(/\n{3,}/g, '\n\n').trim();

      return { html, source };
    }

    function detectSource(html: string): ClipboardSource {
      if (html.includes('docs-internal-guid') || html.includes('google-sheets-html-origin'))
        return 'google-docs';
      if (html.includes('notion-') || html.includes('data-block-id'))
        return 'notion';
      if (html.includes('urn:schemas-microsoft-com:office') || html.includes('MsoNormal'))
        return 'word';
      return 'generic';
    }

    function stripVendorWrappers(html: string, source: ClipboardSource): string {
      if (source === 'google-docs') {
        // Google Docs wraps everything in <b id="docs-internal-guid-...">
        html = html.replace(
          /<b[^>]*docs-internal-guid[^>]*>([\s\S]*?)<\/b>/gi,
          '$1'
        );
        // Google Sheets wrapper
        html = html.replace(
          /<google-sheets-html-origin>([\s\S]*?)<\/google-sheets-html-origin>/gi,
          '$1'
        );
      }
      if (source === 'word') {
        // Strip Word's conditional comments and namespaced (o:, v:, w:) tags
        html = html.replace(/<!--\[if[\s\S]*?<!\[endif\]-->/gi, '');
        html = html.replace(/<\/?o:[^>]*>/gi, '');
        html = html.replace(/<\/?v:[^>]*>/gi, '');
        html = html.replace(/<\/?w:[^>]*>/gi, '');
      }
      return html;
    }

File 2: Rich Paste Component

components/import/rich-paste-area.tsx

Drop-in replacement for the textarea. Looks identical, intercepts paste, extracts HTML when available.

    'use client';

    import { useRef, useState, useCallback, useEffect, type ClipboardEvent } from 'react';
    import { sanitizeClipboardHtml } from '@/lib/clipboard-sanitize';

    interface RichPasteAreaProps {
      value: string;
      onChange: (value: string, isHtml: boolean) => void;
      placeholder?: string;
      className?: string;
    }

    export function RichPasteArea({ value, onChange, placeholder, className }: RichPasteAreaProps) {
      const [isRichContent, setIsRichContent] = useState(false);
      const [source, setSource] = useState<string | null>(null);
      const editableRef = useRef<HTMLDivElement>(null);
      // Plain-text flavor of the last paste, kept for the "Use plain text" button
      const plainTextRef = useRef('');

      // The div is uncontrolled; mirror the value prop only when the parent clears it
      useEffect(() => {
        const el = editableRef.current;
        if (value === '' && el && el.innerText !== '') {
          el.innerText = '';
          setIsRichContent(false);
          setSource(null);
        }
      }, [value]);

      const handlePaste = useCallback((e: ClipboardEvent<HTMLDivElement>) => {
        e.preventDefault();
        const clipboardHtml = e.clipboardData.getData('text/html');
        const clipboardText = e.clipboardData.getData('text/plain');
        plainTextRef.current = clipboardText;

        if (clipboardHtml.trim().length > 0) {
          // Rich content detected -- sanitize and use HTML
          const { html, source: detectedSource } = sanitizeClipboardHtml(clipboardHtml);
          setIsRichContent(true);
          setSource(detectedSource);
          onChange(html, true);
          // Show the sanitized HTML source as text in the editable area
          if (editableRef.current) {
            editableRef.current.innerText = html;
          }
        } else {
          // Plain text fallback
          setIsRichContent(false);
          setSource(null);
          onChange(clipboardText, false);
          if (editableRef.current) {
            editableRef.current.innerText = clipboardText;
          }
        }
      }, [onChange]);

      const handleInput = useCallback(() => {
        if (editableRef.current) {
          const text = editableRef.current.innerText;
          onChange(text, false);
          // If user types after pasting, revert to plain text mode
          if (isRichContent) {
            setIsRichContent(false);
            setSource(null);
          }
        }
      }, [onChange, isRichContent]);

      const sourceLabel: Record<string, string> = {
        'google-docs': 'Google Docs',
        'notion': 'Notion',
        'word': 'Microsoft Word',
        'generic': 'Rich content',
      };

      return (
        <div className="relative">
          <div
            ref={editableRef}
            contentEditable
            onPaste={handlePaste}
            onInput={handleInput}
            data-placeholder={placeholder}
            className={`
              min-h-[200px] max-h-[400px] overflow-y-auto
              w-full rounded-xl border border-border bg-background
              px-4 py-3 text-sm font-mono
              focus:outline-none focus:ring-2 focus:ring-ring
              empty:before:content-[attr(data-placeholder)]
              empty:before:text-muted-foreground
              ${className ?? ''}
            `}
          />
          {isRichContent && (
            <div className="absolute top-2 right-2 flex items-center gap-2 rounded-lg bg-primary/10 px-2.5 py-1 text-xs font-medium text-primary">
              <span>
                {sourceLabel[source ?? 'generic']} detected
              </span>
              <button
                type="button"
                onClick={() => {
                  // Switch to the clipboard's text/plain flavor, not the HTML source on screen
                  const plain = plainTextRef.current;
                  if (editableRef.current) {
                    editableRef.current.innerText = plain;
                  }
                  onChange(plain, false);
                  setIsRichContent(false);
                  setSource(null);
                }}
                className="text-muted-foreground hover:text-foreground underline underline-offset-2"
              >
                Use plain text
              </button>
            </div>
          )}
        </div>
      );
    }

File 3: Integration

Existing import dialog -- Paste tab

Swap the textarea for the new component. The isHtml flag tells the downstream processing whether to treat the input as HTML or plain text.

    // Before
    <textarea
      value={pasteContent}
      onChange={(e) => setPasteContent(e.target.value)}
      placeholder="Paste your content here..."
      className="min-h-[200px] w-full rounded-xl border ..."
    />

    // After (contentIsHtml is new state declared next to pasteContent)
    const [contentIsHtml, setContentIsHtml] = useState(false);

    <RichPasteArea
      value={pasteContent}
      onChange={(content, isHtml) => {
        setPasteContent(content);
        setContentIsHtml(isHtml);
      }}
      placeholder="Paste your content here..."
    />

    // Then in the submit handler, pass the flag downstream:
    const contentType = contentIsHtml ? 'html' : 'text';
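As a rough sketch of what "pass the flag downstream" could look like, the helper below bundles the content with its type. The `ImportRequest` shape and the `buildImportRequest` name are assumptions made for illustration; the spec does not define the downstream contract.

```typescript
type ContentType = 'html' | 'text';

// Hypothetical request shape; the real downstream contract is not defined in this spec.
interface ImportRequest {
  content: string;
  contentType: ContentType;
}

// Map the boolean flag from RichPasteArea's onChange to an explicit content type.
function buildImportRequest(pasteContent: string, contentIsHtml: boolean): ImportRequest {
  return {
    content: pasteContent,
    contentType: contentIsHtml ? 'html' : 'text',
  };
}

const richRequest = buildImportRequest('<h1>Title</h1>', true);
const plainRequest = buildImportRequest('Title', false);
```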

What the Sanitizer Handles

Google Docs

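Wraps everything in <b id="docs-internal-guid-...">. Uses <span style="..."> for all inline formatting (font-weight for bold, font-style for italic). The sanitizer strips the wrapper, drops inline styles, and relies on the semantic tags that Docs also includes (<h1>, <ul>, <a>, etc.). Because bold and italic exist only as inline styles, they are lost when styles are dropped unless they are converted to <strong> and <em> first (see Optional Enhancement). Tables come through as real <table> elements.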
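A small worked example of the Google Docs case follows. The sample string is hand-written and heavily abbreviated (real Docs payloads carry many more style properties), and `stripDocsWrapper` is a hypothetical name for the wrapper step, reusing the wrapper pattern from the sanitizer.

```typescript
// Hand-written, abbreviated sample in the shape Google Docs puts on the clipboard.
const docsSample =
  '<b style="font-weight:normal;" id="docs-internal-guid-1234abcd">' +
  '<h1 dir="ltr"><span style="font-size:20pt;">Title</span></h1>' +
  '<p dir="ltr"><span style="font-weight:700;">Bold text</span></p>' +
  '</b>';

// Remove the <b id="docs-internal-guid-..."> wrapper, keeping its contents.
function stripDocsWrapper(html: string): string {
  return html.replace(/<b[^>]*docs-internal-guid[^>]*>([\s\S]*?)<\/b>/gi, '$1');
}

const unwrapped = stripDocsWrapper(docsSample);

// Dropping inline styles afterwards keeps the <h1> but leaves the bold run
// as a bare <span>, which is why bold is lost without a conversion pass.
const withoutStyles = unwrapped.replace(/\sstyle="[^"]*"/gi, '');
```

Note that the non-greedy wrapper pattern stops at the first closing `</b>`, so it assumes the payload contains no nested `<b>` elements.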

Notion

Clean HTML with data-block-id attributes on each block. Headings, lists, and toggles are well-structured. Main cleanup is stripping the data attributes and Notion-specific class names. Notion also puts a text/plain version on the clipboard, so reading text/html first matters -- a handler that reads only text/plain would receive the version with all structure stripped.

Microsoft Word

The messiest source. Includes XML namespace declarations (o:, v:, w:), conditional comments (<!--[if gte mso 9]>), and MsoNormal class names on everything. The sanitizer strips all of this. Formatting comes through as inline styles rather than semantic tags, so after style-stripping, bold/italic may be lost. An enhancement would be to convert font-weight: bold to <strong> before stripping styles.
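The Word cleanup can be illustrated on a tiny sample. The input below is hand-written and far shorter than a real Word payload, and `stripWordMarkup` is a hypothetical helper that combines the Word-specific patterns with class stripping.

```typescript
// Hand-written, abbreviated sample of Word clipboard HTML.
const wordSample =
  '<!--[if gte mso 9]><xml><o:OfficeDocumentSettings></o:OfficeDocumentSettings></xml><![endif]-->' +
  '<p class="MsoNormal"><b>Heading</b><o:p></o:p></p>';

function stripWordMarkup(html: string): string {
  return html
    // Conditional comments, including everything between [if ...] and [endif]
    .replace(/<!--\[if[\s\S]*?<!\[endif\]-->/gi, '')
    // Namespaced Office tags such as <o:p>, <v:shape>, <w:WordDocument>
    .replace(/<\/?[ovw]:[^>]*>/gi, '')
    // MsoNormal and other class names
    .replace(/\sclass="[^"]*"/gi, '');
}

const cleanedWord = stripWordMarkup(wordSample);
// cleanedWord === '<p><b>Heading</b></p>'
```

The `<b>` here survives because it is a real tag; Word formatting expressed only through inline styles would not.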

Generic Web Pages

Wildly variable. The strategy is conservative: strip all styles, classes, IDs, and non-semantic attributes. What survives is the document structure (headings, paragraphs, lists, tables, links, emphasis). This is good enough for most cases and avoids injecting someone else's CSS into the doc.
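A short sketch of the conservative strategy on generic markup follows. `stripPresentation` is a hypothetical name, and the patterns are the double-quoted forms used by the sanitizer's style and attribute steps.

```typescript
// Drop presentational and vendor attributes; leave tags and href untouched.
function stripPresentation(html: string): string {
  return html
    .replace(/\sstyle="[^"]*"/gi, '')
    .replace(/\s(class|id|dir|role|data-[\w-]+)="[^"]*"/gi, '');
}

const genericSample =
  '<h2 class="title" id="intro" style="color:red">Intro</h2>' +
  '<a href="https://example.com" class="btn" data-track="x1">Link</a>';

const structureOnly = stripPresentation(genericSample);
// structureOnly === '<h2>Intro</h2><a href="https://example.com">Link</a>'
```

The heading and the link target survive, while everything that would carry the source page's styling is gone.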


Edge Cases


Optional Enhancement: Style-to-Semantic Conversion

For Word and some Google Docs content, formatting is expressed as inline styles rather than semantic HTML. A conversion pass, run inside the sanitizer before the style-stripping step, could turn these styles into semantic tags:

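    function inlineStylesToSemantic(html: string): string {
      // Match innermost spans that carry a style attribute (double- or single-quoted).
      return html.replace(
        /<span\b[^>]*?\sstyle=(["'])([\s\S]*?)\1[^>]*>((?:(?!<\/?span\b)[\s\S])*?)<\/span>/gi,
        (match: string, _quote: string, style: string, inner: string) => {
          // font-weight: bold or 700-900 becomes <strong>
          const bold = /font-weight:\s*(bold|[7-9]00)/i.test(style);
          // font-style: italic becomes <em>
          const italic = /font-style:\s*italic/i.test(style);
          if (!bold && !italic) return match;
          // A span can be both, so wrap rather than pick one
          let out = inner;
          if (italic) out = `<em>${out}</em>`;
          if (bold) out = `<strong>${out}</strong>`;
          return out;
        }
      );
    }

    // Call this BEFORE stripping inline styles in sanitizeClipboardHtml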

This ensures bold and italic survive the sanitization pass even from sources that don't use semantic tags.
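The ordering requirement can be checked with a two-line comparison. The helpers below are simplified stand-ins written for this illustration (`boldSpansToStrong` handles bold only, `stripStyles` handles double-quoted style attributes only); they are not the full functions from this spec.

```typescript
// Simplified stand-in: convert bold-styled spans to <strong>.
function boldSpansToStrong(html: string): string {
  return html.replace(
    /<span[^>]*font-weight:\s*(bold|[7-9]00)[^>]*>([\s\S]*?)<\/span>/gi,
    '<strong>$2</strong>'
  );
}

// Simplified stand-in: drop double-quoted style attributes.
function stripStyles(html: string): string {
  return html.replace(/\sstyle="[^"]*"/gi, '');
}

const styledInput = '<p><span style="font-weight: 700">Important</span> note</p>';

// Convert first, then strip: the bold run survives as <strong>.
const convertThenStrip = stripStyles(boldSpansToStrong(styledInput));

// Strip first, then convert: the style is already gone, so nothing matches.
const stripThenConvert = boldSpansToStrong(stripStyles(styledInput));
```

Here `convertThenStrip` is `<p><strong>Important</strong> note</p>`, while `stripThenConvert` is `<p><span>Important</span> note</p>`.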


Testing Checklist