Implementation Spec
Rich Paste for Import Dialog
Accept HTML from clipboard when users paste from Google Docs, Notion, Word, or any webpage into the import flow.
2 new files
1 modified file
~200 lines
What Changes
The Paste tab currently uses a <textarea>, which only captures text/plain from the clipboard. When a user copies formatted content from Google Docs and pastes, all formatting is lost.
The fix: replace the textarea with a contenteditable div that intercepts the paste event and extracts text/html from the clipboard. A sanitizer cleans the HTML before passing it downstream. The UI stays visually identical -- same border, same placeholder, same sizing.
User-Facing Behavior
Paste from Google Docs, Notion, or Word -- headings, bold, italic, lists, tables, and links are preserved. Paste raw text or code -- works exactly as before. A subtle "Rich content detected" indicator appears when HTML is captured, with an option to switch to plain text view.
File 1: Clipboard HTML Sanitizer
lib/clipboard-sanitize.ts
Strips vendor wrappers, dangerous content, and unnecessary styling while preserving semantic structure.
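// Where the clipboard HTML came from; drives vendor-specific cleanup.
export type ClipboardSource = 'google-docs' | 'notion' | 'word' | 'generic';

export function sanitizeClipboardHtml(raw: string): {
  html: string;
  source: ClipboardSource;
} {
  const source = detectSource(raw);
  let html = raw;

  // Vendor wrappers go first: they are matched by markers such as
  // id="docs-internal-guid-...", which the attribute stripping below removes.
  html = stripVendorWrappers(html, source);

  // Dangerous or non-content elements.
  html = html.replace(/<script[\s\S]*?<\/script>/gi, '');
  html = html.replace(/<style[\s\S]*?<\/style>/gi, '');
  html = html.replace(/<meta[^>]*>/gi, '');
  html = html.replace(/<link[^>]*>/gi, '');

  // Inline event handlers: double-quoted, single-quoted, or unquoted.
  html = html.replace(/\son\w+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi, '');

  // Presentational and bookkeeping attributes.
  html = html.replace(/\sstyle\s*=\s*(?:"[^"]*"|'[^']*')/gi, '');
  html = html.replace(
    /\s(?:class|id|dir|role|data-[\w-]+)\s*=\s*(?:"[^"]*"|'[^']*')/gi,
    ''
  );

  // Unwrap spans left without attributes. Only innermost spans match,
  // so repeat until nothing changes to handle nesting.
  let previous: string;
  do {
    previous = html;
    html = html.replace(/<span>((?:(?!<\/?span\b)[\s\S])*?)<\/span>/gi, '$1');
  } while (html !== previous);

  html = html.replace(/\n{3,}/g, '\n\n').trim();
  return { html, source };
}

// Identifies the source application from markers it leaves in the HTML.
function detectSource(html: string): ClipboardSource {
  if (html.includes('docs-internal-guid') ||
      html.includes('google-sheets-html-origin')) {
    return 'google-docs';
  }
  if (html.includes('notion-') ||
      html.includes('data-block-id')) {
    return 'notion';
  }
  if (html.includes('urn:schemas-microsoft-com:office') ||
      html.includes('MsoNormal')) {
    return 'word';
  }
  return 'generic';
}

function stripVendorWrappers(html: string, source: ClipboardSource): string {
  if (source === 'google-docs') {
    // Docs wraps the whole fragment in <b id="docs-internal-guid-...">.
    html = html.replace(
      /<b[^>]*docs-internal-guid[^>]*>([\s\S]*?)<\/b>/gi,
      '$1'
    );
    html = html.replace(
      /<google-sheets-html-origin>([\s\S]*?)<\/google-sheets-html-origin>/gi,
      '$1'
    );
  }
  if (source === 'word') {
    // Conditional comments and Office XML namespace tags (o:, v:, w:).
    html = html.replace(/<!--\[if[\s\S]*?<!\[endif\]-->/gi, '');
    html = html.replace(/<\/?o:[^>]*>/gi, '');
    html = html.replace(/<\/?v:[^>]*>/gi, '');
    html = html.replace(/<\/?w:[^>]*>/gi, '');
  }
  return html;
}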
File 2: Rich Paste Component
components/import/rich-paste-area.tsx
Drop-in replacement for the textarea. Looks identical, intercepts paste, extracts HTML when available.
'use client';
import { useCallback, useRef, useState, type ClipboardEvent } from 'react';
import { sanitizeClipboardHtml } from '@/lib/clipboard-sanitize';

interface RichPasteAreaProps {
  // Accepted for parity with the textarea. The div below is uncontrolled
  // and does not read it.
  value: string;
  onChange: (value: string, isHtml: boolean) => void;
  placeholder?: string;
  className?: string;
}

const SOURCE_LABELS: Record<string, string> = {
  'google-docs': 'Google Docs',
  'notion': 'Notion',
  'word': 'Microsoft Word',
  'generic': 'Rich content',
};

export function RichPasteArea({
  value, onChange, placeholder, className
}: RichPasteAreaProps) {
  const [isRichContent, setIsRichContent] = useState(false);
  const [source, setSource] = useState<string | null>(null);
  const editableRef = useRef<HTMLDivElement>(null);
  // text/plain flavor of the last paste, restored by "Use plain text".
  const plainTextRef = useRef('');

  const handlePaste = useCallback((e: ClipboardEvent<HTMLDivElement>) => {
    // Stop the browser from inserting formatted nodes; we write the div ourselves.
    e.preventDefault();
    const clipboardHtml = e.clipboardData.getData('text/html');
    const clipboardText = e.clipboardData.getData('text/plain');
    plainTextRef.current = clipboardText;
    const el = editableRef.current;

    if (clipboardHtml.trim().length > 0) {
      const { html, source: detectedSource } =
        sanitizeClipboardHtml(clipboardHtml);
      setIsRichContent(true);
      setSource(detectedSource);
      onChange(html, true);
      // innerText shows the sanitized markup as text rather than rendering it.
      if (el) el.innerText = html;
    } else {
      setIsRichContent(false);
      setSource(null);
      onChange(clipboardText, false);
      if (el) el.innerText = clipboardText;
    }
  }, [onChange]);

  // Any manual edit after a rich paste drops back to plain-text mode.
  const handleInput = useCallback(() => {
    if (!editableRef.current) return;
    onChange(editableRef.current.innerText, false);
    if (isRichContent) {
      setIsRichContent(false);
      setSource(null);
    }
  }, [onChange, isRichContent]);

  // Swap the sanitized HTML for the clipboard's own plain-text version.
  const switchToPlainText = useCallback(() => {
    const plain = plainTextRef.current;
    if (editableRef.current) editableRef.current.innerText = plain;
    onChange(plain, false);
    setIsRichContent(false);
    setSource(null);
  }, [onChange]);

  return (
    <div className="relative">
      <div
        ref={editableRef}
        contentEditable
        onPaste={handlePaste}
        onInput={handleInput}
        data-placeholder={placeholder}
        className={`
          min-h-[200px] max-h-[400px] overflow-y-auto
          w-full rounded-xl border border-border bg-background
          px-4 py-3 text-sm font-mono
          focus:outline-none focus:ring-2 focus:ring-ring
          empty:before:content-[attr(data-placeholder)]
          empty:before:text-muted-foreground
          ${className ?? ''}
        `}
      />
      {isRichContent && (
        <div className="absolute top-2 right-2 flex items-center gap-2 rounded-lg bg-primary/10 px-2.5 py-1 text-xs font-medium text-primary">
          <span>
            {SOURCE_LABELS[source ?? 'generic']} detected
          </span>
          <button
            type="button"
            onClick={switchToPlainText}
            className="text-muted-foreground hover:text-foreground underline underline-offset-2"
          >
            Use plain text
          </button>
        </div>
      )}
    </div>
  );
}
File 3: Integration
Existing import dialog -- Paste tab
Swap the textarea for the new component. The isHtml flag tells the downstream processing whether to treat the input as HTML or plain text.
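Before
<textarea
  value={pasteContent}
  onChange={(e) => setPasteContent(e.target.value)}
  placeholder="Paste your content here..."
  className="min-h-[200px] w-full rounded-xl border ..."
/>
After
// New state next to pasteContent; tracks whether the content is HTML.
const [contentIsHtml, setContentIsHtml] = useState(false);
<RichPasteArea
  value={pasteContent}
  onChange={(content, isHtml) => {
    setPasteContent(content);
    setContentIsHtml(isHtml);
  }}
  placeholder="Paste your content here..."
/>
// Wherever the dialog hands the content to downstream processing:
const contentType = contentIsHtml ? 'html' : 'text';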
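As a rough illustration of how downstream code might consume the flag, the sketch below models the two cases as a discriminated union so that HTML and plain text cannot be confused later. The names ImportContent, toImportContent, and describeImport are hypothetical and do not come from the existing import flow.

```typescript
// Hypothetical shape for content handed to downstream processing.
type ImportContent =
  | { contentType: 'html'; html: string }
  | { contentType: 'text'; text: string };

// Builds the payload from the (content, isHtml) pair that RichPasteArea reports.
function toImportContent(content: string, isHtml: boolean): ImportContent {
  return isHtml
    ? { contentType: 'html', html: content }
    : { contentType: 'text', text: content };
}

// Example consumer: the switch is exhaustive, so adding a new content type
// later produces a compile error here until it is handled.
function describeImport(input: ImportContent): string {
  switch (input.contentType) {
    case 'html':
      return `HTML import (${input.html.length} chars)`;
    case 'text':
      return `Plain-text import (${input.text.length} chars)`;
  }
}
```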
What the Sanitizer Handles
Google Docs
Wraps everything in <b id="docs-internal-guid-...">. Uses <span style="..."> for inline formatting (font-weight for bold, font-style for italic). The sanitizer strips the wrapper, drops inline styles, and relies on the semantic tags that Docs also includes (<h1>, <ul>, <a>, etc.). Because bold and italic are carried by inline styles, they are lost at this step unless the styles are converted to semantic tags first. Tables come through as real <table> elements.
Notion
Clean HTML with data-block-id attributes on each block. Headings, lists, and toggles are well-structured. The main cleanup is stripping the data attributes and Notion-specific class names. Notion also puts a plain-text fallback on the clipboard, so checking for text/html first is important: without that check, users would get the stripped version.
Microsoft Word
The messiest source. Includes XML namespace declarations (o:, v:, w:), conditional comments (<!--[if gte mso 9]>), and MsoNormal class names on everything. The sanitizer strips all of this. Formatting comes through as inline styles rather than semantic tags, so after style-stripping, bold/italic may be lost. An enhancement would be to convert font-weight: bold to <strong> before stripping styles.
Generic Web Pages
Wildly variable. The strategy is conservative: strip all styles, classes, IDs, and non-semantic attributes. What survives is the document structure (headings, paragraphs, lists, tables, links, emphasis). This is good enough for most cases and avoids injecting someone else's CSS into the doc.
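One way to make "what survives is the document structure" explicit is a tag allowlist applied after attribute stripping. The following is only a sketch: the function name keepAllowedTags and the exact contents of the allowlist are assumptions, not part of the sanitizer above, and the regex assumes no attribute value contains a > character.

```typescript
// Hypothetical allowlist mirroring the structure listed above:
// headings, paragraphs, lists, tables, links, emphasis (plus line breaks).
const ALLOWED_TAGS = new Set([
  'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
  'p', 'br',
  'ul', 'ol', 'li',
  'table', 'thead', 'tbody', 'tr', 'th', 'td',
  'a',
  'strong', 'em', 'b', 'i',
]);

// Removes every opening or closing tag whose name is not on the allowlist.
// The text between removed tags is kept, so <div><p>x</p></div> becomes <p>x</p>.
function keepAllowedTags(html: string): string {
  return html.replace(
    /<\/?([a-z][a-z0-9-]*)\b[^>]*>/gi,
    (tag: string, tagName: string) =>
      ALLOWED_TAGS.has(tagName.toLowerCase()) ? tag : ''
  );
}
```

An allowlist fails closed: an unfamiliar vendor tag is dropped by default instead of needing its own rule.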
Edge Cases
- User pastes raw HTML source code (e.g., from VS Code) -- the clipboard has text/html, but it's the rendered version of the code, not the code itself. Solution: VS Code puts the original in text/plain. When the detected source is generic and the plain text starts with <, prefer plain text. Or: let the "Use plain text" button handle it.
- User types after pasting -- the component reverts to plain text mode on any input event after a rich paste. This prevents a confusing state where half the content is HTML and half is plain text.
- Multiple pastes -- each paste replaces the previous content. This differs from a textarea, where a paste is inserted at the cursor. If you want append behavior, the handlePaste function would need to merge rather than replace.
- Images in clipboard HTML -- Google Docs includes base64 data: URIs for images. These survive the sanitizer but may be large. Consider stripping images or capping size. For v1, stripping <img> tags is the safest default; image support can be added later.
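Two of the edge cases above reduce to small pure helpers. The sketch below shows one possible shape for each; shouldPreferPlainText and stripImages are hypothetical names, and neither is wired into the sanitizer or the component as specified.

```typescript
// First edge case: when the source is generic and the plain text looks like
// markup, the user most likely copied HTML source from an editor, so the
// text/plain flavor is the one to keep.
function shouldPreferPlainText(source: string, plainText: string): boolean {
  return source === 'generic' && plainText.trimStart().startsWith('<');
}

// Last edge case, v1 default: drop <img> tags entirely, which also discards
// any large base64 data: URIs they carry. <img> is a void element, so there
// is no closing tag to remove.
function stripImages(html: string): string {
  return html.replace(/<img\b[^>]*>/gi, '');
}
```

In handlePaste, the first helper would be consulted after sanitizeClipboardHtml returns its source, and the second could run as an extra step inside the sanitizer.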
Optional Enhancement: Style-to-Semantic Conversion
For Word and some Google Docs content, formatting is expressed as inline styles rather than semantic HTML. A conversion pass that runs before the sanitizer strips styles could turn these into semantic tags:
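function inlineStylesToSemantic(html: string): string {
  // Bold: font-weight of "bold" or a numeric weight from 700 to 900.
  // A span carrying both bold and italic is consumed by this rule,
  // so only the bold survives.
  html = html.replace(
    /<span[^>]*font-weight:\s*(bold|[7-9]00)[^>]*>([\s\S]*?)<\/span>/gi,
    '<strong>$2</strong>'
  );
  // Italic: font-style: italic.
  html = html.replace(
    /<span[^>]*font-style:\s*italic[^>]*>([\s\S]*?)<\/span>/gi,
    '<em>$2</em>'
  );
  return html;
}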
Run before the style attributes are stripped, this lets bold and italic survive sanitization even from sources that don't use semantic tags.
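The ordering matters because the conversion keys on the style attribute. The sketch below demonstrates this with a bold-only variant of the conversion and the double-quoted style-stripping rule. The names boldSpansToStrong and STYLE_ATTR and the sample input are illustrative assumptions.

```typescript
// Same rule the sanitizer uses to drop double-quoted inline styles.
const STYLE_ATTR = /\sstyle="[^"]*"/gi;

// Bold-only variant of the style-to-semantic conversion, for illustration.
function boldSpansToStrong(html: string): string {
  return html.replace(
    /<span[^>]*font-weight:\s*(?:bold|[7-9]00)[^>]*>([\s\S]*?)<\/span>/gi,
    '<strong>$1</strong>'
  );
}

const pasted = '<p><span style="font-weight:700">Total</span> due</p>';

// Convert first, then strip styles: the bold survives as <strong>.
const convertFirst = boldSpansToStrong(pasted).replace(STYLE_ATTR, '');
// convertFirst === '<p><strong>Total</strong> due</p>'

// Strip styles first: nothing is left for the conversion to match.
const stripFirst = boldSpansToStrong(pasted.replace(STYLE_ATTR, ''));
// stripFirst === '<p><span>Total</span> due</p>'
```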
Testing Checklist
- Paste from Google Docs with headings, bold, lists, and a table
- Paste from Notion with toggles and callouts
- Paste from Word with complex formatting
- Paste plain text (should work identically to current behavior)
- Paste raw HTML source code from an editor
- Click "Use plain text" after a rich paste
- Type after pasting (should revert to plain text mode)
- Paste content with images (should strip or handle gracefully)
- Paste an empty clipboard or whitespace-only content
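Several checklist items (plain text, empty clipboard, whitespace-only content) exercise the same decision: whether to use text/html or fall back to text/plain. One option is to pull that decision into a pure function that can be unit tested without a browser. This is a sketch under that assumption: ClipboardLike, pickPastePayload, and fakeClipboard are hypothetical names, and the component as specified keeps this logic inline in handlePaste.

```typescript
// Minimal stand-in for the part of DataTransfer that paste handling reads.
interface ClipboardLike {
  getData(format: string): string;
}

type PastePayload =
  | { kind: 'html'; content: string }
  | { kind: 'text'; content: string };

// Mirrors the branch in handlePaste: use text/html only when it holds
// something other than whitespace, otherwise fall back to text/plain.
function pickPastePayload(clipboard: ClipboardLike): PastePayload {
  const html = clipboard.getData('text/html');
  if (html.trim().length > 0) {
    return { kind: 'html', content: html };
  }
  return { kind: 'text', content: clipboard.getData('text/plain') };
}

// Test helper: builds a fake clipboard from a format-to-data map.
// Like DataTransfer.getData, it returns '' for formats that are absent.
function fakeClipboard(data: Record<string, string>): ClipboardLike {
  return { getData: (format: string) => data[format] ?? '' };
}
```

A test can then call pickPastePayload(fakeClipboard({ 'text/plain': 'hello' })) and check that the result has kind 'text', with similar cases for an empty map and for whitespace-only HTML.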