Cheat Sheet · Last Lap Before The Interview

Anthropic Interview — All Seven Problems

Stage-by-stage summaries, dense pseudocode, complexity, and the discussion follow-ups. Skim end-to-end first, then drill the three you flagged as priority: image processor, web crawler, LRU.

01

Web Crawler

priority · I/O-bound · threads

Given a start URL and a provider with get_links(url), return every URL reachable that shares the start's hostname. Then make it parallel.

Stage 1 single-threaded BFS

Approach. Standard BFS — queue + visited set. Mark visited at enqueue time, not at pop (otherwise duplicates leak in). Strip # and ? before comparing. Skip off-host links.

from collections import deque

def crawl_same_host(start_url, provider):
    target_host = provider.get_hostname(start_url)
    visited = {start_url}
    queue = deque([start_url])

    while queue:
        url = queue.popleft()
        for link in provider.get_links(url):
            link = link.split("#", 1)[0].split("?", 1)[0]
            if provider.get_hostname(link) != target_host: continue
            if link in visited: continue
            visited.add(link)
            queue.append(link)
    return visited

Time O(V + E) Space O(V)

Stage 2 multithreaded

The hard part is termination. "Queue empty" is NOT a stop signal — a worker can pull the last URL, then enqueue children before another worker checks. Track in-flight work, not just queue depth.

Concurrency primitives

  • queue.Queue — already thread-safe per put/get.
  • threading.Lock around visited set (Python set isn't thread-safe).
  • in_flight counter + lock + threading.Event for done signal.
  • ThreadPoolExecutor(max_workers=N) for bounded parallelism.
from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty
import threading

def crawl_same_host_mt(start_url, provider, num_workers=4):
    target_host = provider.get_hostname(start_url)
    visited = {start_url}
    visited_lock = threading.Lock()
    in_flight = [1]
    in_flight_lock = threading.Lock()
    done_event = threading.Event()
    q = Queue()
    q.put(start_url)

    def worker():
        while not done_event.is_set():
            try: url = q.get(timeout=0.05)
            except Empty: continue
            try:
                for link in provider.get_links(url):
                    link = link.split("#", 1)[0].split("?", 1)[0]
                    if provider.get_hostname(link) != target_host: continue
                    with visited_lock:
                        if link in visited: continue
                        visited.add(link)
                    with in_flight_lock: in_flight[0] += 1
                    q.put(link)
            finally:
                with in_flight_lock:
                    in_flight[0] -= 1
                    if in_flight[0] == 0: done_event.set()

    with ThreadPoolExecutor(max_workers=num_workers) as ex:
        futures = [ex.submit(worker) for _ in range(num_workers)]
        done_event.wait()
        for f in futures: f.result()
    return visited
Gotchas: use queue.get(timeout=...) not blocking get (need to poll done_event); always try/finally around the body so the in_flight decrement runs even on exception; in_flight = [1] not plain int (closure mutation), OR use nonlocal.

Time O(V + E) Wall-clock ≈ /num_workers

Stage 3 discussion follow-ups

  • Threads vs processes. I/O-bound (network/disk waits) → threads. CPU-bound → processes (GIL). Crawling is I/O-bound, so threads.
  • Distributed crawling. Partition by hostname via consistent hashing; central queue or shared Redis/Memcached for the URL frontier. Don't ship page content across machines — fetch locally.
  • Politeness. Per-domain rate limit (e.g. 1 req/sec), respect robots.txt + crawl-delay. Adaptive backoff on errors. Distributed coordination via shared rate-limit counter.
  • Duplicate content. Hash page bytes (MD5/SHA-256) for exact dupes; simhash / shingling for near-dupes; canonicalize URLs (strip tracking params, follow <link rel="canonical">).
02

Batch Image Processor

priority · CPU-bound · processes

Apply each transform JSON to each image. 6 transform types (3 with params). Helper get_output_path(image, transform) tells you where to save. Then parallelize.

Stage 1 sequential

Library: Pillow (easiest). Approach: enumerate image files, enumerate transforms, for each pair apply transforms in order, save.

Pillow API cheat sheet

  • Image.open(path) → image
  • ImageOps.grayscale(img) → mode 'L' (single channel — .convert('RGB') if next op expects RGB or saving as JPEG)
  • ImageOps.mirror(img) — flip horizontal
  • ImageOps.flip(img) — flip vertical
  • ImageOps.scale(img, factor) — scale by factor
  • img.filter(ImageFilter.GaussianBlur(radius))
  • img.rotate(angle, expand=True)expand=True to swap dims on 90°
from PIL import Image, ImageOps, ImageFilter
import os, json

def apply_transform(image_path, transform_path):
    with open(transform_path) as f:
        data = json.load(f)
    img = Image.open(image_path).copy()
    for t in data["transformations"]:
        kind = t["type"]
        if kind == "grayscale":        img = ImageOps.grayscale(img)
        elif kind == "flip_horizontal": img = ImageOps.mirror(img)
        elif kind == "flip_vertical":   img = ImageOps.flip(img)
        elif kind == "scale":            img = ImageOps.scale(img, t["factor"])
        elif kind == "blur":             img = img.filter(ImageFilter.GaussianBlur(t["radius"]))
        elif kind == "rotate":           img = img.rotate(t["angle"], expand=True)
    return img

def process_images(image_dir, transformation_dir, output_dir, get_output_path):
    transforms = [os.path.join(transformation_dir, f)
                  for f in os.listdir(transformation_dir)
                  if f.lower().endswith(".json")]
    images = [os.path.join(image_dir, f)
              for f in os.listdir(image_dir)
              if f.lower().endswith((".png", ".jpg", ".jpeg"))]

    for image in images:
        for transform in transforms:
            result = apply_transform(image, transform)
            output = get_output_path(image, transform)
            os.makedirs(os.path.dirname(output), exist_ok=True)
            result.save(output)
Gotchas: rotate defaults to expand=False (crops back to original canvas); .lower().endswith() for case-insensitive extension match; os.makedirs(os.path.dirname(out), exist_ok=True) before save (handles nested output paths); open(transform_path) not json.load(transform_path) — pass file object, not path.

Stage 2 parallel

This is CPU-bound (pixel ops). Use ProcessPoolExecutor, not threads. Each (image, transform) pair is independent. Worker MUST be module-scope (pickling) and should ship paths only — never decoded image buffers (huge IPC cost).

from concurrent.futures import ProcessPoolExecutor

# MODULE SCOPE — must be importable by name, not nested or lambda
def _worker(args):
    image_path, transform_path, output_path = args
    result = apply_transform(image_path, transform_path)
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    result.save(output_path)

def process_images_parallel(image_dir, transformation_dir, output_dir,
                            get_output_path, max_workers=None):
    jobs = [(img, tf, get_output_path(img, tf))
            for img in images
            for tf in transforms]
    with ProcessPoolExecutor(max_workers=max_workers) as ex:
        list(ex.map(_worker, jobs))  # list() forces exceptions to surface
Why list() the map? ex.map(...) returns a lazy iterator. Work IS submitted eagerly but exceptions from workers only surface as you iterate. list(...) drains the iterator → propagates errors.

Stage 3 discussion follow-ups

  • Threads vs processes nuance. Default rule: CPU-bound → processes. But Pillow does the heavy work in C and releases the GIL — so threads can also help! Counter: pickling decoded images for processes is expensive. Honest answer: measure both.
  • Memory. Decoded RGB image = width × height × 3 bytes. 8000×6000 ≈ 144 MB. Multiply by max_workers. Mitigations: cap workers, downscale early, tile-process large images.
  • Failure handling. One corrupt file shouldn't kill the batch. Per-item try/except, log, continue. Return success/failure manifest.
  • Incremental re-runs. Hash (image bytes, transform JSON) → deterministic output path. Skip if output exists with matching hash.
  • Scaling beyond one machine. Task queue (Celery / SQS / job DB). Embarrassingly parallel. GPU via OpenCV CUDA / cupy for huge batches.
03

LRU Cache (Memoization + Persistence)

priority · WAL · fsync

call(func, *args, **kwargs) memoizes. Capacity-bounded. Survives crashes — same entries AND same recency after restart.

Stage 1 bug-fix the starter

Bugs to find:

  • hash(args) fails on lists/dicts/sets. kwargs is itself a dict.
  • func.__name__ collides across classes — use __qualname__ + __module__.
  • No move_to_end on cache hit (recency not maintained).
  • No eviction when over capacity.
  • func() invoked while holding the lock → serializes all calls.

Key normalization (recursive, hashable canonical form)

def _normalize(v):
    if isinstance(v, dict):
        return tuple(sorted((k, _normalize(x)) for k, x in v.items()))
    if isinstance(v, (set, frozenset)):
        return tuple(sorted(_normalize(x) for x in v))
    if isinstance(v, (list, tuple)):
        return tuple(_normalize(x) for x in v)
    return v

def generate_key(func, *args, **kwargs):
    canonical = (f"{func.__module__}.{func.__qualname__}",
                 _normalize(args), _normalize(kwargs))
    return json.dumps(canonical)  # string key — JSON-roundtrip safe

Two-phase call (compute OUTSIDE lock)

def call(self, func, *args, **kwargs):
    key = self.generate_key(func, *args, **kwargs)
    with self._lock:
        if key in self._cache:
            self._cache.move_to_end(key)
            return self._cache[key]
    value = func(*args, **kwargs)  # outside lock — parallel computes OK
    with self._lock:
        if key in self._cache:  # re-check race
            return self._cache[key]
        self._cache[key] = value
        if len(self._cache) > self.capacity:
            evicted_key, _ = self._cache.popitem(last=False)  # LRU = oldest
        return value

O(1) per call (OrderedDict ops are O(1))

Stage 2 WAL persistence

Three event types — and HIT matters. PUT alone gives correct entries but wrong order; HIT records replay recency.

Write order: mutate memory FIRST, then write WAL + fsync, then return. Matches the "lose last in-flight write" contract. The reverse (write-first) gives "phantom commits" — disk has it but caller didn't see a return value.
def _append_journal(self, record):
    with open(self.path, "a") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()              # Python buffer → OS
        os.fsync(f.fileno())   # OS page cache → physical disk

def _recover(self):
    if not os.path.exists(self.path): return
    with open(self.path) as f:
        for line in f:
            line = line.strip()
            if not line: continue
            try: rec = json.loads(line)
            except json.JSONDecodeError: break  # torn last line
            op = rec[0]
            if op == "PUT":
                self._cache[rec[1]] = rec[2]
                self._cache.move_to_end(rec[1])
                while len(self._cache) > self.capacity:
                    self._cache.popitem(last=False)
            elif op == "HIT":
                if rec[1] in self._cache: self._cache.move_to_end(rec[1])
            elif op == "EVICT":
                self._cache.pop(rec[1], None)  # safe pop
Why string keys? JSON roundtrip turns tuples into lists — and list is unhashable. Sidestep by making generate_key return json.dumps(canonical) from the start. Now keys are strings that survive the disk roundtrip cleanly.

Stage 3 discussion follow-ups

  • fsync is THE durability primitive. write → Python buffer. flush → OS page cache. fsync → physical disk. Without fsync, power loss can lose "successful" writes.
  • Torn last line only. Because fsync is sequential — every line is durable before the next starts. Only the last in-flight write can be incomplete on crash.
  • Snapshots. Replay time grows with journal size. Periodic checkpoint (write full cache state) + truncate old WAL. Bounded recovery time.
  • Concurrency. Multiple threads compute concurrently because we hold the lock only during cache mutation, not the func() call. The test_compute_outside_lock test exercises this.
04

Trace Events (Stack Samples → Timeline)

common-prefix matching

Profiler gives time-sorted samples of (ts, stack) where stack is outermost → innermost. Output start/end events suitable for a flame graph.

Stage 1 longest common prefix

Key insight: compute LCP between prev and current stacks. Frames at index ≥ LCP in prev ended (deepest first). Frames at index ≥ LCP in current started (outer to inner).

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y: return n
        n += 1
    return n

def convert_samples_to_events(samples):
    prev = []
    events = []
    for sample in samples:
        cur = sample.stack
        cp = common_prefix_len(prev, cur)
        # end disappeared frames, deepest first
        for i in range(len(prev) - 1, cp - 1, -1):
            events.append(Event("end", sample.ts, prev[i]))
        # start new frames, outer to inner
        for i in range(cp, len(cur)):
            events.append(Event("start", sample.ts, cur[i]))
        prev = cur
    return events
Range gotcha: range(start, stop, -1) is exclusive on stop. To include index cp in the end loop, stop at cp - 1. Start loop is straightforward — range(cp, len(cur)) is naturally exclusive.

Time O(S × D) Space O(D) — S samples, D max stack depth

Stage 2 debounce N consecutive

Per-frame state parallel to prev_stack. Streak count for each frame. Continuing frame: streak += 1 → emit start at streak == N. New frame: streak = 1. Ends only for confirmed frames (streak ≥ N).

def convert_samples_to_debounced_events(samples, n):
    prev, streaks, events = [], [], []
    for sample in samples:
        cur = sample.stack
        cp = common_prefix_len(prev, cur)

        # ends — only for confirmed frames
        for i in range(len(prev) - 1, cp - 1, -1):
            if streaks[i] >= n:
                events.append(Event("end", sample.ts, prev[i]))

        new_streaks = []
        # continuing frames — bump streak, emit start on crossing N
        for i in range(cp):
            s = streaks[i] + 1
            new_streaks.append(s)
            if s == n: events.append(Event("start", sample.ts, cur[i]))
        # new frames — streak = 1, immediate emit only if n == 1
        for i in range(cp, len(cur)):
            new_streaks.append(1)
            if n == 1: events.append(Event("start", sample.ts, cur[i]))

        prev, streaks = cur, new_streaks
    return events
Why position + ancestors matters: ["a", "b"]["c", "b", "a"]. The "b" in sample 2 is NOT a continuation — its parent changed from "a" to "c". Common prefix is 0; streaks reset.

Stage 3 prefix vs suffix comparison

  • Same leaf, different bookkeeping. A: [main, handler, parse, foo] vs B: [main, handler, parse, bar]. Prefix → 1 end + 1 start. Suffix → 4 ends + 4 starts (wrong — outer frames never returned).
  • Same leaf, different root. A: [threadA, dispatch, run, work] vs B: [threadB, dispatch, run, work]. Prefix says everything unwound and re-pushed. Suffix says "worker kept running." For synchronous stacks, suffix is wrong (can't keep inner alive while replacing outer).
  • Why prefix is industry standard. Matches LIFO call-stack semantics. Produces well-nested events that map cleanly to flame graph tree. Suffix can produce intervals where child outlives parent — unrenderable.
  • Where suffix-style IS useful. Aggregation — "self-time in work regardless of caller." Belongs in the analysis layer (over emitted events), not in event emission. Chrome trace, perfetto, pprof, perf — all prefix.
05

Deduplicate Files

multi-stage filter

Given a root dir, return all groups of files with identical content. Recurse. Group by content, not name. Filter singletons.

Stage 1 naive hash everything

import hashlib, os
from collections import defaultdict

def hash_file(path, chunk_size=64*1024):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):  # walrus + chunks
            h.update(chunk)
    return h.hexdigest()

def find_duplicate_files(root_path):
    groups = defaultdict(list)
    for dirpath, _, filenames in os.walk(root_path):
        for f in filenames:
            path = os.path.join(dirpath, f)
            try: groups[hash_file(path)].append(path)
            except OSError: continue  # skip unreadable
    return [g for g in groups.values() if len(g) > 1]

Time O(total bytes) — reads every file in full

Stage 2 three-stage filter: size → partial → full

Each stage filters down the candidate set. Skip singletons between stages so you only pay the next stage's cost on actual candidates.

def hash_partial(path, n_bytes=4*1024):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(n_bytes)).hexdigest()

def find_duplicate_files_optimized(root_path):
    # Stage A: group by size (pure metadata, no I/O)
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root_path):
        for f in filenames:
            path = os.path.join(dirpath, f)
            try: by_size[os.path.getsize(path)].append(path)
            except OSError: continue

    # Stage B: group by partial (first 4KB) hash
    by_partial = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2: continue
        for p in paths:
            try: by_partial[hash_partial(p)].append(p)
            except OSError: continue

    # Stage C: group by full hash
    by_full = defaultdict(list)
    for paths in by_partial.values():
        if len(paths) < 2: continue
        for p in paths:
            try: by_full[hash_file(p)].append(p)
            except OSError: continue

    return [g for g in by_full.values() if len(g) > 1]

Best O(N files) all unique sizes — metadata only. Worst O(total bytes) all same content. Average: way better than Stage 1.

Stage 3 discussion follow-ups

  • I/O bound vs CPU bound. Profile: CPU idle + high iowait → I/O bound. CPU pegged → CPU bound. Spinning disks: I/O bound. NVMe + small files: CPU bound.
  • Hash function choice. MD5 fast but broken (still fine for non-adversarial dedup). SHA-256 safe default. xxHash / CityHash 10x faster + non-crypto + verify-on-collision. Defensible answer: SHA-256, mention xxHash if profiling shows hash as bottleneck.
  • Distributed implementation. Map: walk + emit (size, path). Shuffle: partition by size so candidates land on same reducer. Reduce: hash within bucket. Never ship file content.
  • Continuous monitoring. inotify / FSEvents. Two tables: hash → [paths], path → hash. Reverse index for cheap deletes. Notify via message queue.
  • Chunk size tuning. 4-64 KB typical. Align with disk block (4 KB). Larger → fewer I/O ops, more memory. Adaptive — scale with file size; skip Stage 2 for tiny files.
06

Distributed Mode and Median

MapReduce-style shuffle

N workers, each with a slice. send(target, data), recv() primitives. Compute global aggregates without funneling all data to one worker (hotspot).

Mode — 4-phase shuffle

The trick: shuffle by key % num_workers so identical keys land on the same worker, but keys distribute. All workers send AND receive in parallel — no hotspot.

Phase 1 — local count
Each worker Counter(local_data). No communication.
Phase 2 — shuffle
Bucket by key % num_workers. Send each bucket to its target. Receive from N-1 others. Keep own bucket.
Phase 3 — aggregate + local top
Sum counts per key (now complete for this worker's owned keys). Pick local max with tie-break.
Phase 4 — reduce to worker 0
Workers 1..N-1 send their one (key, count) tuple. Worker 0 compares N tuples, picks global winner.
from collections import Counter, defaultdict

def find_mode(local_data, worker_id, num_workers, send, recv):
    # Phase 1
    local = Counter(local_data)

    # Phase 2 — shuffle
    buckets = defaultdict(list)
    for k, c in local.items():
        buckets[k % num_workers].append((k, c))
    for t in range(num_workers):
        if t != worker_id: send(t, buckets[t])
    received = list(buckets[worker_id])
    for _ in range(num_workers - 1):
        received.extend(recv())

    # Phase 3
    agg = Counter()
    for k, c in received: agg[k] += c
    if agg:
        # maximize count, then minimize key on tie
        local_top = max(agg.items(), key=lambda kv: (kv[1], -kv[0]))
    else:
        local_top = (None, 0)

    # Phase 4
    if worker_id != 0:
        send(0, local_top)
        return None
    best_k, best_c = local_top
    for _ in range(num_workers - 1):
        rk, rc = recv()
        if rc > best_c or (rc == best_c and rk is not None
                              and (best_k is None or rk < best_k)):
            best_k, best_c = rk, rc
    return best_k

Median — distributed quickselect

Reuse phases 1-2 to build a distributed histogram. Then coordinator-led binary search over the value range.

  • Compute total_count (all → 0 → broadcast).
  • Compute global min/max for the search range.
  • Binary search loop: worker 0 picks pivot; each worker counts local less + equal w.r.t. pivot; worker 0 aggregates + narrows range; broadcast.
  • Terminate when low == high.
# After building histogram + total_count + global_min/max:
target = total_count // 2  # for even count, upper middle
low, high = global_min, global_max
while low < high:
    if worker_id == 0:
        pivot = (low + high) // 2
        for i in range(1, num_workers): send(i, pivot)
    else:
        pivot = recv()

    less = sum(c for v, c in histogram.items() if v < pivot)
    equal = histogram.get(pivot, 0)

    if worker_id == 0:
        tl, te = less, equal
        for _ in range(num_workers - 1):
            rl, re = recv(); tl += rl; te += re
        if target < tl:        new = (low, pivot - 1)
        elif target < tl + te: new = (pivot, pivot)
        else:                  new = (pivot + 1, high)
        for i in range(1, num_workers): send(i, new)
        low, high = new
    else:
        send(0, (less, equal))
        low, high = recv()

return low if worker_id == 0 else None

Mode: O(distinct values) shipped Median: O(log value_range) rounds × tiny messages

Stage 3 discussion follow-ups

  • Data skew. key % num_workers uniform only if keys are uniform. Better hashes: MurmurHash, xxHash. Salted hash defeats adversarial keys. Two-level partitioning + dynamic rebalancing for severe skew.
  • Fault tolerance. If a worker dies mid-phase, others block on recv() forever. Mitigations: heartbeats + coordinator timeout → reassign; checkpoint between phases; replication (K copies); idempotent phases.
  • Communication optimization. Async I/O overlap. Batching (one big message vs many small). Tree reduction (log N hops vs star gather) for large N. Compression. Combine before shuffle (send (key, 1000), not 1000 × (key, 1)).
07

Tokenize / Detokenize

code reading + DP

Convert text ↔ token IDs against a vocab dict. Handle unknowns with an UNK token. Stage 1 = bug-fix the starter. Stage 2 = optimal segmentation via DP.

Stage 1 bug-fix greedy first-match

Bugs in the starter:

  • Unknown char in middle → silently swallows rest of text (key never resets).
  • Trailing unmatched key → silently dropped at end of loop.
  • No UNK emission at all.
  • detokenize crashes with KeyError on tokens not in reversed vocab.

Fix: add the "no possible match" check — if no vocab key starts with the current accumulated key, emit UNK + reset. After loop, if key non-empty, one more UNK.

def tokenize(text, vocab):
    tokens, key = [], ""
    for ch in text:
        key += ch
        if key in vocab:
            tokens.append(vocab[key])
            key = ""
        elif not any(k.startswith(key) for k in vocab):
            tokens.append(vocab.get("UNK", -1))
            key = ""
    if key:
        tokens.append(vocab.get("UNK", -1))
    return tokens

def detokenize(tokens, vocab):
    rev = {v: k for k, v in vocab.items()}
    return "".join(rev.get(t, "?") for t in tokens)
Why use generator (not list comp) in any(...): generators short-circuit on first True — any(k.startswith(key) for k in vocab) stops as soon as one match is found. With any([...]) the list is materialized first → full V scan even if first match is at index 0.

O(N × V × L) — N text length, V vocab size, L avg key length. Trie optimization → O(N × L).

Stage 2 optimal segmentation via DP

Greedy is wrong when longer matches exist. vocab = {"ab":1, "abc":2, "c":3, "UNK":-1}, text = "abc": greedy gives [1, 3] but optimal is [2].

Recurrence: best(i) = optimal segmentation of text[i:]. At each position, try every vocab key matching there; also try UNK (consume 1 char). Compare candidates by (unks, tokens) lexicographically.

from functools import lru_cache

def tokenize_optimal(text, vocab):
    unk_id = vocab.get("UNK", -1)

    @lru_cache(maxsize=None)
    def best(i):
        if i == len(text):
            return (0, 0, ())  # (unks, tokens, segmentation)
        candidates = []
        for k, v in vocab.items():
            if k == "UNK": continue
            if text[i:i+len(k)] == k:
                su, st, ss = best(i + len(k))
                candidates.append((su, st + 1, (v,) + ss))
        # UNK option — consume 1 char
        su, st, ss = best(i + 1)
        candidates.append((su + 1, st + 1, (unk_id,) + ss))
        # lex min: fewer unks first, then fewer tokens
        return min(candidates, key=lambda c: (c[0], c[1]))

    return list(best(0)[2])
Use tuples (not lists) for cached return: lru_cache returns the same object on hit. If anyone mutated the returned list, the cache is corrupted. Tuples are immutable, safe by default. Convert to list only at the final return.

Stage 3 discussion follow-ups

  • Tokenization strategies. Character (no OOV, long sequences). Word (huge vocab, severe OOV). Subword / BPE / WordPiece / SentencePiece — modern default; most words decompose into a few known pieces. Byte-level BPE (GPT-2) — no UNK ever, longest sequences.
  • Trie optimization. Build a prefix tree from the vocab once → walk it char by char during tokenize. O(N × L) instead of O(N × V × L). The two checks ("is current key in vocab?" / "could it still grow?") become O(1) node lookups.
  • Optimal in practice. SentencePiece "unigram" mode uses Viterbi over a unigram LM (pick highest-probability segmentation, not just fewest tokens). For NER / MT, lattice decoding keeps multiple candidates alive.
  • UNK handling in real NLP. Modern subword tokenizers basically eliminate UNK. UNK in training data degrades model quality dramatically. Byte-level BPE is the "no UNK ever" option — and the GPT-style default.

Last-minute reminders before you walk in

Good luck. You've already coded all of these once tonight — your hands know the shape. Trust the prep.