Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?
Pixels vs. Text as LLM Input
- Core idea discussed: render all text to images and feed only visual tokens into models, effectively “killing the tokenizer” (a minimal sketch of the mechanics follows this list).
- Clarification: users wouldn’t hand‑draw questions; text would be rasterized automatically (like how screens already display text as pixels).
- Some see this as simply moving tokenization inside the vision encoder rather than eliminating it.
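To make the “rasterize text, feed patches” idea concrete, here is a minimal sketch: render a string to a grayscale image and cut it into fixed-size tiles, each of which a vision encoder would embed as one visual token. The image size, patch size, and font handling are illustrative assumptions, not details from the DeepSeek-OCR paper.

```python
from PIL import Image, ImageDraw
import numpy as np

def text_to_patches(text, width=512, height=64, patch=16):
    """Rasterize `text` and return one flattened tile per would-be visual token."""
    img = Image.new("L", (width, height), color=255)      # white canvas
    ImageDraw.Draw(img).text((4, 4), text, fill=0)        # default bitmap font
    arr = np.asarray(img, dtype=np.float32) / 255.0       # (height, width) in [0, 1]
    tiles = arr.reshape(height // patch, patch, width // patch, patch)
    tiles = tiles.transpose(0, 2, 1, 3).reshape(-1, patch * patch)
    return tiles                                          # (num_visual_tokens, patch * patch)

tiles = text_to_patches("Are pixels better inputs to LLMs than text?")
print(tiles.shape)   # (128, 256) with these illustrative settings
```

In a real system the tiles would pass through a learned encoder (and, per the paper’s framing, be compressed well below one tile per character); the sketch only shows where the tokenizer drops out of the pipeline.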
Tokenization & Compute Tradeoffs
- Broad agreement that current tokenizers are crude and lossy abstractions, but very efficient.
- Removing or radically changing tokenization tends to require much more compute and parameters for modest gains, which is a practical blocker at scale.
- Character- and byte-level models are cited as examples: more precise, but they sharply increase compute and shrink the usable context for a fixed token budget.
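The context-shrinkage point can be illustrated with back-of-the-envelope arithmetic; the ~4-characters-per-BPE-token average for English is a common rule of thumb, not a figure from the thread.

```python
chars_per_bpe_token = 4       # assumed English average for a BPE tokenizer
context_tokens = 8192         # a fixed sequence-length budget

bpe_coverage  = context_tokens * chars_per_bpe_token   # characters covered with BPE
byte_coverage = context_tokens * 1                     # characters covered byte-by-byte
print(bpe_coverage, byte_coverage)                     # 32768 vs 8192 characters

# Matching the BPE model's coverage at byte level needs ~4x the tokens; since
# self-attention cost grows roughly quadratically with sequence length, that is
# on the order of 16x the attention FLOPs for the same text.
print((bpe_coverage / byte_coverage) ** 2)             # 16.0
```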
Information Density & Compression
- DeepSeek-OCR and related “Glyph” work suggest visual-text tokens can pack more context per token than BPE text tokens, at some quality cost.
- Idea: learned visual encoders map patches into a richer, denser embedding space than a fixed lookup table of text tokens (sketched after this list).
- Several note this is less “pixels beat text” and more “this particular representation beats this particular tokenizer.”
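A minimal sketch of that contrast, assuming a toy linear projection in place of a real vision encoder: a text token’s embedding is one of a finite set of table rows, while a visual token’s embedding is computed from pixels and can vary continuously with the input. The dimensions and random weights are illustrative.

```python
import numpy as np

vocab_size, d_model, patch_dim = 50_000, 768, 256
rng = np.random.default_rng(0)

embedding_table  = rng.standard_normal((vocab_size, d_model), dtype=np.float32)  # text path: fixed lookup
patch_projection = rng.standard_normal((patch_dim, d_model), dtype=np.float32)   # stand-in for a learned encoder

def text_token_embedding(token_id):
    return embedding_table[token_id]        # one of 50,000 possible vectors

def visual_token_embedding(patch_pixels):   # patch_pixels: (256,) values in [0, 1]
    return patch_pixels @ patch_projection  # a continuum of possible vectors

print(text_token_embedding(42).shape)                                      # (768,)
print(visual_token_embedding(rng.random(patch_dim, dtype=np.float32)).shape)  # (768,)
```

The compression claim in the first bullet is then that far fewer of these continuous vectors are needed per page than BPE tokens, at the cost of lossy reconstruction of the exact text.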
Scripts, Semantics, and OCR
- Logographic scripts (e.g., Chinese characters) may make visual encodings more natural, since glyph shapes carry semantic relations that plain UTF-8 obscures.
- Some speculate OCR-style encoders may especially help languages without clear word boundaries.
- Others emphasize that bitwise precision (Unicode, domain names, code) still demands text-level handling.
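The logographic-script point is visible directly in the bytes; the characters below are arbitrary examples that share the water radical 氵, and their code points are shown alongside their UTF-8 encodings.

```python
# Semantically related characters share a visible component, but their UTF-8
# byte sequences reveal none of it beyond sitting in the same code-point range.
for ch in ["河", "湖", "海"]:    # river, lake, sea -- all carry the water radical
    print(ch, hex(ord(ch)), ch.encode("utf-8").hex())
# 河 0x6cb3 e6b2b3
# 湖 0x6e56 e6b996
# 海 0x6d77 e6b5b7
```

A glyph-level encoder sees the shared radical for free; a byte-level model has to learn the relation from co-occurrence alone.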
Human Reading & Multimodality
- Long subthread on how humans read: mostly linear but with saccades, skimming, and parallel “threads” of interpretation.
- Used as an analogy for why vision-based or multimodal “percels” (combined perceptual units) might be a more brain-like substrate than discrete text tokens.
Use Cases, Limits, and Skepticism
- Concerns:
  - Image inputs for code or binary data are likely problematic because of precision needs (see the sketch after this list).
  - OCR-trained encoders might not transfer cleanly to general reasoning.
- Others point to strong OCR performance and document understanding as evidence that pixel-based contexts can already rival text pipelines in practice.
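A small sketch of why the precision concern bites for code and identifiers: strings that render almost identically, and that an OCR-style decoder could plausibly confuse, are entirely different data. The example strings are hypothetical.

```python
import hashlib

a = "user = 'l1O0'"   # letter l, digit 1, letter O, digit 0
b = "user = '11O0'"   # digit 1 misread in place of the letter l

print(hashlib.sha256(a.encode()).hexdigest()[:16])
print(hashlib.sha256(b.encode()).hexdigest()[:16])
# One misread character yields a different program, hash, or URL, which is why
# the thread argues exact byte-level inputs still matter for these domains.
```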
Architecture Experiments & Humor
- Discussion ties into broader pushes to remove hand-engineered features and let large networks learn their own representations.
- Neologisms like “percels” and jokes about PowerPoint, Paint, printed pages, and interpretive dance highlight both interest and skepticism toward “pixels everywhere.”