Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?
Pixels vs. Text as LLM Input
- Core idea discussed: render all text to images and feed only visual tokens into models, effectively “killing the tokenizer” (a minimal sketch of the mechanics follows this list).
- Clarification: users wouldn’t hand‑draw questions; text would be rasterized automatically (like how screens already display text as pixels).
- Some see this as simply moving tokenization inside the vision encoder rather than eliminating it.
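To make the “rasterize text, feed patches” idea concrete, here is a minimal sketch: render a string to a grayscale image and cut it into fixed-size tiles, each of which a vision encoder would embed as one visual token. The image size, patch size, and font handling are illustrative assumptions, not details from the DeepSeek-OCR paper.

```python
from PIL import Image, ImageDraw
import numpy as np

def text_to_patches(text, width=512, height=64, patch=16):
    """Rasterize `text` and return one flattened tile per would-be visual token."""
    img = Image.new("L", (width, height), color=255)      # white canvas
    ImageDraw.Draw(img).text((4, 4), text, fill=0)        # default bitmap font
    arr = np.asarray(img, dtype=np.float32) / 255.0       # (height, width) in [0, 1]
    tiles = arr.reshape(height // patch, patch, width // patch, patch)
    tiles = tiles.transpose(0, 2, 1, 3).reshape(-1, patch * patch)
    return tiles                                          # (num_visual_tokens, patch * patch)

tiles = text_to_patches("Are pixels better inputs to LLMs than text?")
print(tiles.shape)   # (128, 256) with these illustrative settings
```

In a real system the tiles would pass through a learned encoder (and, per the paper’s framing, be compressed well below one tile per character); the sketch only shows where the tokenizer drops out of the pipeline.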
Tokenization & Compute Tradeoffs
- Broad agreement that current tokenizers are crude and lossy abstractions, but very efficient.
- Removing or radically changing tokenization tends to require much more compute and parameters for modest gains, which is a practical blocker at scale.
- Character- and byte-level models are cited as examples: more precise, but they sharply increase compute and shrink the usable context for a fixed token budget.
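The context-shrinkage point can be illustrated with back-of-the-envelope arithmetic; the ~4-characters-per-BPE-token average for English is a common rule of thumb, not a figure from the thread.

```python
chars_per_bpe_token = 4       # assumed English average for a BPE tokenizer
context_tokens = 8192         # a fixed sequence-length budget

bpe_coverage  = context_tokens * chars_per_bpe_token   # characters covered with BPE
byte_coverage = context_tokens * 1                     # characters covered byte-by-byte
print(bpe_coverage, byte_coverage)                     # 32768 vs 8192 characters

# Matching the BPE model's coverage at byte level needs ~4x the tokens; since
# self-attention cost grows roughly quadratically with sequence length, that is
# on the order of 16x the attention FLOPs for the same text.
print((bpe_coverage / byte_coverage) ** 2)             # 16.0
```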
Information Density & Compression
- DeepSeek-OCR and related “Glyph” work suggest visual-text tokens can pack more context per token than BPE text tokens, at some quality cost.
- Idea: learned visual encoders map patches into a richer, denser embedding space than a fixed lookup table of text tokens (sketched after this list).
- Several note this is less “pixels beat text” and more “this particular representation beats this particular tokenizer.”
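A minimal sketch of that contrast, assuming a toy linear projection in place of a real vision encoder: a text token’s embedding is one of a finite set of table rows, while a visual token’s embedding is computed from pixels and can vary continuously with the input. The dimensions and random weights are illustrative.

```python
import numpy as np

vocab_size, d_model, patch_dim = 50_000, 768, 256
rng = np.random.default_rng(0)

embedding_table  = rng.standard_normal((vocab_size, d_model), dtype=np.float32)  # text path: fixed lookup
patch_projection = rng.standard_normal((patch_dim, d_model), dtype=np.float32)   # stand-in for a learned encoder

def text_token_embedding(token_id):
    return embedding_table[token_id]        # one of 50,000 possible vectors

def visual_token_embedding(patch_pixels):   # patch_pixels: (256,) values in [0, 1]
    return patch_pixels @ patch_projection  # a continuum of possible vectors

print(text_token_embedding(42).shape)                                      # (768,)
print(visual_token_embedding(rng.random(patch_dim, dtype=np.float32)).shape)  # (768,)
```

The compression claim in the first bullet is then that far fewer of these continuous vectors are needed per page than BPE tokens, at the cost of lossy reconstruction of the exact text.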
Scripts, Semantics, and OCR
- Logographic scripts (e.g., Chinese characters) may make visual encodings more natural, since glyph shapes carry semantic relations that plain UTF-8 obscures.
- Some speculate OCR-style encoders may especially help languages without clear word boundaries.
- Others emphasize that bitwise precision (Unicode, domain names, code) still demands text-level handling.
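The logographic-script point is visible directly in the bytes; the characters below are arbitrary examples that share the water radical 氵, and their code points are shown alongside their UTF-8 encodings.

```python
# Semantically related characters share a visible component, but their UTF-8
# byte sequences reveal none of it beyond sitting in the same code-point range.
for ch in ["河", "湖", "海"]:    # river, lake, sea -- all carry the water radical
    print(ch, hex(ord(ch)), ch.encode("utf-8").hex())
# 河 0x6cb3 e6b2b3
# 湖 0x6e56 e6b996
# 海 0x6d77 e6b5b7
```

A glyph-level encoder sees the shared radical for free; a byte-level model has to learn the relation from co-occurrence alone.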
Human Reading & Multimodality
- Long subthread on how humans read: mostly linear but with saccades, skimming, and parallel “threads” of interpretation.
- Used as an analogy for why vision-based or multimodal “percels” (combined perceptual units) might be a more brain-like substrate than discrete text tokens.
Use Cases, Limits, and Skepticism
- Concerns:
  - Image inputs for code or binary data are likely problematic because of precision needs (see the sketch after this list).
  - OCR-trained encoders might not transfer cleanly to general reasoning.
- Others point to strong OCR performance and document understanding as evidence that pixel-based contexts can already rival text pipelines in practice.
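A small sketch of why the precision concern bites for code and identifiers: strings that render almost identically, and that an OCR-style decoder could plausibly confuse, are entirely different data. The example strings are hypothetical.

```python
import hashlib

a = "user = 'l1O0'"   # letter l, digit 1, letter O, digit 0
b = "user = '11O0'"   # digit 1 misread in place of the letter l

print(hashlib.sha256(a.encode()).hexdigest()[:16])
print(hashlib.sha256(b.encode()).hexdigest()[:16])
# One misread character yields a different program, hash, or URL, which is why
# the thread argues exact byte-level inputs still matter for these domains.
```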
Architecture Experiments & Humor
- Discussion ties into broader pushes to remove hand-engineered features and let large networks learn their own representations.
- Neologisms like “percels” and jokes about PowerPoint, Paint, printed pages, and interpretive dance highlight both interest and skepticism toward “pixels everywhere.”