Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?

Pixels vs. Text as LLM Input

  • Core idea discussed: render all text to images and feed only visual tokens into models, effectively “killing the tokenizer.”
  • Clarification: users wouldn’t hand-draw questions; text would be rasterized automatically, much as screens already render text as pixels (sketched after this list).
  • Some see this as simply moving tokenization inside the vision encoder rather than eliminating it.
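As a rough illustration of the rasterize-and-patchify idea, the sketch below renders a string to a bitmap and slices it into flattened patch vectors, one per visual token. The image size, patch size, and toy patchify step are illustrative assumptions, not DeepSeek-OCR’s actual pipeline (which passes patches through a learned vision encoder):

```python
# Minimal sketch: rasterize text, then patchify into "visual tokens".
# Dimensions and patch size are illustrative, not from the paper.
import numpy as np
from PIL import Image, ImageDraw

def rasterize(text: str, width: int = 256, height: int = 64) -> np.ndarray:
    """Render text to a grayscale bitmap, as a screen would."""
    img = Image.new("L", (width, height), color=255)
    ImageDraw.Draw(img).text((4, 4), text, fill=0)
    return np.asarray(img, dtype=np.float32) / 255.0

def patchify(img: np.ndarray, patch: int = 16) -> np.ndarray:
    """Slice the bitmap into non-overlapping patches, one flat vector each."""
    h, w = img.shape
    img = img[: h - h % patch, : w - w % patch]      # drop ragged edges
    grid = img.reshape(h // patch, patch, w // patch, patch)
    return grid.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

tokens = patchify(rasterize("Are pixels better inputs to LLMs than text?"))
print(tokens.shape)  # (64, 256): 64 visual tokens, 256 raw dims each
```

In a real model a learned projection and encoder would map each patch vector into the LLM’s embedding space, which is where the “tokenization moved inside the vision encoder” point bites.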

Tokenization & Compute Tradeoffs

  • Broad agreement that current tokenizers are crude and lossy abstractions, but very efficient.
  • Removing or radically changing tokenization tends to require much more compute and parameters for modest gains, which is a practical blocker at scale.
  • Character/byte-level models are cited as examples: more precise, but they sharply increase compute and shrink usable context (see the sketch below).
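A quick way to see the context cost is to compare sequence lengths for the same string under a BPE vocabulary and under raw bytes. The snippet assumes the tiktoken library with its cl100k_base vocabulary purely as one concrete BPE; exact counts vary by tokenizer:

```python
# Byte-level vs. BPE sequence lengths for the same text.
# Assumes `pip install tiktoken`; cl100k_base is just one example BPE vocab.
import tiktoken

text = "Tokenization is a crude but efficient abstraction."
enc = tiktoken.get_encoding("cl100k_base")

n_bpe = len(enc.encode(text))        # roughly 9-10 BPE tokens
n_bytes = len(text.encode("utf-8"))  # 50 bytes: ~5x more positions

# Self-attention cost grows quadratically with sequence length, so a ~5x
# longer byte sequence costs ~25x more attention compute for the same text.
print(n_bpe, n_bytes)
```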

Information Density & Compression

  • DeepSeek-OCR and the related “Glyph” work suggest visual-text tokens can pack more context per token than BPE text tokens, at some quality cost (a rough budget comparison follows this list).
  • Idea: learned visual encoders map patches into a richer, denser embedding space than a fixed lookup table of text tokens.
  • Several note this is less “pixels beat text” and more “this particular representation beats this particular tokenizer.”
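The compression claim can be made concrete with a back-of-envelope budget. Every number below is an assumption chosen for illustration (page density, characters per BPE token, patch size, compressor factor), not a figure from the paper:

```python
# Back-of-envelope token budget: one rendered page vs. its BPE encoding.
chars_on_page = 3000                        # a dense page of English text
bpe_tokens = chars_on_page / 4              # ~4 chars per BPE token -> ~750

side, patch = 1024, 16                      # rendered page, ViT-style patches
raw_patches = (side // patch) ** 2          # 4096 raw patches
compressor = 16                             # assumed conv downsampling factor
vision_tokens = raw_patches // compressor   # 256 tokens reach the LLM

print(bpe_tokens, vision_tokens)            # 750.0 256 -> ~3x fewer positions
```

Whether those denser tokens preserve enough fidelity is exactly the quality tradeoff the thread debates.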

Scripts, Semantics, and OCR

  • Logographic scripts (e.g., Chinese characters) may make visual encodings more natural, since glyph shapes carry semantic relations that plain UTF-8 obscures.
  • Some speculate OCR-style encoders may especially help languages without clear word boundaries (e.g., Chinese, Japanese, Thai).
  • Others emphasize that bitwise precision (Unicode, domain names, code) still demands text-level handling (illustrated below).
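Both directions of this tradeoff fit in a few lines: glyph structure that bytes obscure, and byte differences that pixels obscure. 河 (river) and 海 (sea) share the water radical 氵, obvious as pixels but not directly recoverable from their UTF-8 bytes, while a homoglyph swap in a domain name renders near-identically yet fails exact comparison:

```python
# (1) Shared radical, opaque bytes: the water component in 河 and 海 is
#     visible as pixels but invisible in the raw UTF-8 encoding.
print("河".encode("utf-8").hex(), "海".encode("utf-8").hex())  # e6b2b3 e6b5b7

# (2) Homoglyph: second char of `mixed` is CYRILLIC SMALL LETTER A (U+0430),
#     so the two strings render near-identically but differ bitwise.
latin = "paypal.com"
mixed = "pаypal.com"
print(latin == mixed)  # False -- precision a pixel encoder risks losing
```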

Human Reading & Multimodality

  • Long subthread on how humans read: mostly linear but with saccades, skimming, and parallel “threads” of interpretation.
  • Used as an analogy for why vision-based or multimodal “percels” (combined perceptual units) might be a more brain-like substrate than discrete text tokens.

Use Cases, Limits, and Skepticism

  • Concerns:
    • Image inputs for code or binary data are likely problematic, given their exact byte-level precision needs.
    • OCR-trained encoders might not transfer cleanly to general reasoning.
  • Others point to strong OCR performance and document understanding as evidence that pixel-based contexts can already rival text pipelines in practice.

Architecture Experiments & Humor

  • Discussion ties into broader pushes to remove hand-engineered features and let large networks learn their own representations.
  • Neologisms like “percels” and jokes about PowerPoint, Paint, printed pages, and interpretive dance highlight both interest and skepticism toward “pixels everywhere.”