How Does GPT-4o Encode Images?

Experiments on GPT‑4o Image Encoding

  • Commenters like the 7×7 colored‑shape grid test; similar experiments with 512×512 images found that an “85‑token” image can yield more text than 85 text tokens.
  • People confirm the article’s observation that models can extract more than 170 text tokens’ worth of information from a single image, and see this as a potential context‑window optimization.

How Images Might Be Tokenized

  • Several hypotheses: CNN pyramid with 13×13 feature tiles, ViT‑style encoders, or VQGAN/VQVAE compressing 512×512 into ~13×13 image tokens.
  • Others suggest a separate vision encoder that projects tiles into the same embedding space as text tokens (CLIP/LLaVA‑style).
  • Debate over whether an image’s “170 tokens” are real tokens or just an accounting approximation; most assume the image does become tokens so that attention can operate on it.

Image Generation and Multimodality

  • Some argue GPT‑4o must have native image tokens because its claimed image‑generation and fine control surpass what prompting a separate text‑to‑image model could do.
  • Others note current ChatGPT behavior still looks like it’s calling DALL·E; there’s disagreement and no definitive proof in the thread.

OCR Performance and Failure Modes

  • Multiple reports of excellent OCR on single pages but severe hallucinations on large or multi‑page images, likely due to downscaling/tiling limits.
  • Users respond by splitting PDFs into page images and feeding them separately; some are building pipelines around this.

Documentation, Resizing, and Tiling Concerns

  • Strong frustration at limited, vague documentation about image handling, especially resizing thresholds, tiling behavior, and how cross‑tile content is treated.
  • OpenAI’s stated low/high‑resolution modes (e.g., 1024 vs 2048 with tiling) help, but many practical questions remain unanswered.
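
The accounting behind the 85‑ and 170‑token figures can be sketched in a few lines. This is a minimal sketch, not confirmed behavior: it assumes the rule OpenAI's vision pricing notes have described for high‑detail images (fit inside 2048×2048, shrink the shortest side to at most 768, then charge 170 tokens per 512×512 tile plus a flat 85). The function name and the rounding choices are illustrative assumptions.

```python
import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate billed tokens for one image under the assumed tiling rule."""
    if detail == "low":
        return 85  # assumed flat cost, independent of image size

    # Step 1: shrink (never enlarge) to fit inside a 2048x2048 box.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: shrink (never enlarge) so the shortest side is at most 768.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale

    # Step 3: count the 512x512 tiles needed to cover the result.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

square = estimate_image_tokens(1024, 1024)   # 4 tiles -> 765 under this rule
tall = estimate_image_tokens(2048, 4096)     # 6 tiles -> 1105 under this rule
```

Nothing in this sketch says how content that straddles a tile boundary is handled, which is exactly the gap commenters complain about.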

Use of External OCR and Tesseract Debate

  • The article’s suggestion that Tesseract might be in the loop is widely doubted; commenters cite Tesseract’s poor accuracy on handwriting, distortion, and complex layouts.
  • Others say combining a rough OCR output with an LLM can improve overall OCR; one anecdote mentions an internal error referencing a Tesseract script.
  • Some expect any OCR component to be a modern, in‑house model if used at all.

Tokens, Embeddings, and Future Directions

  • Discussion clarifies: text tokens use a lookup table for input embeddings, but images can be mapped by any learned encoder.
  • Debate on whether image tokens share the same vocabulary as text, or are marked by “mode” tokens.
  • Several speculate that future systems may move from discrete tokens toward more flexible, variable‑length embeddings.
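
The difference between a lookup table for text and a learned encoder for images can be illustrated with a toy example. This is a minimal NumPy sketch in which random weights stand in for learned ones; the vocabulary size, embedding width, and 32‑pixel patch size are illustrative assumptions and say nothing about GPT‑4o's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 64        # shared embedding width (illustrative)
VOCAB_SIZE = 1000   # toy text vocabulary (illustrative)
PATCH = 32          # patch edge in pixels (illustrative)

# Text path: each embedding is simply a row selected by token id.
text_table = rng.normal(size=(VOCAB_SIZE, D_MODEL))

def embed_text(token_ids):
    return text_table[np.asarray(token_ids)]            # (n_tokens, D_MODEL)

# Image path: no vocabulary; every RGB patch is flattened and projected
# by a linear map (random here, learned in a real encoder).
patch_proj = rng.normal(size=(PATCH * PATCH * 3, D_MODEL))

def embed_image(image):
    h, w, c = image.shape
    gh, gw = h // PATCH, w // PATCH                     # patch grid size
    patches = (image[:gh * PATCH, :gw * PATCH]
               .reshape(gh, PATCH, gw, PATCH, c)
               .transpose(0, 2, 1, 3, 4)                # group pixels by patch
               .reshape(gh * gw, PATCH * PATCH * c))
    return patches @ patch_proj                         # (gh*gw, D_MODEL)

text_vecs = embed_text([5, 42, 7])
image_vecs = embed_image(rng.random((512, 512, 3)))

# Both kinds of vector have the same width, so they can share one sequence.
sequence = np.concatenate([image_vecs, text_vecs])
```

Because the image vectors never pass through a vocabulary, the question of whether they count as "tokens" is partly one of terminology: the attention layers see only a sequence of vectors.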

OCR Tools and Ecosystem

  • Many note that dedicated cloud OCR APIs outperform open‑source tools and current vision LLMs on reliability and structured extraction, especially for tables.
  • Open‑source alternatives mentioned as better than Tesseract include PaddleOCR, SuryaOCR, docTR, and some multimodal LLMs; experiences vary from “disappointing” to “remarkably good.”
  • There is a strong call for a modern, open‑source, non‑LLM OCR system, including for handwriting.

Hallucinations, Reliability, and Use Cases

  • Commenters stress that image LLMs can confidently hallucinate plausible but false content, making them risky for data journalism or high‑stakes workflows without verification.
  • Suggested stance: treat them like fallible human workers; design processes that detect or tolerate errors, or avoid them where that’s impossible.
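
One way to act on this stance is to run the same extraction twice (or with two different tools) and send any disagreement to a human. The following is a minimal sketch, assuming each pass returns a flat dict mapping field names to strings; the field names and the `needs_review` helper are hypothetical.

```python
def needs_review(pass_a: dict, pass_b: dict) -> dict:
    """Return the fields on which two independent extraction passes disagree."""
    def norm(value):
        # Ignore case and whitespace differences, which are rarely meaningful.
        return " ".join(str(value).split()).casefold()

    disputed = {}
    for field in sorted(set(pass_a) | set(pass_b)):
        a, b = pass_a.get(field), pass_b.get(field)
        # A field missing from either pass is also treated as a disagreement.
        if a is None or b is None or norm(a) != norm(b):
            disputed[field] = (a, b)
    return disputed

first = {"invoice_no": "A-1041", "total": "1,250.00", "date": "2024-06-07"}
second = {"invoice_no": "A-1041", "total": "1,260.00", "date": "2024-06-07 "}
flagged = needs_review(first, second)   # only "total" differs after normalizing
```

Agreement between two passes does not prove correctness, since both can hallucinate the same value; a check like this only reduces the share of errors that slip through unnoticed.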

Cost, Arbitrage, and Practical Workarounds

  • The article’s point that sending text as an image can be cheaper in tokens prompts talk of “arbitrage” by API wrappers, though the overhead of rendering the text into images might erase the benefit.
  • Some are curious whether response time scales with textual content inside the image, which could indicate hidden OCR steps, but this remains untested and unclear.
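
The untested latency question could be probed with a small harness like the one below. It is a sketch only: `send_image` is a hypothetical caller‑supplied function that submits one image and blocks until the reply arrives, and a least‑squares slope is just one simple way to look for a trend.

```python
import time

def time_calls(send_image, images):
    """Return the wall-clock latency in seconds of one call per image."""
    latencies = []
    for image in images:
        start = time.perf_counter()
        send_image(image)  # hypothetical: blocks until the model replies
        latencies.append(time.perf_counter() - start)
    return latencies

def least_squares_slope(xs, ys):
    """Slope of the best-fit line y = a + b*x (extra seconds per unit of x)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Example with made-up numbers: characters of text in each image vs. latency.
chars_in_image = [0, 100, 200]
latency_seconds = [1.0, 1.5, 2.0]
slope = least_squares_slope(chars_in_image, latency_seconds)
```

A slope near zero would be consistent with no per‑character cost, while a clearly positive slope would be worth investigating. Either reading would need repeated trials and a fixed output length, since network jitter and longer replies also affect latency.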