How Does GPT-4o Encode Images?
Experiments on GPT‑4o Image Encoding
- Commenters like the 7×7 colored‑shape grid test; similar experiments with 512×512 images found that an “85‑token” image can yield more text than 85 text tokens.
- Commenters confirm the article’s observation that a model can extract more than 170 text tokens’ worth of information from a single image, and see this as a potential context‑window optimization.
How Images Might Be Tokenized
- Several hypotheses: CNN pyramid with 13×13 feature tiles, ViT‑style encoders, or VQGAN/VQVAE compressing 512×512 into ~13×13 image tokens.
- Others suggest a separate vision encoder that projects tiles into the same embedding space as text tokens (CLIP/LLaVA‑style).
- Debate over whether the “170 tokens” charged per image are real tokens or an accounting approximation; most assume they do become tokens so that attention can operate over them.
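The 13×13 figure in these hypotheses comes from simple arithmetic: 13 × 13 = 169, the square grid closest to the 170‑token charge. A minimal sketch of that check follows; the function name is hypothetical, and nothing here says how GPT‑4o actually tiles an image.

```python
def closest_square_grid(token_budget: int) -> int:
    """Return the grid side n whose n*n is closest to token_budget."""
    n = round(token_budget ** 0.5)
    # Check the neighbours too, in case rounding the root picks the wrong side.
    candidates = [max(1, n - 1), n, n + 1]
    return min(candidates, key=lambda k: abs(k * k - token_budget))


side = closest_square_grid(170)
print(side, side * side)       # 13 169
print(round(512 / side, 1))    # 39.4 (pixels per grid cell on a 512-pixel side)
```

Because 512 is not divisible by 13, a plain non‑overlapping patch split would not give this grid exactly, which is one reason the thread considers overlapping or pyramid‑style schemes.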
Image Generation and Multimodality
- Some argue GPT‑4o must have native image tokens because its claimed image‑generation and fine control surpass what prompting a separate text‑to‑image model could do.
- Others note current ChatGPT behavior still looks like it’s calling DALL·E; there’s disagreement and no definitive proof in the thread.
OCR Performance and Failure Modes
- Multiple reports of excellent OCR on single pages but severe hallucinations on large or multi‑page images, likely due to downscaling/tiling limits.
- Users respond by splitting PDFs into page images and feeding them separately; some are building pipelines around this.
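The same split‑and‑feed idea can be applied to a single oversized scan. The sketch below only computes crop boxes for overlapping horizontal strips; `max_height` and `overlap` are illustrative assumptions, not documented limits, and the function name is hypothetical. The boxes use the (left, top, right, bottom) convention that image libraries such as Pillow accept for cropping.

```python
def strip_boxes(width, height, max_height=1024, overlap=64):
    """Return (left, top, right, bottom) boxes covering a tall image.

    Consecutive strips share `overlap` rows so that a line of text cut
    by one boundary appears whole in the neighbouring strip.
    """
    if overlap >= max_height:
        raise ValueError("overlap must be smaller than max_height")
    boxes = []
    top = 0
    while True:
        bottom = min(top + max_height, height)
        boxes.append((0, top, width, bottom))
        if bottom == height:
            break
        # Step back by the overlap before starting the next strip.
        top = bottom - overlap
    return boxes


for box in strip_boxes(800, 2500):
    print(box)
```

Each strip would then be sent as its own image, with the transcriptions joined afterwards and the overlapping lines de‑duplicated.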
Documentation, Resizing, and Tiling Concerns
- Strong frustration at limited, vague documentation about image handling, especially resizing thresholds, tiling behavior, and how cross‑tile content is treated.
- OpenAI’s stated low/high‑resolution modes (e.g., 1024 vs 2048 with tiling) help, but many practical questions remain unanswered.
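For orientation, here is a rough sketch of the token accounting as OpenAI's pricing documentation described it around the time of the thread: a flat 85 tokens in low‑detail mode, and in high‑detail mode 85 tokens plus 170 per 512‑pixel tile after resizing. The constants, the downscale‑only behaviour, and the rounding to whole pixels are assumptions in this sketch and may not match current behaviour.

```python
import math


def image_tokens(width, height, detail="high"):
    """Estimate the token charge for one image (assumed rules, see above)."""
    if detail == "low":
        return 85
    # Step 1: shrink to fit inside a 2048 x 2048 square (never enlarge).
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)
    # Step 2: shrink so the shortest side is at most 768 (assumed downscale only).
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    # Step 3: count the 512-pixel tiles needed to cover the result.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


print(image_tokens(512, 512, "low"))   # 85
print(image_tokens(512, 512))          # 255
print(image_tokens(1024, 1024))        # 765
print(image_tokens(2048, 4096))        # 1105
```

The sketch says nothing about the open questions raised here, such as how content that straddles a tile boundary is handled.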
Use of External OCR and Tesseract Debate
- The article’s suggestion that Tesseract might be in the loop is widely doubted; commenters cite Tesseract’s poor accuracy on handwriting, distortion, and complex layouts.
- Others say combining a rough OCR output with an LLM can improve overall OCR; one anecdote mentions an internal error referencing a Tesseract script.
- Some expect any OCR component to be a modern, in‑house model if used at all.
Tokens, Embeddings, and Future Directions
- Discussion clarifies: text tokens use a lookup table for input embeddings, but images can be mapped by any learned encoder.
- Debate on whether image tokens share the same vocabulary as text, or are marked by “mode” tokens.
- Several speculate that future systems may move from discrete tokens toward more flexible, variable‑length embeddings.
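The lookup‑table versus learned‑encoder distinction can be shown with a toy example. This is a minimal sketch with random weights and tiny, made‑up dimensions; a single linear projection stands in for the vision encoder, which is an assumption, and all names are hypothetical.

```python
import random

random.seed(0)
D_MODEL = 4     # toy embedding width
VOCAB = 10      # toy text vocabulary size
PATCH_DIM = 6   # toy length of a flattened image patch

# Text path: one stored row per token id, selected by index.
embedding_table = [[random.gauss(0, 1) for _ in range(D_MODEL)]
                   for _ in range(VOCAB)]

# Image path: a learned map (random here) from patch values to D_MODEL.
projection = [[random.gauss(0, 1) for _ in range(PATCH_DIM)]
              for _ in range(D_MODEL)]


def embed_text(token_ids):
    """Look up one embedding row per discrete token id."""
    return [embedding_table[t] for t in token_ids]


def embed_patches(patches):
    """Project each continuous patch vector into the embedding space."""
    return [[sum(w * x for w, x in zip(row, patch)) for row in projection]
            for patch in patches]


text_vecs = embed_text([3, 1, 4])
image_vecs = embed_patches([[0.5] * PATCH_DIM, [1.0] * PATCH_DIM])
sequence = image_vecs + text_vecs        # one mixed input sequence
print(len(sequence), len(sequence[0]))   # 5 4
```

In this picture the image vectors never need a vocabulary entry: the model only sees vectors of the same width, which is why the shared‑vocabulary question stays open.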
OCR Tools and Ecosystem
- Many note that dedicated cloud OCR APIs outperform open‑source tools and current vision LLMs on reliability and structured extraction, especially for tables.
- Open‑source alternatives mentioned as better than Tesseract include PaddleOCR, SuryaOCR, Doctr, and some multimodal LLMs; experiences vary from “disappointing” to “remarkably good.”
- There is a strong call for a modern, open‑source, non‑LLM OCR system, including for handwriting.
Hallucinations, Reliability, and Use Cases
- Commenters stress that image LLMs can confidently hallucinate plausible but false content, making them risky for data journalism or high‑stakes workflows without verification.
- Suggested stance: treat them like fallible human workers; design processes that detect or tolerate errors, or avoid them where that’s impossible.
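One way to make a process detect errors is to extract the same fields twice, for example with two models or two prompts, and send any disagreement to a human. The sketch below assumes each pass returns a dictionary of field names to values; the function name and the sample data are hypothetical.

```python
def flag_disagreements(first, second):
    """Return the fields on which two extraction passes differ.

    A field missing from one pass counts as a disagreement.
    """
    flagged = {}
    for key in sorted(set(first) | set(second)):
        a, b = first.get(key), second.get(key)
        if a != b:
            flagged[key] = (a, b)
    return flagged


pass_a = {"invoice_no": "1042", "total": "318.50", "date": "2024-06-07"}
pass_b = {"invoice_no": "1042", "total": "378.50"}
print(flag_disagreements(pass_a, pass_b))
# {'date': ('2024-06-07', None), 'total': ('318.50', '378.50')}
```

Agreement between the two passes does not prove correctness, since both can make the same mistake, so a check like this reduces risk rather than removing it.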
Cost, Arbitrage, and Practical Workarounds
- The article’s point that sending text as an image can be cheaper in tokens prompts talk of “arbitrage” by API wrappers, though the added overhead of rendering text into images might erase the benefit.
- Some are curious whether response time scales with textual content inside the image, which could indicate hidden OCR steps, but this remains untested and unclear.
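The untested timing question could be probed with a small harness like the one below. It is a sketch only: `send` is a hypothetical callable that would submit one image to the API, the payloads are placeholders, and the clock is injectable so the harness can be checked without network access.

```python
import time


def time_requests(send, payloads, clock=time.perf_counter):
    """Time one call per payload.

    `payloads` is a list of (text_length, payload) pairs, where
    text_length is the number of characters rendered in the image.
    Returns a list of (text_length, elapsed_seconds) pairs.
    """
    timings = []
    for text_length, payload in payloads:
        start = clock()
        send(payload)
        timings.append((text_length, clock() - start))
    return timings


# Dry run with a stand-in for the real request.
def fake_send(payload):
    return len(payload)


results = time_requests(fake_send, [(10, b"small"), (500, b"large")])
print([length for length, _ in results])   # [10, 500]
```

A real experiment would need many repetitions per text length and images of identical pixel size, since network jitter and server load can easily swamp any effect of the text itself.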