2024-06-07

How Does GPT-4o Encode Images?

Experiments on GPT‑4o Image Encoding

Commenters like the 7×7 colored‑shape grid test; similar experiments with 512×512 images found that an “85‑token” image can yield more text than 85 text tokens.
People confirm the article’s observation that models can extract >170 text tokens worth of information from a single image and see this as a potential context‑window optimization.

How Images Might Be Tokenized

Several hypotheses: CNN pyramid with 13×13 feature tiles, ViT‑style encoders, or VQGAN/VQVAE compressing 512×512 into ~13×13 image tokens.
Others suggest a separate vision encoder that projects tiles into the same embedding space as text tokens (CLIP/LLava‑style).
Debate over whether image “170 tokens” is real tokens vs an accounting approximation; most assume they do become tokens so attention can work.

Image Generation and Multimodality

Some argue GPT‑4o must have native image tokens because its claimed image‑generation and fine control surpass what prompting a separate text‑to‑image model could do.
Others note current ChatGPT behavior still looks like it’s calling DALL·E; there’s disagreement and no definitive proof in the thread.

OCR Performance and Failure Modes

Multiple reports of excellent OCR on single pages but severe hallucinations on large or multi‑page images, likely due to downscaling/tiling limits.
Users respond by splitting PDFs into page images and feeding them separately; some are building pipelines around this.

Documentation, Resizing, and Tiling Concerns

Strong frustration at limited, vague documentation about image handling, especially resizing thresholds, tiling behavior, and how cross‑tile content is treated.
OpenAI’s stated low/high‑resolution modes (e.g., 1024 vs 2048 with tiling) help, but many practical questions remain “unclear.”

Use of External OCR and Tesseract Debate

One claim in the article that Tesseract might be in the loop is widely doubted; commenters cite Tesseract’s poor accuracy on handwriting, distortion, and complex layouts.
Others say combining a rough OCR output with an LLM can improve overall OCR; one anecdote mentions an internal error referencing a Tesseract script.
Some expect any OCR component to be a modern, in‑house model if used at all.

Tokens, Embeddings, and Future Directions

Discussion clarifies: text tokens use a lookup table for input embeddings, but images can be mapped by any learned encoder.
Debate on whether image tokens share the same vocabulary as text, or are marked by “mode” tokens.
Several speculate that future systems may move from discrete tokens toward more flexible, variable‑length embeddings.

OCR Tools and Ecosystem

Many note that dedicated cloud OCR APIs outperform open‑source tools and current vision LLMs on reliability and structured extraction, especially for tables.
Open‑source alternatives mentioned as better than Tesseract include PaddleOCR, SuryaOCR, Doctr, and some multimodal LLMs; experiences vary from “disappointing” to “remarkably good.”
There is a strong call for a modern, open‑source, non‑LLM OCR system, including for handwriting.

Hallucinations, Reliability, and Use Cases

Commenters stress that image LLMs can confidently hallucinate plausible but false content, making them risky for data journalism or high‑stakes workflows without verification.
Suggested stance: treat them like fallible human workers; design processes that detect or tolerate errors, or avoid them where that’s impossible.

Cost, Arbitrage, and Practical Workarounds

The article’s point that sending text as an image can be cheaper in tokens prompts talk of “arbitrage” by API wrappers, though added image‑generation overhead might erase benefits.
Some are curious whether response time scales with textual content inside the image, which could indicate hidden OCR steps, but this remains untested and unclear.

Related topics