2024-07-10

Vision language models are blind

Overall reaction to the paper

Many find the failure cases “shocking” given claims that VLMs can “understand” images, guide the blind, or tutor children.
Others argue the results are not embarrassing for the models but for humans who overinterpret or overmarket them.
Several commenters see the title “Vision language models are blind” as hyperbolic or clickbait.

Observed strengths vs weaknesses

Strong performance reported on:
- OCR and handwriting recognition (including non-Latin scripts).
- Real‑world photos: navigation help for low‑vision users, identifying hardware issues, gardening advice.
Very poor or inconsistent performance on:
- Counting line intersections and judging whether circles overlap or touch.
- Counting shapes in logos and grids.
- Following paths in mazes, judging spatial relationships, identifying circled letters, reading calendar-style highlights.
Some small open models and specific prompting styles appear to do better on selected examples, suggesting sensitivity to prompts and setups.

Technical explanations discussed

Images are heavily compressed into tokens/embeddings (e.g., patches, CLIP-like encoders), losing fine-grained spatial detail.
Embeddings are not trained for faithful reconstruction; visually different images can map to similar vectors.
VLMs seem optimized for tasks with abundant training data (captions, OCR) and are weak at low-level geometry, counting, and precise spatial reasoning.
Some note that models can redraw images reasonably but still fail logical questions about them, implying a reasoning gap, not just “bad eyes.”
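
The compression argument above can be illustrated with a toy sketch. It is not how any real VLM encoder works: it assumes a deliberately crude "encoder" that reduces each patch to its mean intensity, and the names `patch_means`, `blank`, `crossing`, and `parallel` are hypothetical. The point is only that two visibly different drawings (crossing lines versus parallel lines) can collapse to the same coarse vector.

```python
def patch_means(image, patch):
    """Split a square grid into patch x patch tiles and return each tile's mean.

    Mean pooling is an assumed, crude stand-in for a learned image encoder.
    """
    n = len(image)
    means = []
    for r in range(0, n, patch):
        for c in range(0, n, patch):
            tile = [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            means.append(sum(tile) / len(tile))
    return means


def blank(n):
    """Return an n x n grid of zeros (an empty image)."""
    return [[0] * n for _ in range(n)]


# Image A: two diagonals forming an "X" in the top-left 4x4 region.
crossing = blank(8)
for i in range(4):
    crossing[i][i] = 1
    crossing[i][3 - i] = 1

# Image B: two parallel vertical lines in the same region, same pixel count.
parallel = blank(8)
for i in range(4):
    parallel[i][0] = 1
    parallel[i][3] = 1

print(patch_means(crossing, 4))  # [0.5, 0.0, 0.0, 0.0]
print(patch_means(parallel, 4))  # [0.5, 0.0, 0.0, 0.0]
print(patch_means(crossing, 1) == patch_means(parallel, 1))  # False
```

At a patch size of 4 the two images are indistinguishable, while at per-pixel resolution they differ. Learned encoders keep far more information than a mean, so this only shows the direction of the loss, not its size.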

Evaluation, training, and generalization

Commenters debate whether these synthetic tasks are “toy tricks” or important evidence of non-general reasoning.
Some say failures could be fixed by targeted synthetic training; others warn that patching benchmarks doesn’t address underlying generalization.
Counting and spatial relations are known weak spots; auxiliary methods (segmentation, object detection, “Set-of-Mark” prompting) can help.
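
One way to read the auxiliary-method suggestion is to do the counting outside the model with a classical step and pass only the result on. The sketch below is an assumption-laden illustration, not something proposed in the thread: it treats the image as a small binary grid and counts shapes by 4-connected flood fill, a simple stand-in for a real segmentation or object-detection stage. The names `count_shapes` and `toy_image` are hypothetical.

```python
from collections import deque


def count_shapes(grid):
    """Count 4-connected groups of filled cells (1s) in a binary grid."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Found an unvisited filled cell: it starts a new shape.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


# Toy image: a 2x2 block, a vertical bar, a horizontal bar, and a single cell.
toy_image = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],
]

print(count_shapes(toy_image))  # 4
```

This only works when shapes do not touch or overlap, which are exactly the cases the discussion flags as hard, so it illustrates the idea rather than solving the problem.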

Implications and ethics

Some voice concern that marketing for low-vision assistance and “general” vision understanding overstates reliability, and call for stronger safeguards and clearer statements of limitations.
Others counter that despite imperfections, current systems are already practically useful and often better than prior tools.

Related topics