Vision language models are blind
Overall reaction to the paper
- Many find the failure cases “shocking” given claims that VLMs can “understand” images, guide the blind, or tutor children.
- Others argue the results are not embarrassing for the models but for humans who overinterpret or overmarket them.
- Several commenters see the title “Vision language models are blind” as hyperbolic or clickbait.
Observed strengths vs weaknesses
- Strong performance reported on:
  - OCR and handwriting recognition (including non-Latin scripts).
  - Real‑world photos: navigation help for low‑vision users, identifying hardware issues, gardening advice.
- Very poor or inconsistent performance on:
  - Counting line intersections and judging whether circles overlap or touch.
  - Counting shapes in logos and grids.
  - Following paths in mazes, judging spatial relationships, reading circled letters, and interpreting calendar-style highlights.
- Some small open models and specific prompting styles appear to do better on selected examples, suggesting that results are sensitive to prompt wording and evaluation setup.
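
Part of why the failures stand out is that tasks like counting line intersections have exact answers that a short program can compute. The following is a minimal sketch of that idea, not the paper's evaluation code: it assumes each line is a polyline given as a list of (x, y) points, and the names `segments_cross` and `count_crossings` are illustrative. It counts only proper crossings and ignores segments that merely touch at an endpoint or overlap collinearly.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _orient(a: Point, b: Point, c: Point) -> float:
    """Signed area test: > 0 if c is left of a->b, < 0 if right, 0 if collinear."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_cross(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True if segment p1-p2 properly crosses segment q1-q2.

    Assumption: touching endpoints and collinear overlaps are not counted.
    """
    d1 = _orient(q1, q2, p1)
    d2 = _orient(q1, q2, p2)
    d3 = _orient(p1, p2, q1)
    d4 = _orient(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def count_crossings(a: List[Point], b: List[Point]) -> int:
    """Count proper crossings between every segment of polyline a and of polyline b."""
    count = 0
    for i in range(len(a) - 1):
        for j in range(len(b) - 1):
            if segments_cross(a[i], a[i + 1], b[j], b[j + 1]):
                count += 1
    return count


if __name__ == "__main__":
    tent = [(0, 0), (1, 2), (2, 0)]   # an inverted "V"
    bar = [(0, 1), (2, 1)]            # a horizontal line through it
    print(count_crossings(tent, bar))
```

A check like this only supplies ground truth for synthetic images whose geometry is already known; it says nothing about how a model should extract that geometry from pixels.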
Technical explanations discussed
- Images are heavily compressed into tokens/embeddings (e.g., patches, CLIP-like encoders), losing fine-grained spatial detail.
- Embeddings are not trained for faithful reconstruction; visually different images can map to similar vectors.
- VLMs seem optimized for tasks with abundant training data (captions, OCR) and weak at low-level geometry, counting, and precise spatial reasoning.
- Some note that models can redraw images reasonably but still fail logical questions about them, implying a reasoning gap, not just “bad eyes.”
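
The claim that visually different images can map to similar vectors can be illustrated with a deliberately crude stand-in for an image encoder. The sketch below assumes mean pooling over non-overlapping square patches. Real CLIP-like encoders use learned projections rather than plain averaging, so this is a toy model of lossy compression and not a description of any actual VLM. The function name `patch_means` is hypothetical.

```python
from typing import List

Image = List[List[float]]


def patch_means(img: Image, patch: int) -> List[float]:
    """Summarize each non-overlapping patch-by-patch block by its mean intensity.

    Assumption: image height and width are both divisible by `patch`.
    """
    height, width = len(img), len(img[0])
    features = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            total = sum(
                img[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            )
            features.append(total / (patch * patch))
    return features


# Two different 4x4 images: the strokes inside each 2x2 patch slant opposite ways.
img_a = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
img_b = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]

if __name__ == "__main__":
    print(img_a != img_b)                                   # the pixels differ
    print(patch_means(img_a, 2) == patch_means(img_b, 2))   # the summaries do not
```

Under this toy encoder, the direction of a stroke inside a patch is lost entirely, which is the kind of fine-grained spatial detail that tasks such as intersection counting depend on.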
Evaluation, training, and generalization
- Debate over whether these synthetic tasks are “toy tricks” or important evidence of non-general reasoning.
- Some say failures could be fixed by targeted synthetic training; others warn that patching benchmarks doesn’t address underlying generalization.
- Counting and spatial relations are known weak spots; auxiliary methods (segmentation, object detection, “set of marks”) can help.
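
The auxiliary methods mentioned above share one idea: let a separate step isolate and label the objects, so that the model does not have to count from raw pixels. The sketch below is a rough illustration under strong assumptions. It uses a flood fill over a binary mask in place of a learned segmentation or detection model, and assigns each region a numeric mark loosely in the spirit of mark-based prompting. The function name `label_regions` and the mask format are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple


def label_regions(mask: List[List[int]]) -> Dict[int, Tuple[int, int]]:
    """Assign marks 1, 2, 3, ... to 4-connected regions of nonzero pixels.

    Returns a mapping from mark to the (row, col) of the first pixel found
    in that region when scanning row by row.
    """
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    marks: Dict[int, Tuple[int, int]] = {}
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            mark = len(marks) + 1
            marks[mark] = (r, c)
            seen[r][c] = True
            queue = deque([(r, c)])
            # Breadth-first flood fill over up/down/left/right neighbors.
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (
                        0 <= ny < height
                        and 0 <= nx < width
                        and mask[ny][nx]
                        and not seen[ny][nx]
                    ):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return marks


if __name__ == "__main__":
    mask = [
        [1, 1, 0, 0],
        [0, 0, 0, 1],
        [1, 0, 0, 1],
    ]
    regions = label_regions(mask)
    print(len(regions), regions)
```

With the regions labeled, the count is simply the number of marks, and a question about a specific object can refer to it by its mark instead of by a description of its position.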
Implications and ethics
- Concern that marketing for low-vision assistance and “general” vision understanding overstates reliability; calls for stronger safeguards and clearer limitations.
- Others counter that despite imperfections, current systems are already practically useful and often better than prior tools.