Vision language models are blind

Overall reaction to the paper

  • Many commenters find the failure cases “shocking” given claims that VLMs can “understand” images, guide the blind, or tutor children.
  • Others argue the results are not embarrassing for the models but for humans who overinterpret or overmarket them.
  • Several commenters see the title “Vision language models are blind” as hyperbolic or clickbait.

Observed strengths vs weaknesses

  • Strong performance reported on:
    • OCR and handwriting recognition (including non-Latin scripts).
    • Real‑world photos: navigation help for low‑vision users, identifying hardware issues, gardening advice.
  • Very poor or inconsistent performance on:
    • Counting line intersections and judging whether circles overlap or touch.
    • Counting shapes in logos and grids.
    • Following paths in mazes, judging spatial relationships, reading circled letters, and interpreting calendar-style highlights.
  • Some small open models and specific prompting styles appear to do better on selected examples, suggesting sensitivity to prompts and setups.
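Part of why these failures stand out is that the ground truth is trivial to compute. As a rough sketch (not the paper's code; the function names and the sample polylines are made up for illustration), counting the crossings between two piecewise-linear lines takes only a standard orientation test:

```python
from itertools import product

def orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1 left turn, -1 right turn, 0 collinear."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)

def segments_cross(a, b, c, d):
    """True if segments ab and cd cross at a single interior point.

    Assumption: touching at an endpoint or overlapping collinear
    segments is not counted as a crossing.
    """
    return (orientation(a, b, c) * orientation(a, b, d) < 0
            and orientation(c, d, a) * orientation(c, d, b) < 0)

def count_intersections(poly1, poly2):
    """Count proper crossings between two polylines given as lists of (x, y) points."""
    segs1 = list(zip(poly1, poly1[1:]))
    segs2 = list(zip(poly2, poly2[1:]))
    return sum(segments_cross(a, b, c, d)
               for (a, b), (c, d) in product(segs1, segs2))

# Hypothetical example: an inverted "V" and a "V" that cross twice.
red = [(0, 0), (2, 2), (4, 0)]
blue = [(0, 2), (2, 0), (4, 2)]
print(count_intersections(red, blue))  # 2
```

A checker like this could score model answers on generated images. It assumes the lines are in general position, so a crossing that lands exactly on a polyline vertex would be missed.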

Technical explanations discussed

  • Images are heavily compressed into tokens/embeddings (e.g., via patching and CLIP-like encoders), which loses fine-grained spatial detail.
  • Embeddings are not trained for faithful reconstruction; visually different images can map to similar vectors.
  • VLMs seem optimized for tasks with abundant training data (captions, OCR) and are weak at low-level geometry, counting, and precise spatial reasoning.
  • Some note that models can redraw images reasonably but still fail logical questions about them, implying a reasoning gap, not just “bad eyes.”
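The compression argument can be illustrated with a deliberately crude toy. The sketch below assumes a stand-in "encoder" that reduces each patch to its mean brightness. Real VLM encoders use learned projections and attention, not mean pooling, so this only shows how a many-to-one mapping can erase exactly the detail these tasks depend on. All names here are hypothetical.

```python
def patch_means(image, patch):
    """Split a square grayscale image (list of rows) into patch x patch tiles
    and return the mean pixel value of each tile, in row-major order.

    Assumption: the image side length is divisible by `patch`.
    """
    n = len(image)
    means = []
    for top in range(0, n, patch):
        for left in range(0, n, patch):
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            means.append(sum(tile) / len(tile))
    return means

def blank(n):
    """An n x n all-black image."""
    return [[0] * n for _ in range(n)]

# Two strokes forming an "X" inside the top-left 4x4 patch.
crossing = blank(8)
for i in range(4):
    crossing[i][i] = 1
    crossing[i][3 - i] = 1

# Two parallel vertical strokes in the same patch, with the same pixel count.
parallel = blank(8)
for i in range(4):
    parallel[i][0] = 1
    parallel[i][3] = 1

print(crossing != parallel)                                # True: the images differ
print(patch_means(crossing, 4) == patch_means(parallel, 4))  # True: the "embeddings" match
```

Under this toy encoder, "do the lines cross?" cannot be answered from the embedding at all. How much of this effect carries over to real encoders is an empirical question the thread does not settle.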

Evaluation, training, and generalization

  • Debate over whether these synthetic tasks are “toy tricks” or important evidence of non-general reasoning.
  • Some say failures could be fixed by targeted synthetic training; others warn that patching benchmarks doesn’t address underlying generalization.
  • Counting and spatial relations are known weak spots; auxiliary methods (segmentation, object detection, Set-of-Mark prompting) can help.
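One way an auxiliary detector might help with counting is to take the count from its boxes instead of asking the VLM directly. The sketch below assumes a hypothetical detector that returns `(box, score)` pairs, with boxes as `(x1, y1, x2, y2)`. It applies greedy non-maximum suppression so that duplicate boxes on the same object are counted once. The thresholds are illustrative, not tuned values.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def count_objects(detections, min_score=0.5, iou_threshold=0.5):
    """Count distinct objects from (box, score) detections.

    Greedy non-maximum suppression: visit boxes from highest to lowest
    score, drop low-confidence ones, and keep a box only if it does not
    overlap an already-kept box by iou_threshold or more.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if score < min_score:
            break  # remaining boxes have even lower scores
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return len(kept)

# Hypothetical detector output: two boxes on one object, one clean hit, one weak hit.
detections = [
    ((0, 0, 10, 10), 0.9),
    ((1, 1, 11, 11), 0.8),    # near-duplicate of the first box
    ((20, 20, 30, 30), 0.7),
    ((40, 40, 50, 50), 0.2),  # below min_score
]
print(count_objects(detections))  # 2
```

This would likely undercount when objects genuinely overlap, as in some of the circle examples, so the IoU threshold would need care for such images.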

Implications and ethics

  • Concern that marketing for low-vision assistance and “general” vision understanding overstates reliability; calls for stronger safeguards and clearer limitations.
  • Others counter that despite imperfections, current systems are already practically useful and often better than prior tools.