The last six months in LLMs, illustrated by pelicans on bicycles

Purpose and limits of the “pelican on a bicycle” test

  • Thread agrees this is an intentionally inappropriate task for text-only LLMs: they must write SVG code for a novel scene with no visual feedback (a minimal illustration follows this list).
  • Defenders say that’s the point: it stress-tests following a spec, compositionality, and abstract visualization, a bit like LOGO or CAD instructions.
  • Critics argue it’s a poor proxy for real engineering or design work, which depends on tacit knowledge, real-world constraints, and nuanced communication that aren’t captured in online training data.
  • Many see it primarily as a humorous, hype-deflating benchmark rather than a serious metric.
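
As a rough illustration of what the task asks for, here is a minimal sketch (illustrative, not taken from the thread) of the kind of SVG a model must emit blind, choosing every coordinate without ever seeing the render:

    # Minimal sketch (illustrative, not from the thread): the kind of SVG an
    # LLM must write sight-unseen; every coordinate is chosen with no visual
    # feedback on the result.
    svg = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="120">
      <circle cx="50" cy="90" r="25" fill="none" stroke="black"/>   <!-- rear wheel -->
      <circle cx="150" cy="90" r="25" fill="none" stroke="black"/>  <!-- front wheel -->
      <path d="M50 90 L95 55 L150 90 M95 55 L110 40" stroke="black" fill="none"/>  <!-- frame -->
      <ellipse cx="95" cy="35" rx="18" ry="12" fill="white" stroke="black"/>  <!-- pelican body -->
      <path d="M110 30 L135 33 L110 38 Z" fill="orange"/>  <!-- beak -->
    </svg>"""
    with open("pelican.svg", "w") as f:
        f.write(svg)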

Quality, cost, and when to use LLMs

  • One camp: the outputs show LLMs are “all terrible” for creative/technical work and you should hire professionals.
  • Another: LLMs are “go-karts of the mind”—cheap, low-end tools that are “good enough” for many tasks where a Porsche-quality result isn’t needed.
  • Practical suggestions: for vector art, use image models (Midjourney, etc.) plus auto-vectorization instead of asking text models to hand-write SVG (one such pipeline is sketched after this list).
  • Consensus that writing complex SVG from scratch is hard even for humans; models are still much cheaper and faster, if you accept mediocre quality.
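
One way to realize that suggestion, sketched under the assumption that an image model has already produced pelican.png and that Pillow plus the potrace command-line tool are installed (potrace only traces 1-bit bitmaps, hence the threshold step):

    # Sketch of the image-model-plus-auto-vectorization route. Assumes an
    # image model already wrote pelican.png, and that Pillow and the potrace
    # CLI are available; both are assumptions, not facts from the thread.
    import subprocess
    from PIL import Image

    # potrace only traces 1-bit bitmaps, so threshold the raster first.
    Image.open("pelican.png").convert("1").save("pelican.pbm")

    # -s selects potrace's SVG output backend.
    subprocess.run(["potrace", "pelican.pbm", "-s", "-o", "pelican.svg"], check=True)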

Benchmark methodology and contamination

  • Multiple complaints about evaluating probabilistic models from a single sample; calls for many runs and averaging (sketched in code after this list).
  • Others counter that “one-shot” reflects how most users actually experience models and avoids human cherry-picking.
  • Concerns about using a single LLM as the judge; suggestions include human crowds, experts, and multiple models as evaluators.
  • As the pelican prompt spreads (talks, interviews, Google I/O), people worry it will leak into training data and be directly optimized against, reducing its value.
  • Some suggest rotating or hidden benchmarks (e.g., ARC Prize–style tasks, hashed prompts).
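
Put together, the heavier methodology people ask for might look like the sketch below: many samples per model, several judges, scores averaged, and the prompt published only as a hash, which suits the rotating or hidden benchmarks mentioned above. generate() and judge() are hypothetical stand-ins for whatever model APIs you use, not real endpoints:

    # Sketch of the proposed methodology: many runs, multiple judges, averaged
    # scores, and a hash commitment to the prompt. generate() and judge() are
    # hypothetical stubs, not real APIs.
    import hashlib
    from statistics import mean

    def generate(model: str, prompt: str) -> str:
        raise NotImplementedError  # call the text model under test here

    def judge(judge_model: str, svg: str) -> float:
        raise NotImplementedError  # ask a judge model (or human) for a 0-10 score

    PROMPT = "Generate an SVG of a pelican riding a bicycle"
    # For a hidden benchmark, publish only the hash now and reveal the prompt
    # after the eval, so it cannot be directly optimized against beforehand.
    print("prompt commitment:", hashlib.sha256(PROMPT.encode()).hexdigest())

    def score(model: str, judges: list[str], runs: int = 20) -> float:
        samples = [generate(model, PROMPT) for _ in range(runs)]
        # Average over samples and judges instead of trusting one shot graded
        # by a single model.
        return mean(judge(j, s) for s in samples for j in judges)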

Humans vs LLMs on bikes and pelicans

  • References to projects where ordinary people draw bicycles from memory that turn out mechanically impossible, showing humans also lack precise structural knowledge.
  • Disagreement over whether the “average human” still outperforms current models on basic correctness (wheels, chain, pedals) when given time and references.
  • Cost comparisons: a human drawing from scratch vs thousands of model generations plus automated ranking.

Broader context: tools, hype, and safety

  • Mentions of better vector-ish tools (e.g., Recraft) and a Kaggle SVG competition that got strong results with specialized setups.
  • Discussion of the mainstream virality of ChatGPT image generation (Ghibli-style portraits), with some downplaying it as a fad and others seeing durable adoption.
  • Safety concerns around models “snitching” on wrongdoing (SnitchBench), agentic access to tools, prompt injection, and opaque memory features reducing user control.