The last six months in LLMs, illustrated by pelicans on bicycles

Purpose and limits of the “pelican on a bicycle” test

  • Thread agrees this is an intentionally inappropriate task for text-only LLMs: they must write SVG code for a novel scene with no visual feedback (a minimal illustration follows this list).
  • Defenders say that’s the point: it stress-tests following a spec, compositionality, and abstract visualization, a bit like LOGO or CAD instructions.
  • Critics argue it’s a poor proxy for real engineering or design work, which depends on tacit knowledge, real-world constraints, and nuanced communication that aren’t captured in online training data.
  • Many see it primarily as a humorous, hype-deflating benchmark rather than a serious metric.
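
As a rough illustration of what the task asks for, here is a minimal sketch (illustrative, not taken from the thread) of the kind of SVG a model must emit blind, choosing every coordinate without ever seeing the render:

    # Minimal sketch (illustrative, not from the thread): the kind of SVG an
    # LLM must write sight-unseen; every coordinate is chosen with no visual
    # feedback on the result.
    svg = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="120">
      <circle cx="50" cy="90" r="25" fill="none" stroke="black"/>   <!-- rear wheel -->
      <circle cx="150" cy="90" r="25" fill="none" stroke="black"/>  <!-- front wheel -->
      <path d="M50 90 L95 55 L150 90 M95 55 L110 40" stroke="black" fill="none"/>  <!-- frame -->
      <ellipse cx="95" cy="35" rx="18" ry="12" fill="white" stroke="black"/>  <!-- pelican body -->
      <path d="M110 30 L135 33 L110 38 Z" fill="orange"/>  <!-- beak -->
    </svg>"""
    with open("pelican.svg", "w") as f:
        f.write(svg)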

Quality, cost, and when to use LLMs

  • One camp: the outputs show LLMs are “all terrible” for creative/technical work and you should hire professionals.
  • Another: LLMs are “go-karts of the mind”—cheap, low-end tools that are “good enough” for many tasks where a Porsche-quality result isn’t needed.
  • Practical suggestions: for vector art, use image models (Midjourney, etc.) plus auto-vectorization instead of asking text models to hand-write SVG (one such pipeline is sketched after this list).
  • Consensus that writing complex SVG from scratch is hard even for humans; models are still much cheaper and faster, if you accept mediocre quality.
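
One way to realize that suggestion, sketched under the assumption that an image model has already produced pelican.png and that Pillow plus the potrace command-line tool are installed (potrace only traces 1-bit bitmaps, hence the threshold step):

    # Sketch of the image-model-plus-auto-vectorization route. Assumes an
    # image model already wrote pelican.png, and that Pillow and the potrace
    # CLI are available; both are assumptions, not facts from the thread.
    import subprocess
    from PIL import Image

    # potrace only traces 1-bit bitmaps, so threshold the raster first.
    Image.open("pelican.png").convert("1").save("pelican.pbm")

    # -s selects potrace's SVG output backend.
    subprocess.run(["potrace", "pelican.pbm", "-s", "-o", "pelican.svg"], check=True)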

Benchmark methodology and contamination

  • Multiple complaints about evaluating probabilistic models from a single sample; calls for many runs and averaging (sketched in code after this list).
  • Others counter that “one-shot” reflects how most users actually experience models and avoids human cherry-picking.
  • Concerns about using a single LLM as the judge; suggestions include human crowds, experts, and multiple models as evaluators.
  • As the pelican prompt spreads (talks, interviews, Google I/O), people worry it will leak into training data and be directly optimized against, reducing its value.
  • Some suggest rotating or hidden benchmarks (e.g., ARC Prize–style tasks, hashed prompts).
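
Put together, the heavier methodology people ask for might look like the sketch below: many samples per model, several judges, scores averaged, and the prompt published only as a hash, which suits the rotating or hidden benchmarks mentioned above. generate() and judge() are hypothetical stand-ins for whatever model APIs you use, not real endpoints:

    # Sketch of the proposed methodology: many runs, multiple judges, averaged
    # scores, and a hash commitment to the prompt. generate() and judge() are
    # hypothetical stubs, not real APIs.
    import hashlib
    from statistics import mean

    def generate(model: str, prompt: str) -> str:
        raise NotImplementedError  # call the text model under test here

    def judge(judge_model: str, svg: str) -> float:
        raise NotImplementedError  # ask a judge model (or human) for a 0-10 score

    PROMPT = "Generate an SVG of a pelican riding a bicycle"
    # For a hidden benchmark, publish only the hash now and reveal the prompt
    # after the eval, so it cannot be directly optimized against beforehand.
    print("prompt commitment:", hashlib.sha256(PROMPT.encode()).hexdigest())

    def score(model: str, judges: list[str], runs: int = 20) -> float:
        samples = [generate(model, PROMPT) for _ in range(runs)]
        # Average over samples and judges instead of trusting one shot graded
        # by a single model.
        return mean(judge(j, s) for s in samples for j in judges)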

Humans vs LLMs on bikes and pelicans

  • References to projects where ordinary people draw bicycles from memory that turn out mechanically impossible, showing humans also lack precise structural knowledge.
  • Disagreement over whether the “average human” still outperforms current models on basic correctness (wheels, chain, pedals) when given time and references.
  • Cost comparisons: a human drawing from scratch vs thousands of model generations plus automated ranking.

Broader context: tools, hype, and safety

  • Mentions of better vector-ish tools (e.g., Recraft) and a Kaggle SVG competition that got strong results with specialized setups.
  • Discussion of the mainstream virality of ChatGPT image generation (Ghibli-style portraits), with some downplaying it as a fad and others seeing durable adoption.
  • Safety concerns around models “snitching” on wrongdoing (SnitchBench), agentic access to tools, prompt injection, and opaque memory features reducing user control.