The last six months in LLMs, illustrated by pelicans on bicycles
Purpose and limits of the “pelican on a bicycle” test
- The thread agrees this is a deliberately ill-suited task for text-only LLMs: they must write SVG code for a novel scene with no visual feedback (a minimal illustration of the task follows this list).
- Defenders say that’s the point: it stress-tests spec-following, compositionality, and abstract visualization, much like LOGO or CAD instructions.
- Critics argue it’s a poor proxy for real engineering or design work, which depends on tacit knowledge, real-world constraints, and nuanced communication that rarely appear in training data.
- Many see it primarily as a humorous, hype-deflating benchmark rather than a serious metric.
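To make the task concrete, here is a minimal hand-written sketch of the kind of SVG composition the prompt demands; it is this summary's illustration, not output from any model. Every coordinate must be chosen blind, which is exactly what makes the test hard for a text-only model:

```python
# Hand-written illustration of the task (not model output): composing a
# "pelican on a bicycle" from SVG primitives with no visual feedback.
SVG = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 120">
  <!-- bicycle: two wheels plus a crude frame -->
  <circle cx="50" cy="95" r="20" fill="none" stroke="black"/>
  <circle cx="140" cy="95" r="20" fill="none" stroke="black"/>
  <path d="M50 95 L95 60 L140 95 M95 60 L85 95" fill="none" stroke="black"/>
  <!-- pelican: body, head, oversized beak -->
  <ellipse cx="95" cy="45" rx="18" ry="12" fill="white" stroke="black"/>
  <circle cx="112" cy="32" r="7" fill="white" stroke="black"/>
  <path d="M119 32 L145 38 L119 40 Z" fill="orange" stroke="black"/>
</svg>"""

with open("pelican.svg", "w") as f:
    f.write(SVG)
```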
Quality, cost, and when to use LLMs
- One camp: the outputs show LLMs are “all terrible” for creative/technical work and you should hire professionals.
- Another: LLMs are “go-karts of the mind”—cheap, low-end tools that are “good enough” for many tasks where a Porsche-quality result isn’t needed.
- Practical suggestions: for vector art, use image models (Midjourney, etc.) plus auto-vectorization rather than asking text models to hand-write SVG; a sketch of that pipeline follows this list.
- There is consensus that writing complex SVG from scratch is hard even for humans; models are still far cheaper and faster, if you accept mediocre quality.
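A hedged sketch of that pipeline, assuming the OpenAI image API for the raster step and the vtracer Python bindings for tracing; both library choices are this summary's assumptions, not the thread's recommendations:

```python
# Sketch: generate a raster image, then auto-trace it to SVG, instead of
# asking a text model to hand-write the SVG. Library choices are
# illustrative assumptions.
import base64

from openai import OpenAI  # pip install openai
import vtracer             # pip install vtracer

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.images.generate(
    model="dall-e-3",
    prompt="A pelican riding a bicycle, flat vector style, white background",
    size="1024x1024",
    response_format="b64_json",
)
with open("pelican.png", "wb") as f:
    f.write(base64.b64decode(resp.data[0].b64_json))

# Trace the raster into vector paths (vtracer binding API assumed).
vtracer.convert_image_to_svg_py("pelican.png", "pelican.svg")
```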
Benchmark methodology and contamination
- Multiple complaints about judging probabilistic models on a single sample; many call for repeated runs with averaged scores (see the sampling sketch after this list).
- Others counter that “one-shot” reflects how most users actually experience models and avoids human cherry-picking.
- Concerns about using a single LLM as the judge; suggestions include human crowds, experts, and multiple models as evaluators.
- As the pelican prompt spreads (talks, interviews, Google I/O), people worry it will leak into training data and be directly optimized against, reducing its value.
- Some suggest rotating or hidden benchmarks (e.g., ARC Prize–style tasks, or prompts published only as hashes; a commitment sketch follows below).
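Two of those suggestions are concrete enough to sketch. First, scoring from many runs with a panel of judge models rather than one sample and one judge; a minimal sketch, with generate() and judge() as hypothetical stand-ins for real API calls:

```python
import statistics

def generate(model: str, prompt: str) -> str:
    """Hypothetical stand-in for one sampled completion from `model`."""
    raise NotImplementedError

def judge(judge_model: str, svg: str) -> float:
    """Hypothetical stand-in: judge_model scores the SVG from 0 to 10."""
    raise NotImplementedError

PROMPT = "Generate an SVG of a pelican riding a bicycle"
JUDGES = ["judge-a", "judge-b", "judge-c"]  # placeholder judge-model names

def score_model(model: str, runs: int = 20) -> float:
    """Average over many samples and several judges, not one of each."""
    per_run = []
    for _ in range(runs):
        svg = generate(model, PROMPT)
        per_run.append(statistics.mean(judge(j, svg) for j in JUDGES))
    return statistics.mean(per_run)
```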
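Second, one reading of the hashed-prompt idea: publish only a SHA-256 commitment to a held-back prompt, and reveal the prompt after scoring so it cannot quietly enter training data beforehand. A minimal sketch:

```python
import hashlib

# The actual prompt stays private until after the evaluation round.
SECRET_PROMPT = "Generate an SVG of a <held-back animal> riding a <held-back vehicle>"

# Publish only this digest now.
commitment = hashlib.sha256(SECRET_PROMPT.encode("utf-8")).hexdigest()
print(commitment)

# Later, anyone can verify the revealed prompt matches the commitment:
assert hashlib.sha256(SECRET_PROMPT.encode("utf-8")).hexdigest() == commitment
```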
Humans vs LLMs on bikes and pelicans
- References to projects where ordinary people, asked to draw a bicycle from memory, produce mechanically impossible designs, showing that humans also lack precise structural knowledge.
- Disagreement over whether the “average human” still outperforms current models on basic correctness (wheels, chain, pedals) when given time and reference images.
- Cost comparisons: one human drawing from scratch versus thousands of model generations plus automated ranking; a back-of-envelope version follows this list.
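A back-of-envelope version of that comparison; every number below is an illustrative assumption, not data from the thread:

```python
# Illustrative-only numbers: none come from the thread.
human_rate = 50.0   # $/hour for a freelance illustrator (assumed)
human_hours = 2.0   # time to draw one decent bicycle-riding pelican (assumed)
human_cost = human_rate * human_hours

cost_per_generation = 0.01  # $/SVG attempt at typical token prices (assumed)
generations = 1000
ranking_overhead = 2.0      # $ of judge-model calls to rank attempts (assumed)
model_cost = generations * cost_per_generation + ranking_overhead

print(f"human: ${human_cost:.2f} vs {generations} generations + ranking: ${model_cost:.2f}")
# The thread's point: the model route is cheap per attempt, but quality is
# bought with volume and automated ranking rather than craftsmanship.
```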
Broader context: tools, hype, and safety
- Mentions of better vector-ish tools (e.g., Recraft) and a Kaggle SVG-generation competition that achieved strong results with specialized setups.
- Discussion of the mainstream virality of ChatGPT image generation (Ghibli-style portraits), with some dismissing it as a fad and others seeing durable adoption.
- Safety concerns around models “snitching” on wrongdoing (SnitchBench), agentic access to tools, prompt injection, and opaque memory features that reduce user control.