Are you better than a language model at predicting the next word?
Game concept and mechanics
- Quiz asks humans to predict the next word in real Hacker News comments, framed as a competition with several LLMs and a unigram baseline.
- Incorrect options are generated by an LLM (e.g., llama2); each model then “chooses” among the four options by picking the completion with lowest perplexity.
- Temperature is not used for the choice; instruction‑tuned/chatbot behavior is largely avoided to reduce “voice” bias.
- “Correct” means “the word that actually appeared next in the original HN comment,” not any semantically plausible word.
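The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the quiz author's code: the unigram scorer, the add-one smoothing, the tiny corpus, and the names `perplexity`, `make_unigram_scorer`, and `choose` are all assumptions made for the example. A real LLM would supply context-dependent per-token log-probabilities in place of the unigram counts.

```python
import math
from collections import Counter

def perplexity(logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

def make_unigram_scorer(corpus_tokens):
    """Build a context-free scorer (hypothetical stand-in for a model)."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot so unseen words get nonzero mass

    def score(prompt_tokens, option):
        # Log-probability of every token in prompt + candidate word.
        tokens = prompt_tokens + [option]
        return [math.log((counts[t] + 1) / (total + vocab)) for t in tokens]

    return score

def choose(prompt_tokens, options, score):
    """Pick the option whose full completion has the lowest perplexity."""
    return min(options, key=lambda opt: perplexity(score(prompt_tokens, opt)))

# Toy data, invented for illustration only.
corpus = "the model predicts the next word and the next word follows the prompt".split()
scorer = make_unigram_scorer(corpus)
picked = choose(
    "the model predicts the".split(),
    ["next", "banana", "prompt", "follows"],
    scorer,
)
print(picked)
```

Because the prompt is identical across the four candidates, a unigram scorer simply ends up preferring the most frequent candidate word, which is why it serves as a baseline; no sampling temperature is involved anywhere in the choice.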
User experience and clarity
- Several people enjoy the idea and call it clever or fun, but many find the quiz longer or more tedious than expected.
- Suggestions include showing one question at a time, providing instant feedback, and clarifying upfront that prompts come from HN and that “correct” = original commenter’s word.
- Some confusion arises from very short or clipped prompts (e.g., only a symbol or a single word), which feel like pure guessing.
Difficulty, scores, and strategies
- Reported human scores vary widely (e.g., 2/15 to 11/15; 28/100) and often sit near or slightly above the 25% expected from random choice among four options.
- LLM scores also vary significantly per run and per model; sometimes humans beat the best model, sometimes not, and the unigram baseline occasionally does surprisingly well.
- The author reports that on 1,000 questions, LLMs get ~30–35% correct; patient humans can reach ~40–50%.
- Some users notice they do better on longer prompts or when they recognize specific HN comments.
- A proposed strategy is to pick the “outlier” option, assuming distractors are LLM‑like; effectiveness is unclear.
- Several people argue that a scoring method based on ranking / probabilities (cross‑entropy, likelihood) would be more meaningful than strict right/wrong.
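Two of the points above can be made concrete: how likely a short-quiz score is under pure guessing, and how a probability-based score differs from strict right/wrong. The sketch below is an illustration only; the function names and the two example players' probability tables are invented, and nothing here reflects how the quiz itself scores.

```python
import math

def chance_tail(k, n, p=0.25):
    """Probability of at least k correct out of n when guessing with success rate p."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def accuracy(predictions, answers):
    """Fraction of questions where the highest-probability option is the right one."""
    hits = 0
    for probs, answer in zip(predictions, answers):
        best = max(range(len(probs)), key=probs.__getitem__)
        hits += best == answer
    return hits / len(answers)

def mean_cross_entropy(predictions, answers):
    """Average negative log-probability assigned to the right option (lower is better)."""
    return -sum(math.log(probs[a]) for probs, a in zip(predictions, answers)) / len(answers)

# Hypothetical players: each row is a distribution over the four options.
answers = [0, 2]
confident = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]]
hedged = [[0.40, 0.20, 0.20, 0.20], [0.20, 0.40, 0.30, 0.10]]

for name, preds in [("confident", confident), ("hedged", hedged)]:
    print(name, accuracy(preds, answers), round(mean_cross_entropy(preds, answers), 3))

# How surprising are short-quiz scores under pure guessing?
print(chance_tail(11, 15), chance_tail(4, 15))
```

Both hypothetical players get one of two questions right, so strict scoring cannot separate them, while cross-entropy penalizes the player who was confidently wrong. A uniform guess over four options scores ln 4 per question, which gives a natural reference point.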
What is really being measured?
- Many emphasize that the quiz measures alignment with one specific human’s next word, not objective correctness or “intelligence.”
- Multiple options are often grammatically and semantically valid; choosing which is “right” is somewhat arbitrary.
- Some argue this shows the limits of next‑word prediction as a proxy for “smartness”; others say that is precisely the point: LLMs are statistical next‑token models, not deep reasoners.
Methodological and training-data concerns
- Questions are raised about whether the HN comments used might be in some models’ training data, and whether that biases results.
- Using older comments is suggested, but that might increase training-set overlap; using newer ones reduces the risk, although the absence of overlap cannot be fully verified.
- There is discussion of how base models differ from instruction‑tuned chatbots, how “style” emerges, and how beam search can approximate “going back” on predictions.
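The beam-search remark can be illustrated with a toy example. This is a sketch under stated assumptions: `MODEL` is an invented table of next-token probabilities keyed only by the previous token, and `beam_search` is a hypothetical helper, not any library's API. Keeping several partial sequences alive lets a lower-ranked first word win later, which is the sense in which beam search approximates “going back.”

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"bird": 0.9, "fish": 0.1},
}

def beam_search(model, start, steps, beam_width):
    """Return the surviving (tokens, log_probability) pairs, best first."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, logp in beams:
            # Extend each kept sequence by every possible next token.
            for nxt, p in model.get(tokens[-1], {}).items():
                candidates.append((tokens + [nxt], logp + math.log(p)))
        if not candidates:
            break  # nothing left to extend in this toy model
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

greedy = beam_search(MODEL, "<s>", steps=2, beam_width=1)[0]
wide = beam_search(MODEL, "<s>", steps=2, beam_width=2)[0]
print(greedy[0], round(math.exp(greedy[1]), 2))
print(wide[0], round(math.exp(wide[1]), 2))
```

With a beam width of 1 the search commits to "the" and cannot recover; with a width of 2 it keeps "a" around long enough to find the more probable two-word sequence.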
Broader reflections on LLMs and intelligence
- Commenters contrast statistical pattern matching (LLMs) with symbolic reasoning and general intelligence, noting that success at next‑word prediction doesn’t imply AGI.
- Others note that humans bring fresh experiences, long‑term goals, and social actions (e.g., politics, organizing, inviting someone for coffee) that LLMs currently lack.
- Some see the quiz as a humorous inversion of “look, I broke the AI” posts and as a good teaching tool for how LLMs actually operate.