Are you better than a language model at predicting the next word?

Game concept and mechanics

  • The quiz asks humans to predict the next word in real Hacker News comments, framed as a competition against several LLMs and a unigram baseline.
  • Incorrect options are generated by an LLM (e.g., llama2); each model then “chooses” among the four options by picking the completion with lowest perplexity.
  • Sampling temperature plays no role in the choice; instruction‑tuned/chatbot behavior is largely avoided to reduce “voice” bias.
  • “Correct” means “the word that actually appeared next in the original HN comment,” not any semantically plausible word.
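
The lowest-perplexity choice described above can be sketched in a few lines. This is a minimal illustration, not the quiz author's code: `score_fn`, `UNIGRAM_FREQ`, and the word frequencies are hypothetical, and scoring only the candidate word (rather than the whole prompt plus candidate) is an assumption the discussion does not settle.

```python
import math
from typing import Callable, Sequence

def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def choose_option(
    prompt: str,
    options: Sequence[str],
    score_fn: Callable[[str, str], Sequence[float]],
) -> int:
    """Return the index of the option whose completion has the lowest perplexity."""
    ppl = [perplexity(score_fn(prompt, opt)) for opt in options]
    return min(range(len(options)), key=ppl.__getitem__)

# Hypothetical unigram baseline: made-up word frequencies, prompt is ignored.
UNIGRAM_FREQ = {"the": 0.05, "is": 0.02, "model": 0.001, "banana": 0.0001}

def unigram_logprobs(prompt: str, option: str) -> list:
    # Unknown words get a small floor probability (an assumption).
    return [math.log(UNIGRAM_FREQ.get(w, 1e-6)) for w in option.split()]

options = ["banana", "the", "model", "is"]
best = choose_option("I think that", options, unigram_logprobs)
print(options[best])
```

A real LLM would replace `unigram_logprobs` with a function that returns the model's per-token log-probabilities for the candidate given the prompt; the selection step stays the same, and no sampling is involved.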

User experience and clarity

  • Several people enjoy the idea and call it clever or fun, but many find the quiz longer or more tedious than expected.
  • Suggestions include showing one question at a time, providing instant feedback, and clarifying up front that prompts come from HN and that “correct” means the original commenter’s word.
  • Some confusion arises from very short or clipped prompts (e.g., only a symbol or a single word), which feel like pure guessing.

Difficulty, scores, and strategies

  • Reported human scores vary widely (e.g., 2/15 to 11/15; 28/100), often near or slightly above random choice (25% with four options).
  • LLM scores also vary significantly per run and per model; sometimes humans beat the best model, sometimes not, and the unigram baseline occasionally does surprisingly well.
  • The author reports that on 1,000 questions, LLMs get ~30–35% correct; patient humans can reach ~40–50%.
  • Some users notice they do better on longer prompts or when they recognize specific HN comments.
  • A proposed strategy is to pick the “outlier” option, assuming distractors are LLM‑like; effectiveness is unclear.
  • Several people argue that a scoring method based on ranking / probabilities (cross‑entropy, likelihood) would be more meaningful than strict right/wrong.
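
To make the contrast between strict right/wrong scoring and probability-based scoring concrete, here is a rough sketch. It assumes each player (human or model) reports a probability for each of the four options; the two example players and all their numbers are invented for illustration.

```python
import math

def accuracy(dists, answers):
    """Fraction of questions where the highest-probability option is the answer."""
    hits = sum(
        1
        for p, a in zip(dists, answers)
        if max(range(len(p)), key=p.__getitem__) == a
    )
    return hits / len(answers)

def mean_cross_entropy(dists, answers):
    """Average negative log-probability (nats) assigned to the true answer.

    Lower is better. Requires a non-zero probability on every true answer.
    """
    return -sum(math.log(p[a]) for p, a in zip(dists, answers)) / len(answers)

# Two hypothetical players answering the same two questions (invented numbers).
answers = [0, 2]
confident = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]]
hedged = [[0.40, 0.30, 0.20, 0.10], [0.30, 0.40, 0.20, 0.10]]

uniform_baseline = math.log(4)  # cross-entropy of guessing 25% on every option

for name, dists in [("confident", confident), ("hedged", hedged)]:
    print(name, accuracy(dists, answers), round(mean_cross_entropy(dists, answers), 3))
```

Both players get one of two questions right, so strict scoring cannot tell them apart. Cross-entropy does: the hedged player beats the uniform baseline, while the confident player is penalized heavily for putting almost no probability on the second answer.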

What is really being measured?

  • Many emphasize that the quiz measures alignment with one specific human’s next word, not objective correctness or “intelligence.”
  • Multiple options are often grammatically and semantically valid; choosing which is “right” is somewhat arbitrary.
  • Some argue this shows the limits of next‑word prediction as a proxy for “smartness”; others say that is precisely the point: LLMs are statistical next‑token models, not deep reasoners.

Methodological and training-data concerns

  • Questions are raised about whether the HN comments used might be in the training data of some models, and whether that biases results.
  • Using older comments is suggested but might increase training-set overlap; using newer ones reduces that risk, though the absence of overlap cannot be fully verified.
  • There is discussion of how base models differ from instruction‑tuned chatbots, how “style” emerges, and how beam search can approximate “going back” on predictions.
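
As a rough illustration of the beam-search point, the toy below keeps several partial sequences alive so that an early, locally best word can later be dropped in favor of a better overall sequence. The bigram table `BIGRAM` and its probabilities are invented, and this is a generic sketch rather than anything the quiz or a specific library implements.

```python
import math

# Hypothetical toy model: P(next word | previous word). All values are invented.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "fish": 0.2},
    "a": {"cat": 0.9, "dog": 0.1},
}

def beam_search(start, steps, width):
    """Return the most probable word sequence found with the given beam width."""
    beams = [([start], 0.0)]  # (words so far, total log-probability)
    for _ in range(steps):
        candidates = []
        for words, logp in beams:
            for nxt, p in BIGRAM.get(words[-1], {}).items():
                candidates.append((words + [nxt], logp + math.log(p)))
        if not candidates:
            break  # no beam can be extended any further
        # Keep only the `width` highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams[0][0][1:]  # drop the start symbol

print(beam_search("<s>", steps=2, width=1))  # width 1 is plain greedy decoding
print(beam_search("<s>", steps=2, width=2))
```

With width 1 the search commits to "the" (0.6) and ends with a sequence of probability 0.30. With width 2 it also keeps "a" around, and "a cat" (0.4 × 0.9 = 0.36) wins, which is the sense in which beam search approximates revising an earlier prediction.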

Broader reflections on LLMs and intelligence

  • Commenters contrast statistical pattern matching (LLMs) with symbolic reasoning and general intelligence, noting that success at next‑word prediction doesn’t imply AGI.
  • Others note that humans bring fresh experiences, long‑term goals, and social actions (e.g., politics, organizing, inviting someone for coffee) that LLMs currently lack.
  • Some see the quiz as a humorous inversion of “look, I broke the AI” posts and as a good teaching tool for how LLMs actually operate.