2024-08-17

Are you better than a language model at predicting the next word?

Game concept and mechanics

The quiz asks humans to predict the next word in real Hacker News comments, framed as a competition against several LLMs and a unigram baseline.
Incorrect options are generated by an LLM (e.g., llama2); each model then “chooses” among the four options by picking the completion with the lowest perplexity.
Temperature is not used for the choice; instruction‑tuned/chatbot behavior is largely avoided to reduce “voice” bias.
“Correct” means “the word that actually appeared next in the original HN comment,” not any semantically plausible word.
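
The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the quiz's actual code: it uses a tiny add-one-smoothed unigram model as a stand-in scorer, and the names `train_unigram`, `perplexity`, and `choose`, the toy corpus, and the four options are all hypothetical. A real LLM would score each token conditioned on the preceding context, and the discussion does not say whether perplexity is computed over the whole text or only the completion; the sketch assumes the whole text.

```python
import math
from collections import Counter

def train_unigram(corpus_tokens):
    """Return a function mapping a token to its add-one-smoothed log-probability."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen tokens

    def logprob(token):
        return math.log((counts[token] + 1) / (total + vocab))

    return logprob

def perplexity(logprob, tokens):
    """exp of the mean negative log-probability over the given tokens."""
    nll = -sum(logprob(t) for t in tokens) / len(tokens)
    return math.exp(nll)

def choose(logprob, prompt, options):
    """Pick the option whose completed text (prompt + option) has the lowest perplexity."""
    return min(
        options,
        key=lambda opt: perplexity(logprob, (prompt + " " + opt).split()),
    )

# Hypothetical toy corpus and question, for illustration only.
corpus = "the model predicts the next word and the next word is the answer".split()
logprob = train_unigram(corpus)
options = ["word", "banana", "answer", "the"]
print(choose(logprob, "predict the next", options))
```

Because a unigram model ignores context, it simply prefers the most frequent word among the options (here "the"), which is why it serves as a baseline rather than a serious contender.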

User experience and clarity

Several people enjoy the idea and call it clever or fun, but many find the quiz longer or more tedious than expected.
Suggestions include showing one question at a time, providing instant feedback, and clarifying upfront that prompts come from HN and that “correct” means the original commenter’s word.
Some confusion arises from very short or clipped prompts (e.g., only a symbol or a single word), which feel like pure guessing.

Difficulty, scores, and strategies

Reported human scores vary widely (e.g., 2/15 to 11/15; 28/100), often near or slightly above random choice.
LLM scores also vary significantly per run and per model; sometimes humans beat the best model, sometimes not, and the unigram baseline occasionally does surprisingly well.
The author reports that on 1,000 questions, LLMs get ~30–35% correct; patient humans can reach ~40–50%.
Some users notice they do better on longer prompts or when they recognize specific HN comments.
A proposed strategy is to pick the “outlier” option, assuming distractors are LLM‑like; effectiveness is unclear.
Several people argue that a scoring method based on ranking / probabilities (cross‑entropy, likelihood) would be more meaningful than strict right/wrong.
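
To make the probability-based scoring idea concrete, here is a small sketch comparing strict accuracy with cross-entropy (mean negative log-probability of the true word). The prediction data, option labels, and function names are invented for illustration; the quiz itself does not collect probabilities from players. With four options, a uniform guess scores ln 4 ≈ 1.386 nats, which gives a reference point for "random choice".

```python
import math

def accuracy(predictions, answers):
    """Fraction of questions where the highest-probability option is the true word."""
    hits = sum(max(p, key=p.get) == a for p, a in zip(predictions, answers))
    return hits / len(answers)

def cross_entropy(predictions, answers):
    """Mean negative log-probability (in nats) assigned to the true word.

    Assumes every true word received a nonzero probability.
    """
    return -sum(math.log(p[a]) for p, a in zip(predictions, answers)) / len(answers)

# Hypothetical predictions over four options per question.
predictions = [
    {"a": 0.70, "b": 0.10, "c": 0.10, "d": 0.10},  # confident and right
    {"a": 0.40, "b": 0.35, "c": 0.15, "d": 0.10},  # top guess wrong, but close
]
answers = ["a", "b"]

# A player who spreads probability evenly over the four options.
uniform = [{k: 0.25 for k in "abcd"} for _ in answers]

print(accuracy(predictions, answers))
print(cross_entropy(predictions, answers))
print(cross_entropy(uniform, answers))
```

Under strict right/wrong scoring the second question counts as a plain miss, whereas cross-entropy still gives partial credit for placing 35% on the true word.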

What is really being measured?

Many emphasize that the quiz measures alignment with one specific human’s next word, not objective correctness or “intelligence.”
Multiple options are often grammatically and semantically valid; choosing which is “right” is somewhat arbitrary.
Some argue this shows the limits of next‑word prediction as a proxy for “smartness”; others say that is precisely the point—LLMs are statistical next‑token models, not deep reasoners.

Methodological and training-data concerns

Questions are raised about whether HN comments used might be in the training data of some models, and whether that biases results.
Using older comments is suggested but might increase training-set overlap; using newer ones reduces that risk, though the absence of overlap cannot be fully verified.
There is discussion of how base models differ from instruction‑tuned chatbots, how “style” emerges, and how beam search can approximate “going back” on predictions.
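
The beam search point can be illustrated with a toy example. The transition table `MODEL` below is entirely made up and conditions only on the previous token, which is far simpler than a real LLM; it is chosen so that the greedy first choice turns out worse once the second token is considered. Keeping several partial sequences alive lets the search effectively revise an early choice.

```python
import math

# Hypothetical toy model: next-token probabilities given only the previous token.
MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"bird": 0.9, "cat": 0.1},
}

def greedy(steps):
    """Always commit to the single most probable next token."""
    seq, score = ["<s>"], 0.0
    for _ in range(steps):
        tok, p = max(MODEL[seq[-1]].items(), key=lambda kv: kv[1])
        seq.append(tok)
        score += math.log(p)
    return seq, score

def beam_search(steps, width):
    """Keep the `width` best partial sequences (by total log-probability) at each step."""
    beams = [(0.0, ["<s>"])]
    for _ in range(steps):
        candidates = [
            (score + math.log(p), seq + [tok])
            for score, seq in beams
            for tok, p in MODEL[seq[-1]].items()
        ]
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:width]
    best_score, best_seq = beams[0]
    return best_seq, best_score

print(greedy(2))          # commits to "the" first
print(beam_search(2, 2))  # keeps "a" alive and ends with a more probable sequence
```

Greedy decoding locks in "the" (probability 0.6) and ends with a sequence of probability 0.3, while a beam of width 2 retains "a" and finds "a bird" at 0.36.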

Broader reflections on LLMs and intelligence

Commenters contrast statistical pattern matching (LLMs) with symbolic reasoning and general intelligence, noting that success at next‑word prediction doesn’t imply AGI.
Others note that humans bring fresh experiences, long‑term goals, and social actions (e.g., politics, organizing, inviting someone for coffee) that LLMs currently lack.
Some see the quiz as a humorous inversion of “look, I broke the AI” posts and as a good teaching tool for how LLMs actually operate.
