LLMs tell bad jokes because they avoid surprises

Surprise, probability, and training

  • Many commenters like the “surprising but inevitable” framing of jokes, and connect it to LLM training minimizing perplexity (roughly, average surprise) on text.
  • Others push back: pretraining on next-token prediction doesn’t inherently penalize surprise at the sequence level; the “best” joke continuation could be globally likely even if some individual tokens are low probability (see the toy sketch after this list).
  • Temperature and decoding are highlighted: low temperature + safety finetuning bias toward bland, unsurprising text; but simply increasing temperature doesn’t reliably make jokes better, just weirder.
  • Some argue the article conflates token-level likelihood with human-level “surprise” and over-psychologizes cross‑entropy minimization.
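
A toy illustration of the two quantities the thread keeps distinguishing — per-token surprisal vs. sequence-level likelihood — and of what temperature actually changes. The probabilities and logits below are made up rather than taken from any model; this is a sketch of the concepts, not an implementation of anything proposed above.

    import math

    def surprisal_bits(p):
        """Per-token surprisal in bits: -log2(p)."""
        return -math.log2(p)

    # Made-up per-token probabilities for two continuations of the same setup.
    bland  = [0.20, 0.18, 0.15, 0.17]   # every token moderately likely
    punchy = [0.30, 0.25, 0.02, 0.90]   # one rare "twist" token, then a near-forced one

    for name, probs in [("bland", bland), ("punchy", punchy)]:
        total = sum(surprisal_bits(p) for p in probs)
        ppl = 2 ** (total / len(probs))
        print(f"{name}: total surprisal {total:.2f} bits, perplexity {ppl:.2f}")
    # Here the "punchy" sequence is globally *more* likely (lower total surprisal)
    # even though it contains a 2% token -- the sequence-level point made above.

    def apply_temperature(logits, temp):
        """Softmax over logits scaled by 1/temp."""
        scaled = [x / temp for x in logits]
        m = max(scaled)
        exps = [math.exp(x - m) for x in scaled]
        z = sum(exps)
        return [e / z for e in exps]

    logits = [3.0, 2.5, 0.5, -1.0]      # made-up next-token logits
    for temp in (0.2, 1.0, 1.5):
        probs = [round(p, 3) for p in apply_temperature(logits, temp)]
        print(f"T={temp}: {probs}")
    # Low temperature concentrates mass on the top token (blander text); high
    # temperature flattens the distribution -- more randomness, not more structure.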

Safety, RLHF, and guardrails

  • Several note that production models are heavily tuned for factuality and safety, which cuts off many joke modes (edgy, transgressive, or absurd).
  • This tuning also encourages explicit meta-commentary (“this is a joke…”), which ruins timing and immersion.
  • People suspect some “canned” jokes are hard‑wired for evaluations, and that models revert to safe, overused material without careful prompting.

Difficulty of humor and human comparison

  • A recurring theme: good original jokes are extremely hard even for humans; comparing LLMs to professional comedians is an unfair benchmark.
  • Comparisons are made to children’s jokes and anti‑jokes: kids and LLMs both often get the structure but miss the sharp, specific twist.
  • Some say current top models can reach “junior comic / open‑mic” quality on niche prompts, with maybe 10–20% of lines landing. Others still find them flat or derivative.

Humor theory, structure, and culture

  • Commenters reference incongruity theory: humor arises when a punchline forces a reinterpretation of the setup. Ambiguity and “frame shifts” (e.g., “alleged killer whale”) are central.
  • Others emphasize “obviousness”: the funniest lines often state the most salient but unspoken thought, not the cleverest one. LLMs tend to be too generic and non‑committal to do this well.
  • Several note cultural and linguistic differences (e.g., pun density in English vs French, haiku cutting words) as further complications for generalized joke generation.

Proposals and experiments

  • Ideas include: an explicit “Surprise Mode,” searching candidate continuations for contradictions, and building humor‑specialized models; a rough re-ranking sketch follows this list.
  • Many share prompt experiments (HN roasts, “Why did the sun climb a tree?”, man/dog jokes), illustrating that models can sometimes be genuinely funny but are inconsistent and often recycle known material.
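
A rough sketch of the “sample candidates, then re-rank” idea from the first bullet above. Everything here is hypothetical: logprob() is a stand-in for whatever model call returns log P(continuation | context), and the scoring rule — reward punchlines that are unlikely given the bare setup but plausible once the text is framed as a joke — is just one guess at operationalizing “surprising but inevitable,” not an established method.

    import random

    def logprob(context: str, continuation: str) -> float:
        """Stand-in scorer returning arbitrary values; replace with a real
        model call that gives log P(continuation | context)."""
        rng = random.Random(context + "||" + continuation)
        return -rng.uniform(1.0, 20.0)

    def rank_punchlines(setup, candidates, joke_frame="Punchline of a good joke:"):
        """Prefer candidates that are surprising given the bare setup yet still
        plausible given a joke-framed setup (a hypothetical heuristic)."""
        scored = []
        for c in candidates:
            surprise = -logprob(setup, c)                       # unexpected from the setup alone
            fit = logprob(setup + "\n" + joke_frame + " ", c)   # coherent once framed as a joke
            scored.append((surprise + fit, c))
        return sorted(scored, reverse=True)

    setup = "Why did the sun climb a tree?"
    candidates = [
        "To get a better view of the horizon.",
        "Because it heard the leaves were throwing shade.",
        "It wanted to set an example.",
    ]
    for score, line in rank_punchlines(setup, candidates):
        print(round(score, 2), line)

In practice the candidate list would itself be sampled from the model at a higher temperature, and the weighting between the two terms would need tuning; this only shows the shape of the proposal.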