Is 2026 next year?

LLM and Google AI Failures on a Trivial Date Question

  • Multiple models (Google’s AI Overview, ChatGPT, Claude Haiku, and some open-source LLMs) give self-contradictory or flatly wrong answers to “Is 2026 next year?”, despite being given the correct current year in context.
  • Some models initially answer “no,” then immediately give reasoning that implies “yes,” or flip mid-answer.
  • Others answer correctly but only after an extra reasoning pass or with “extended thinking” enabled.

User Experience: Arguing with a Token Generator

  • Several comments describe the same pattern when correcting LLM errors:
    • First, the model confidently deflects or reframes the user’s correction.
    • Then, when the offending output is quoted back to it verbatim, it apologizes profusely and restates the user’s explanation at length.
  • This wastes context and tokens, makes the conversation unusable, and leaves users feeling they must find “magic words” just to get the model to fix the bug.
  • Some argue that asking an LLM to “explain why it was wrong” is misguided: it’s just generating new tokens, not introspecting on prior output.

Do LLMs ‘Think’ or Have Knowledge?

  • One side: LLMs lack critical thinking, logic, skepticism, self-reflection, common sense, and in-session learning; they are sophisticated text predictors, not reasoning agents.
  • Others counter that large models exhibit internal structures and behaviors suggestive of world knowledge and some form of intelligence, blurring the line between them and human cognition.
  • There is debate over definitions of “intelligence,” whether next-token prediction alone can solve novel problems, and how this differs from human mental models.

Reliability, Usefulness, and Scope

  • Some conclude these tools should not be trusted for anything that truly matters if they fail even basic date arithmetic.
  • Others (some sarcastically) claim AI is still an “industrial revolution” for productivity, though critics say its main safe use today is summarization or boilerplate generation.
  • Disagreement over whether LLMs genuinely solve novel problems (e.g., coding, math) versus just remixing common solutions.

Google Search, Brand, and Feedback Loop

  • Concern that Google is shipping weaker models in AI Overviews, harming perceived quality; some shrug this off as expected “enshittification.”
  • Some now use search as an AI prompt interface despite poor accuracy.
  • Geographic variation: some regions see no AI Overview at all, with only this HN thread as the top result.
  • Worry that AI systems will now train on this noisy thread itself, amplifying confusion (a kind of “generation loss”).

Technical Explanations and Mitigations

  • One analysis: the yes/no framing, combined with training data written mostly in earlier years, biases models toward the wrong answer unless they reason explicitly before answering.
  • Another view: the bug stems from conflicting signals between the model’s training-cutoff “world” and the current date injected via system prompts or search results.
  • Suggestions:
    • Always inject the current date into the system prompt (see the first sketch after this list).
    • Offload arithmetic and date logic to deterministic tools and force LLMs to call them instead of improvising (see the second sketch after this list).
    • Avoid ambiguous language like “next Friday,” which even humans disagree on.
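
  A minimal sketch of the first suggestion, in Python. The call_llm callable and the exact prompt wording are assumptions here, standing in for whatever chat-completion client and phrasing are actually used:

    # Resolve today's date at request time and put it in the system prompt,
    # so the model does not have to guess the year from its training data.
    from datetime import date

    def build_system_prompt() -> str:
        today = date.today()
        return (
            f"Today's date is {today.isoformat()}; the current year is {today.year}. "
            "When a question involves dates or years, state the relevant years "
            "explicitly before giving a yes/no answer."
        )

    def ask(question: str, call_llm) -> str:
        # call_llm is a placeholder for a real chat-completion call.
        messages = [
            {"role": "system", "content": build_system_prompt()},
            {"role": "user", "content": question},
        ]
        return call_llm(messages)

  The “state the relevant years explicitly” instruction doubles as the pre-reasoning nudge described in the first bullet above.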
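
  A sketch of the second suggestion: the yes/no answer comes from deterministic code, and the model only supplies tool arguments. The OpenAI-style JSON schema is one common way to describe such a tool; wiring it to a specific function-calling API is omitted:

    # Deterministic date helpers the model is forced to call instead of doing
    # year arithmetic in free text.
    from datetime import date

    def current_year() -> int:
        return date.today().year

    def is_next_year(year: int) -> bool:
        return year == current_year() + 1

    # Tool description handed to the model; it may only fill in the arguments,
    # never compute the result itself.
    IS_NEXT_YEAR_TOOL = {
        "name": "is_next_year",
        "description": "True if the given calendar year is the year after the current one.",
        "parameters": {
            "type": "object",
            "properties": {"year": {"type": "integer"}},
            "required": ["year"],
        },
    }

    assert is_next_year(current_year() + 1)  # sanity check: always true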

Broader Reflections

  • Some compare LLM reasoning to purely mechanical processes: impressive but not indicative of true understanding or consciousness.
  • Others ask how different this really is from human cognition, which may also be mechanistic, though humans add subjective experience.
  • The episode reinforces that disclaimers (“AI responses may include mistakes”) are not enough when fallible AI is placed at the top of critical interfaces like search.