UCSD: Large Language Models Pass the Turing Test

What the Turing Test Means (and Whether This Counts)

  • Several commenters stress Turing’s “imitation game” was a philosophical tool about intersubjective recognition, not a precise engineering benchmark.
  • Others argue the test is really about humans’ susceptibility to being fooled, not about machine intelligence.
  • Some say this result mostly indicts the Turing test: LLMs clearly can’t replace humans in many intellectual tasks yet still “pass,” so the test is weak as an intelligence measure.
  • There’s debate whether GPT‑4.5 being picked as “human” 73% of the time is a pass or a fail:
    • One view: success should be ~50%; any deviation shows systematic difference, hence distinguishability.
    • Counterview: from the interrogator’s binary perspective, consistent misclassification still shows the model is more human‑seeming than the human.

Methodology, Interrogators, and Prompting

  • Original Turing 5‑minute duration is noted; some ask what happens with longer, richer conversations.
  • Released transcripts show many interrogators doing minimal small talk for course credit, not serious adversarial probing.
  • Commenters suggest stronger incentives (cash rewards) and explicit encouragement to “break” the system would matter.
  • People highlight that prompting (“humanlike persona”) drastically changes outcomes; baseline GPT‑4o/ELIZA do poorly, GPT‑4.5 with persona does very well.
  • Some argue trivial jailbreak or policy‑violation prompts could still easily reveal many current LLMs, so this setup is not an “accurate” Turing test.

Philosophical Debates: Understanding vs. Imitation

  • Long subthread on the Chinese Room:
    • One side: symbol manipulation without environmental grounding cannot yield real “meaning”; LLM‑style competence is only syntactic.
    • Other side: if the overall system behaves as if it understands, insisting it “doesn’t really” is arbitrary or dualistic; human brains are likewise composed of non‑understanding parts.
  • Some emphasize that human intelligence crucially involves a world model tied to perception and action, which current LLMs largely lack.

Implications and Risks

  • Many express concern that models judged “more human than humans” could be especially persuasive in debate, propaganda, or scams.
  • Commenters note that RLHF explicitly trains models to be engaging and likable, which likely biases humans toward selecting them as “human.”
  • Others see this as a Goodhart’s law effect: once “sounds human” becomes the optimization target, systems become excellent at that specific surface criterion without deeper intelligence.