ChatGPT Health fails to recognise medical emergencies – study

Perceived Risks and Misuse

  • Many see it as reckless to deploy LLMs where errors can kill, especially if tied to insurers whose incentives favor denying care.
  • Concern that AI can be more easily steered into unethical behavior than humans bound by professional oaths.
  • Several argue current systems are only at “knowledgeable friend” level and should not be treated as professionals.

Reliability and Failure Modes

  • Multiple anecdotes of LLMs confidently hallucinating: wrong product features, non‑existent addresses, wrong environment in DevOps, bogus Sudoku moves.
  • In health contexts: missed diagnosis that later required emergency surgery; dangerous dosing in Google AI summaries; GP prescribing alcohol-heavy cough syrup to a pregnant woman based on ChatGPT; triage flags (e.g., suicide risk) disappearing when unrelated “normal” data is added.
  • People note LLMs sound authoritative, unlike WebMD-style reference pages, which may amplify over-trust.

Comparing AI and Doctors

  • Some doctors already use ChatGPT as an adjunct; proponents say “AI+expert” can be valuable, critics fear complacency makes it effectively “AI alone.”
  • Debate over “humans suck too”: anecdotes of serious missed emergencies by doctors; others push back that doctors as a group are still far more reliable.
  • Suggestions to benchmark: (A) doctors alone, (B) LLM alone, (C) doctors using LLMs.

Study Design and Ethics

  • Skeptics dislike studies where experts construct hypothetical scenarios and then judge AI against their own “gold standards,” preferring blinded comparisons with doctors.
  • Defenders argue real randomized AI-vs-doctor trials are ethically fraught; scenario-based evaluation is a necessary early step.
  • Others note scenarios don’t match messy, ambiguous real patient queries, limiting external validity.

Patient Behavior and Healthcare Access

  • High US healthcare costs and appointment backlogs push people to ChatGPT despite known risks; for some, the alternative is doing nothing.
  • Self-diagnosis (whether via Google or ChatGPT) can bias doctors, waste limited appointment time, or delay correct diagnosis; but informed patients can sometimes help.

Regulation, Deployment, and Data Privacy

  • Calls for full FDA-style trials and rejection of “move fast and break things” in medicine, countered by reminders that informal tools like Wikipedia already influence care.
  • Worries about “securely” linking medical records to AI systems, large attack surfaces, and future legal discovery of chat histories.
  • Some note ChatGPT Health and its HealthBench benchmark missing emergencies suggests serious external-validity and safety gaps.

Limits of LLMs vs Clinical Practice

  • Repeated emphasis that medical competence comes largely from years of hands-on rounds, messy real cases, tacit knowledge, and human interaction—none of which appear directly in training text.
  • Several argue this gap explains why models trained on the same textbooks as doctors still fail at real-world triage.