2026-02-27

ChatGPT Health fails to recognise medical emergencies – study

Perceived Risks and Misuse

Many see it as reckless to deploy LLMs where errors can kill, especially if tied to insurers whose incentives favor denying care.
Concern that AI can be more easily steered into unethical behavior than humans bound by professional oaths.
Several argue current systems are only at “knowledgeable friend” level and should not be treated as professionals.

Reliability and Failure Modes

Multiple anecdotes of LLMs confidently hallucinating: wrong product features, non‑existent addresses, wrong environment in DevOps, bogus Sudoku moves.
In health contexts: missed diagnosis that later required emergency surgery; dangerous dosing in Google AI summaries; GP prescribing alcohol-heavy cough syrup to a pregnant woman based on ChatGPT; triage flags (e.g., suicide risk) disappearing when unrelated “normal” data is added.
People note LLMs sound authoritative, unlike WebMD-style reference pages, which may amplify over-trust.

Comparing AI and Doctors

Some doctors already use ChatGPT as an adjunct; proponents say “AI+expert” can be valuable, critics fear complacency makes it effectively “AI alone.”
Debate over “humans suck too”: anecdotes of serious missed emergencies by doctors; others push back that doctors as a group are still far more reliable.
Suggestions to benchmark: (A) doctors alone, (B) LLM alone, (C) doctors using LLMs.

Study Design and Ethics

Skeptics dislike studies where experts construct hypothetical scenarios and then judge AI against their own “gold standards,” preferring blinded comparisons with doctors.
Defenders argue real randomized AI-vs-doctor trials are ethically fraught; scenario-based evaluation is a necessary early step.
Others note scenarios don’t match messy, ambiguous real patient queries, limiting external validity.

Patient Behavior and Healthcare Access

High US healthcare costs and appointment backlogs push people to ChatGPT despite known risks; for some, the alternative is doing nothing.
Self-diagnosis (whether via Google or ChatGPT) can bias doctors, waste limited appointment time, or delay correct diagnosis; but informed patients can sometimes help.

Regulation, Deployment, and Data Privacy

Calls for full FDA-style trials and rejection of “move fast and break things” in medicine, countered by reminders that informal tools like Wikipedia already influence care.
Worries about “securely” linking medical records to AI systems, large attack surfaces, and future legal discovery of chat histories.
Some note ChatGPT Health and its HealthBench benchmark missing emergencies suggests serious external-validity and safety gaps.

Limits of LLMs vs Clinical Practice

Repeated emphasis that medical competence comes largely from years of hands-on rounds, messy real cases, tacit knowledge, and human interaction—none of which appear directly in training text.
Several argue this gap explains why models trained on the same textbooks as doctors still fail at real-world triage.

Related topics