2026-05-14

Ontario auditors find doctors' AI note takers routinely blow basic facts

Scope of AI Note-Taker Problems

Multiple anecdotes of LLM note-takers fabricating or distorting key details in meetings and medical visits.
Examples include: a vendor “promising” something they did not; Zoom summaries misattributing statements; a runner’s knee visit turned into an osteoporosis diagnosis with invented symptoms.
Users report that the tools can “get the gist” of simple, linear interactions but fail badly on nuanced, technical, or emotionally charged conversations.

Transcripts vs Summaries & Provenance

Several commenters argue transcripts should be the legal/clinical ground truth, with optional human-written summaries.
Others note speech-to-text itself is probabilistic and can also mislead if treated as authoritative.
Strong support for timestamped links from summaries back to recordings (“provenance”) in non-medical settings, but concern that this is harder to do in HIPAA-like environments.
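The provenance idea can be made concrete with a small sketch. This is only an illustration under stated assumptions: `SummaryClaim`, `format_timestamp`, and `render_with_provenance` are hypothetical names, not any vendor's API, and the spans are assumed to be seconds from the start of the recording. Each summary line carries the span that supports it, and a line with no span is flagged instead of being passed through silently.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SummaryClaim:
    """One summary sentence plus the recording span that supports it (hypothetical type)."""
    text: str
    # (start, end) in seconds from the start of the recording; None if nothing supports it.
    span: Optional[Tuple[float, float]] = None


def format_timestamp(seconds: float) -> str:
    """Render a second offset as HH:MM:SS, truncating fractional seconds."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_with_provenance(claims: List[SummaryClaim]) -> List[str]:
    """Append a timestamp range to each claim, or a warning when no span exists."""
    lines = []
    for claim in claims:
        if claim.span is None:
            lines.append(f"{claim.text} [NO SOURCE SPAN: verify manually]")
        else:
            start, end = claim.span
            lines.append(
                f"{claim.text} [{format_timestamp(start)}-{format_timestamp(end)}]"
            )
    return lines


if __name__ == "__main__":
    demo = [
        SummaryClaim("Vendor will send a quote", (62, 75)),
        SummaryClaim("Vendor promised delivery by Friday"),  # nothing in the recording backs this
    ]
    for line in render_with_provenance(demo):
        print(line)
```

A reviewer can then jump straight to the cited span, and any sentence without one stands out as a candidate fabrication. The sketch says nothing about how spans are produced or how recordings would be stored and accessed in a regulated setting, which is the harder part commenters point to.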

Medical Context, Risk, and Responsibility

Many see AI scribes in healthcare as especially dangerous: mixing up drugs or diagnoses is unacceptable.
Some argue human documentation is already error-prone, but others stress:
- Machines must be better than humans to be worth using.
- AI errors are qualitatively different (confident hallucinations of things never said).
Patients are urged to check visit summaries and request corrections; some already do this routinely.
Doctors report having to spend extra time correcting AI notes, sometimes feeling the tech is being forced on them.

Procurement, Incentives, and Data Exploitation

Ontario’s vendor scoring is criticized: domestic presence heavily weighted, note accuracy only a small part of the score.
Commenters worry less about short-term accuracy than about long-term incentives: real-time data feeds into insurers, pharma, and hospital billing, with little alignment to patient interests.

Capabilities vs Reliability and “Knowing What It Doesn’t Know”

Ongoing debate over whether model accuracy will naturally improve enough for critical use.
Distinction raised between capability (benchmarks, impressive demos) and reliability (consistent, low-risk behavior in production).
Extended side-discussion on confidence estimation and calibration:
- Models can output probability distributions over tokens, but turning this into trustworthy “I don’t know” behavior remains unsolved in practice.
- Some believe models could be trained to refuse answers more often; others think business incentives discourage visible uncertainty.
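To illustrate why token probabilities alone do not yield a trustworthy “I don’t know,” here is a minimal sketch of the naive approach. It assumes per-token log probabilities are available for a generated sentence (how they are obtained is left out); the function names and the 0.8 threshold are hypothetical choices for illustration, not anything proposed in the thread.

```python
import math
from typing import Sequence


def mean_token_confidence(logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities, computed as exp(mean(log p))."""
    if not logprobs:
        raise ValueError("need at least one token log probability")
    return math.exp(sum(logprobs) / len(logprobs))


def answer_or_abstain(text: str, logprobs: Sequence[float], threshold: float = 0.8) -> str:
    """Return the text if average token confidence clears the threshold, else abstain.

    The threshold is an arbitrary illustrative value, not a calibrated one.
    """
    if mean_token_confidence(logprobs) >= threshold:
        return text
    return "I don't know"


if __name__ == "__main__":
    confident = [math.log(0.99), math.log(0.95)]
    shaky = [math.log(0.5)] * 3
    print(answer_or_abstain("metformin 500 mg", confident))
    print(answer_or_abstain("alendronate 70 mg", shaky))
```

The weakness is visible in the design: the score measures how fluent the model found its own wording, not whether the statement was ever said in the visit. A hallucinated sentence made of high-probability tokens passes the check, which is consistent with the commenters' view that the problem remains unsolved in practice.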

Privacy, Recording, and Appropriate Use

Split views on recording whole doctor–patient conversations:
- One side sees comprehensive recording as aligned with the idea of medical records.
- The other stresses the historic role of physicians as filters, privacy risks over a lifetime, and chilling effects on honest disclosure.
Several argue that AI is being misapplied to back-end tasks that demand high accuracy, such as clinical documentation, and might be better suited to front-end intake, triage help, or form-filling, always with human verification.
