Ontario auditors find doctors' AI note takers routinely blow basic facts
Scope of AI Note-Taker Problems
- Multiple anecdotes of LLM note-takers fabricating or distorting key details in meetings and medical visits.
- Examples include: a vendor “promising” something they did not; Zoom summaries misattributing statements; a runner’s knee visit turned into an osteoporosis diagnosis with invented symptoms.
- Users report that the tools can “get the gist” of simple, linear interactions but fail badly on nuanced, technical, or emotionally charged conversations.
Transcripts vs Summaries & Provenance
- Several commenters argue transcripts should be the legal/clinical ground truth, with optional human-written summaries.
- Others note speech-to-text itself is probabilistic and can also mislead if treated as authoritative.
- Strong support for timestamped links from summaries back to recordings (“provenance”) in non-medical settings, but concern this is harder in HIPAA-like environments.
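The provenance idea can be sketched as a small data model in which every summary claim cites the transcript segments that support it. This is a minimal illustration, not any vendor's format: the names `Segment`, `SummaryClaim`, `unsupported_claims`, and `render` are hypothetical, and the sample sentences are invented to echo the knee-visit anecdote.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One timestamped stretch of the recording (hypothetical schema)."""
    start_s: float
    end_s: float
    speaker: str
    text: str


@dataclass(frozen=True)
class SummaryClaim:
    """One summary sentence plus the transcript segments it cites."""
    text: str
    segment_ids: tuple  # indices into the transcript list


def format_timestamp(seconds):
    """Render seconds as MM:SS, truncating fractions of a second."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def unsupported_claims(claims, transcript):
    """Return claims that cite nothing, or cite a segment that does not exist."""
    return [
        c for c in claims
        if not c.segment_ids
        or any(i < 0 or i >= len(transcript) for i in c.segment_ids)
    ]


def render(claim, transcript):
    """Append the cited time ranges so a reader can jump to the recording."""
    spans = ", ".join(
        f"{format_timestamp(transcript[i].start_s)}-"
        f"{format_timestamp(transcript[i].end_s)}"
        for i in claim.segment_ids
    )
    return f"{claim.text} [{spans}]"


# Invented example data, for illustration only.
transcript = [
    Segment(12.0, 19.5, "patient", "My knee hurts after long runs."),
    Segment(20.0, 26.0, "doctor", "Let's rest it for two weeks."),
]
claims = [
    SummaryClaim("Patient reports knee pain after running.", (0,)),
    SummaryClaim("Patient diagnosed with osteoporosis.", ()),
]

flagged = unsupported_claims(claims, transcript)
```

A check like this only shows that a claim points at some part of the recording. It does not show that the cited segment actually says what the summary claims, so a human still has to follow the link.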
Medical Context, Risk, and Responsibility
- Many see AI scribes in healthcare as especially dangerous: mixing up drugs or diagnoses is unacceptable.
- Some argue human documentation is already error-prone, but others stress:
  - Machines must be better than humans to be worth using.
  - AI errors are qualitatively different (confident hallucinations of things never said).
- Patients are urged to check visit summaries and request corrections; some already do this routinely.
- Doctors report having to spend extra time correcting AI notes, sometimes feeling the tech is being forced on them.
Procurement, Incentives, and Data Exploitation
- Ontario’s vendor scoring is criticized: domestic presence heavily weighted, note accuracy only a small part of the score.
- Commenters worry less about short-term accuracy than about long-term incentives: real-time data flowing to insurers, pharma, and hospital billing, with little alignment to patient interests.
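The scoring criticism comes down to weighted-sum arithmetic: when accuracy carries a small weight, a large accuracy gap barely moves the total. The sketch below uses made-up weights and ratings purely to show the effect. They are assumptions for illustration, not the actual Ontario rubric or any real vendor's results.

```python
# Hypothetical weights, NOT the real procurement rubric.
WEIGHTS = {
    "domestic_presence": 0.30,
    "note_accuracy": 0.04,
    "other_criteria": 0.66,
}


def weighted_score(ratings, weights=WEIGHTS):
    """Sum of weight * rating, with ratings on a 0..1 scale."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * ratings[name] for name in weights)


# Two invented vendors that differ only in domestic presence and accuracy.
vendor_a = {"domestic_presence": 1.0, "note_accuracy": 0.40, "other_criteria": 0.8}
vendor_b = {"domestic_presence": 0.5, "note_accuracy": 0.95, "other_criteria": 0.8}

score_a = weighted_score(vendor_a)
score_b = weighted_score(vendor_b)
```

Under these assumed numbers, vendor A outscores vendor B despite far worse note accuracy, because the 0.55 accuracy gap is multiplied by a weight of only 0.04.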
Capabilities vs Reliability and “Knowing What It Doesn’t Know”
- Ongoing debate over whether model accuracy will naturally improve enough for critical use.
- Distinction raised between capability (benchmarks, impressive demos) and reliability (consistent, low-risk behavior in production).
- Extended side-discussion on confidence estimation and calibration:
  - Models can output probability distributions over tokens, but turning this into trustworthy “I don’t know” behavior remains unsolved in practice.
  - Some believe models could be trained to refuse answers more often; others think business incentives discourage visible uncertainty.
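To make the calibration point concrete, here is a rough sketch of one naive way to turn token probabilities into an abstention rule, plus a standard check (expected calibration error) of whether stated confidence matches observed accuracy. The function names, the 0.8 threshold, and the use of a geometric mean are illustrative assumptions, not a method any commenter or vendor described.

```python
import math


def sequence_confidence(token_logprobs):
    """Geometric mean of token probabilities: exp(mean log-probability)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_or_abstain(text, token_logprobs, threshold=0.8):
    """Return the text only if confidence clears an (assumed) threshold."""
    if sequence_confidence(token_logprobs) >= threshold:
        return text
    return "I don't know"


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

The threshold rule is only as good as the calibration behind it. A model that reports 0.9 confidence on statements that are right a quarter of the time will pass the threshold while fabricating, which is the failure the thread describes.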
Privacy, Recording, and Appropriate Use
- Split views on recording whole doctor–patient conversations:
  - One side sees comprehensive recording as aligned with the idea of medical records.
  - The other stresses the historic role of physicians as filters, privacy risks over a lifetime, and chilling effects on honest disclosure.
- Several argue that AI is being misapplied to high-accuracy back-end tasks like clinical documentation, and might be more appropriate for front-end intake, triage help, or form-filling—always with human verification.