A postmortem of three recent issues

Scope and Impact of the Incidents

  • Three issues: requests misrouted to long‑context servers, output corruption from a TPU misconfiguration, and a compiler bug in the approximate top‑k operation.
  • Debate over impact: some emphasize the “<0.0004%” figure for certain requests and the short time windows; others highlight that “~30% of Claude Code users saw at least one degraded response,” calling that “huge,” especially given sticky routing (see the probability sketch after this list).
  • Users report very noticeable quality drops over weeks, especially for coding and at peak times.
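
As a rough illustration of why both framings can be true at once, here is a minimal sketch of the “at least one degraded response” arithmetic, assuming independent requests; the rates and request counts below are made up for illustration and are not the postmortem’s figures.

```python
# Minimal sketch: a small per-request degradation rate can still mean a large
# share of users seeing at least one bad response, because active users send
# many requests. (Sticky routing, which pins a user to an affected pool, makes
# per-user exposure worse than this independence assumption suggests.)
def p_at_least_one(per_request_rate: float, requests_per_user: int) -> float:
    """P(user sees >= 1 degraded response), assuming independent requests."""
    return 1.0 - (1.0 - per_request_rate) ** requests_per_user

# Illustrative numbers only (not taken from the postmortem):
for rate in (0.000004, 0.001, 0.01):      # 0.0004%, 0.1%, 1% of requests degraded
    for n in (50, 500, 5000):             # light, medium, heavy user over the window
        print(f"rate={rate:.4%}  requests={n:5d}  "
              f"P(>=1 degraded)={p_at_least_one(rate, n):.3f}")
```

Under independence, even a 1% per‑request rate leaves a heavy user with a near‑certain chance of seeing at least one bad response, which is why per‑request and per‑user framings diverge so sharply.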

Accountability, SLAs, and Compensation

  • Several commenters argue that for a paid, high‑priced service, random quality degradation without clear metrics or remediation is unacceptable.
  • Others note current ToS explicitly disclaim quality guarantees and see this as consistent with today’s LLM landscape.
  • Comparisons are drawn between SLAs for uptime/throughput, which are straightforward to specify, and “answer quality,” which is hard to measure formally.

Privacy, Data Access, and Feedback

  • Some initially worry that internal privacy policies hindered debugging; others note that such access limits are expected and desirable.
  • Clarification that thumbs‑down triggers an explicit modal saying the whole conversation is sent for review; some find this adequate, others think many users still won’t grasp the privacy implication.
  • Discussion of whether Anthropic technically restricts internal data access or merely relies on contractual language.

Infrastructure, Routing, and Hardware Details

  • Surprise that Claude is heavily served on TPUs and via multiple clouds (Google Vertex AI, AWS Bedrock, and Anthropic’s own stack).
  • Confusion about how much Anthropic can influence AWS Bedrock infrastructure; clarified that Anthropic provides components (like load balancer containers) but cloud providers operate them.
  • Some want visibility into which hardware/stack a given request is hitting.

Technical Causes: Sampling, Top‑k, and Long Context

  • Multiple explanations of how LLMs output token probabilities and how sampling (temperature, top‑k/top‑p) and approximate top‑k kernels can go wrong, e.g. by selecting improbable tokens or characters from other languages (see the sampling sketch after this list).
  • Speculation that long‑context variants (1M context) may be less accurate on short inputs due to RoPE scaling or similar techniques (see the RoPE sketch after the sampling example).
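
To make the sampling discussion concrete, here is a minimal sketch of temperature plus top‑k sampling over one decoding step’s logits, with a deliberately broken candidate‑selection step standing in for a miscompiled approximate top‑k kernel. The function names and the exact failure mode are illustrative, not Anthropic’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits: np.ndarray, temperature: float = 1.0, k: int = 50) -> int:
    """Temperature + exact top-k sampling over one decoding step's logits."""
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    top = np.argsort(probs)[-k:]                  # exact top-k candidate set
    p = probs[top] / probs[top].sum()             # renormalize over the candidates
    return int(top[rng.choice(k, p=p)])

def buggy_sample_next_token(logits: np.ndarray, temperature: float = 1.0, k: int = 50) -> int:
    """Same sampler, but with a wrong candidate set (a stand-in for a broken
    approximate top-k): tokens the model ranked near zero (for example, tokens
    from another script) can now be emitted, which reads as stray foreign text."""
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    wrong = rng.choice(len(probs), size=k, replace=False)  # ignores the ranking entirely
    p = probs[wrong] / probs[wrong].sum()
    return int(wrong[rng.choice(k, p=p)])
```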
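
The long‑context speculation is harder to pin down, but a minimal sketch of RoPE position interpolation shows the effect commenters are guessing at: compressing positions so a longer window fits the trained range also shrinks the angular separation between nearby tokens in short prompts. The 8x factor and dimensions below are made up for illustration.

```python
import numpy as np

def rope_angles(position: int, dim: int = 8, base: float = 10000.0,
                scale: float = 1.0) -> np.ndarray:
    """Rotary-embedding angles for one position; scale < 1 is position
    interpolation, one common way to stretch a model to longer contexts."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return (position * scale) * inv_freq

# With an aggressive (made-up) 8x interpolation, positions 1..3 land on angles
# eight times closer together than the model saw during training, which is a
# plausible but unconfirmed reason a long-context variant could be slightly
# worse on short inputs.
for pos in (1, 2, 3):
    print(pos, rope_angles(pos)[:2], rope_angles(pos, scale=1/8)[:2])
```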

Reliability, Status Pages, and Trust

  • The status page shows many incidents; some users say it matches real instability, while others praise Anthropic for being unusually honest compared with providers who under‑report outages.
  • Some argue visible instability undermines enterprise confidence; others say customers presently prioritize model quality over reliability.

Testing Culture and Postmortem Quality

  • Several readers criticize the postmortem for leaning on “more evals” instead of robust unit/integration tests for deterministic components such as routing, sampling kernels, and XLA code (see the test sketch after this list).
  • Concern that multiple independent code paths (different hardware and stacks) allow silent regressions without explicit version bumps.
  • Some praise the technical transparency; others see the tone as self‑aggrandizing and light on concrete prevention measures.
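
A minimal sketch of the kind of deterministic test those readers have in mind: pin the inputs and compare the approximate top‑k path against an exact reference, so a kernel or compiler regression fails in CI instead of surfacing as garbled output. `approximate_top_k` here is a hypothetical stand‑in, not Anthropic’s kernel.

```python
import numpy as np

def approximate_top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Hypothetical stand-in for the approximate kernel under test."""
    return np.argpartition(scores, -k)[-k:]

def test_top_k_matches_exact_reference():
    """Deterministic check against an exact argsort reference on fixed seeds,
    across a few vocabulary sizes, so a silent kernel regression trips an
    assertion rather than degrading live sampling."""
    rng = np.random.default_rng(1234)            # fixed seed: fully reproducible
    for vocab, k in [(1_000, 40), (50_000, 64), (200_000, 128)]:
        scores = rng.standard_normal(vocab)
        expected = set(np.argsort(scores)[-k:].tolist())
        actual = set(approximate_top_k(scores, k).tolist())
        assert actual == expected, f"top-{k} candidate set mismatch at vocab={vocab}"
```

The point raised in the thread is that checks like this would have to run per hardware and compiler combination, since the regressions in question were specific to individual serving paths.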

Business Incentives, Quality Drift, and UX

  • Persistent suspicion that vendors may be tempted to quietly degrade or quantize models to cut costs, given weak external verifiability.
  • Comparisons to other LLM providers with similar unexplained degradations.
  • Frustration over support responsiveness, subscription management, and UX rough edges (e.g., login/payment quirks), despite strong model capabilities.