A postmortem of three recent issues
Scope and Impact of the Incidents
- Three issues: requests misrouted to long-context servers, output corruption caused by a TPU misconfiguration, and a compiler bug in an approximate top-k operation.
- Debate over impact: some emphasize the “<0.0004%” figure for certain request classes and the short time windows involved; others highlight that “~30% of Claude Code users saw at least one degraded response,” calling that “huge,” especially given sticky routing (a rough probability sketch reconciling the two figures follows this list).
- Users report very noticeable quality drops over weeks, especially for coding and at peak times.
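A rough way to reconcile the two impact figures (all numbers below are illustrative assumptions, not Anthropic's): under an independent-requests model, a tiny per-request error rate stays tiny per user, so reaching ~30% of users requires errors to cluster on particular users, which is exactly what sticky routing produces.

```python
def p_at_least_one(per_request_rate: float, requests_per_user: int) -> float:
    """P(a user sees >= 1 degraded response), assuming independent requests."""
    return 1.0 - (1.0 - per_request_rate) ** requests_per_user

# Hypothetical: a 0.0004% per-request rate over 500 requests per user
# yields only ~0.2% of users affected under independence...
print(p_at_least_one(0.000004, 500))  # ~0.002

# ...so a "~30% of users" figure implies clustering: with sticky routing,
# users pinned to a misconfigured server pool see degraded responses
# repeatedly, while the fleet-wide per-request average stays small.
```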
Accountability, SLAs, and Compensation
- Several commenters argue that for a paid, high-priced service, unannounced quality degradation with no clear metrics or remediation path is unacceptable.
- Others note that the current ToS explicitly disclaims quality guarantees and see this as consistent with today's LLM landscape.
- Comparisons are drawn between SLAs for uptime/throughput, which are measurable, and “answer quality,” which is hard to define formally.
Privacy, Data Access, and Feedback
- Some initially worry that internal privacy policies hindered debugging; others argue that such constraints are expected and desirable.
- Clarification that thumbs‑down triggers an explicit modal saying the whole conversation is sent for review; some find this adequate, others think many users still won’t grasp the privacy implication.
- Discussion of whether Anthropic technically limits internal data access or merely relies on contractual language.
Infrastructure, Routing, and Hardware Details
- Surprise that Claude is heavily served on TPUs and across multiple clouds (Google Vertex AI, AWS Bedrock, and Anthropic's own stack).
- Confusion about how much Anthropic can influence AWS Bedrock infrastructure; commenters clarify that Anthropic supplies components (such as load-balancer containers) while the cloud providers operate them.
- Some want visibility into which hardware/stack a given request is hitting.
Technical Causes: Sampling, Top‑k, and Long Context
- Multiple explanations of how LLMs output token probabilities and how sampling (temperature, top-k/top-p) and approximate top-k kernels can go wrong, e.g., by selecting improbable tokens or characters from other languages; a sketch of the standard sampling pipeline follows this list.
- Speculation that long-context variants (1M-token context) may be less accurate on short inputs due to RoPE scaling or similar position-interpolation techniques; see the second sketch below.
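To make the sampling discussion concrete, here is a minimal NumPy sketch of the standard temperature + top-k + top-p pipeline (not Anthropic's implementation; function and parameter names are generic). Production stacks replace the exact sort with a fused, often approximate, accelerator kernel; the failure mode described above corresponds to that kernel returning indices of low-probability tokens, which then enter the sampling pool and can be emitted.

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=50, top_p=0.95, rng=None):
    """Temperature + top-k + top-p (nucleus) sampling over raw logits."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)

    # Exact top-k: mask everything below the k-th largest logit.
    kth = np.sort(logits)[-top_k]
    logits = np.where(logits < kth, -np.inf, logits)

    # Softmax over survivors.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p: keep the smallest prefix (by descending probability) whose
    # cumulative mass reaches top_p, then renormalize and sample.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    return rng.choice(keep, p=probs[keep] / probs[keep].sum())
```

If a buggy approximate top-k kept the wrong candidates at the masking step, a token from the far tail of the distribution (for example, a character from another script) would survive into `keep` with nonzero probability.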
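On the long-context speculation specifically: one commonly cited mechanism is position interpolation, where RoPE positions are divided by a scale factor so a much longer window fits the trained range. The sketch below uses illustrative parameters (not Claude's actual configuration) to show the side effect commenters point to: the angular separation between adjacent tokens shrinks by the scale factor, making nearby positions harder to distinguish on short inputs.

```python
import numpy as np

def rope_angles(position, dim=8, base=10000.0, scale=1.0):
    """RoPE rotation angles for one position; scale > 1 models
    position interpolation for a stretched context window."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return (position / scale) * inv_freq

# Angular gap between adjacent tokens, unscaled vs. 8x interpolation:
delta = rope_angles(11) - rope_angles(10)
delta_scaled = rope_angles(11, scale=8.0) - rope_angles(10, scale=8.0)
print(delta[:2])         # [1.0, 0.1]
print(delta_scaled[:2])  # 8x smaller: [0.125, 0.0125]
```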
Reliability, Status Pages, and Trust
- The status page shows many incidents; some users say this matches real instability, while others praise Anthropic as unusually honest compared with providers who under-report outages.
- Some argue visible instability undermines enterprise confidence; others say customers presently prioritize model quality over reliability.
Testing Culture and Postmortem Quality
- Several readers criticize the postmortem for leaning on “more evals” instead of robust unit/integration tests for deterministic components (routing, sampling kernels, XLA code); an example of such a test follows this list.
- Concern that multiple independent code paths (different hardware and stacks) allow silent regressions without explicit version bumps.
- Some praise the technical transparency; others see the tone as self‑aggrandizing and light on concrete prevention measures.
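As an example of the deterministic testing commenters are asking for, here is a hypothetical pytest-style regression test that pins an approximate top-k kernel against an exact reference; `approx_top_k` is a NumPy stand-in for whatever accelerator kernel is actually deployed, and in a real suite it would run once per hardware/stack combination.

```python
import numpy as np

def approx_top_k(values, k):
    # Hypothetical stand-in for the accelerator kernel under test;
    # in production this would invoke the XLA/TPU implementation.
    return set(np.argpartition(values, -k)[-k:].tolist())

def test_approx_top_k_matches_exact_reference():
    rng = np.random.default_rng(0)
    for size in (64, 1_000, 50_000):
        for k in (1, 10, 40):
            values = rng.standard_normal(size)  # distinct with prob. ~1
            exact = set(np.argsort(values)[-k:].tolist())
            assert approx_top_k(values, k) == exact, (size, k)
    # Ties are where approximate kernels most often diverge; any k of
    # the tied indices is acceptable, so check only the set size.
    assert len(approx_top_k(np.zeros(128), 5)) == 5

test_approx_top_k_matches_exact_reference()
```

A test like this runs in milliseconds and would catch a kernel that silently starts returning wrong indices, independent of any model-quality eval.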
Business Incentives, Quality Drift, and UX
- Persistent suspicion that vendors may be tempted to quietly degrade models or quantize them to cut costs, given how weak external verifiability is.
- Comparisons to other LLM providers with similar unexplained degradations.
- Frustration over support responsiveness, subscription management, and UX rough edges (e.g., login/payment quirks), despite strong model capabilities.