Please do not A/B test my workflow
A/B Testing in a Paid, “Professional” Tool
- Many see silent A/B tests of a dev tool's core behavior as unacceptable, especially when that behavior is central to paid workflows.
- Distinction drawn between:
  - UI-level A/B tests (button color, layout) vs.
  - changing the output or behavior of the core tool (e.g., how plan mode works, what code gets written).
- Critics argue this breaks reproducibility, makes debugging and sharing workflows impossible, and should require explicit opt‑in and clear labeling.
- Others reply that A/B testing is standard for cloud software, is allowed by the ToS, and is needed to improve products; they see outrage as unrealistic given modern SaaS norms.
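The UI-vs-behavior distinction can be made concrete with a toy feature-flag sketch. Everything here is hypothetical (the function names and variants are invented for illustration), but the truncated-plan variant mirrors the ~40-line cap discussed in this thread:

```python
# Toy sketch of the distinction; all names and variants are hypothetical.

def render_button(variant: str) -> dict:
    """UI-level experiment: only presentation changes, output is unaffected."""
    return {"color": "blue" if variant == "A" else "green"}

def make_plan(task: str, variant: str) -> list:
    """Behavioral experiment: variant 'B' silently truncates the plan itself,
    changing what the tool actually produces for the same input."""
    steps = ["step %d: %s" % (i, task) for i in range(1, 61)]
    return steps if variant == "A" else steps[:40]
```

The first flag is the kind critics consider routine; the second changes results a user may be relying on, which is why it draws calls for opt-in and labeling.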
Determinism, Reliability, and “Professional” Use of LLMs
- Some claim professional tools must be reliable and replicable; LLM nondeterminism plus hidden prompt changes violate that.
- Counterpoints:
  - LLMs can be made deterministic via seeds (though major vendors don’t expose this consistently).
  - Real‑world professional systems (finance models, networks, sensors) are already noisy; “trust but verify” is the right posture.
- Debate over whether LLMs are “like people” or “just autocomplete.” One side warns against anthropomorphism (Eliza effect); the other notes conversational ability and pair‑programming usefulness.
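The "deterministic via seeds" point is just the usual property of seeded sampling, which a toy sampler illustrates (the weights stand in for model logits; nothing here is a real inference API):

```python
import random

# Minimal sketch: stochastic sampling becomes reproducible once the RNG
# seed is fixed. The weights are toy stand-ins for a model's token logits.
def sample_tokens(seed, weights, n=5):
    rng = random.Random(seed)
    vocab = range(len(weights))
    return [rng.choices(vocab, weights=weights)[0] for _ in range(n)]

run1 = sample_tokens(42, [0.1, 0.3, 0.6])
run2 = sample_tokens(42, [0.1, 0.3, 0.6])
assert run1 == run2  # same seed, identical "completion"
```

In practice the catch is the parenthetical above: hosted vendors rarely expose a seed parameter consistently, so users cannot pin outputs even though the underlying technique is straightforward.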
Claude Code, Plan Mode, and Product Quality
- Multiple reports that plan mode recently degraded: terse, low‑detail plans, odd behavior, more friction.
- Later clarification from an Anthropic employee: an experiment capped plans at ~40 lines to reduce rate‑limit hits; it showed little benefit and was ended.
- Users complain about:
  - a “vibe‑coded” CLI, poor QA, unstable updates, and inconsistent model quality before new releases;
  - a lack of controls to pin behavior, configure system prompts, or opt out of experiments.
- Some share workarounds: custom workflows, external planning documents, open‑source harnesses plus Claude API.
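The "external planning document" workaround amounts to keeping the pinned instructions in a local file the user controls and assembling the prompt themselves. A minimal sketch, assuming a hypothetical `build_prompt` helper and file layout (not any actual Claude Code feature):

```python
from pathlib import Path

def build_prompt(plan_path, user_request):
    """Hypothetical sketch: pin planning instructions in a local file so that
    updates or experiments in the hosted tool cannot silently change them."""
    plan = Path(plan_path).read_text()
    return "Follow this plan exactly:\n%s\n\nTask: %s" % (plan, user_request)
```

The resulting string can then be fed to any harness or API; the point is that the part of the workflow users most want to be stable lives outside the vendor's update channel.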
Ethics, Consent, and User Impact
- Strong view that undisclosed experiments on paying users are unethical, especially when they can disrupt work or cause psychological stress.
- Others argue experimentation is unavoidable in fast‑moving AI products; the real issue is how much user harm is acceptable.
- Suggestions include opt‑in testing with incentives, IRB‑style oversight, and clearer transparency/controls.
Cost, Value, and Lock‑In
- Disagreement on whether $200/month is “cheap” or excessive; heavy users report getting far more than that in implied token value.
- Concern that relying on a single closed vendor for core workflow is classic vendor lock‑in; calls to favor open‑source or self‑hosted alternatives, despite their slower iteration and lack of large‑scale A/B data.
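The "implied token value" claim is simple arithmetic. The prices and volumes below are illustrative assumptions, not actual Anthropic rates or anyone's reported usage:

```python
# All figures are hypothetical, chosen only to illustrate the comparison.
input_price_per_mtok = 3.00    # $ per 1M input tokens (assumed)
output_price_per_mtok = 15.00  # $ per 1M output tokens (assumed)
monthly_input_mtok = 50.0      # assumed heavy-user input volume
monthly_output_mtok = 10.0     # assumed heavy-user output volume

implied_value = (monthly_input_mtok * input_price_per_mtok
                 + monthly_output_mtok * output_price_per_mtok)
# Under these assumptions: 50*3 + 10*15 = $300 of implied API usage,
# versus a $200/month flat fee.
```

Whether the flat fee is "cheap" thus depends entirely on where a given user's volumes fall; light users can easily land below the break-even line.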