Please do not A/B test my workflow
A/B Testing in a Paid, “Professional” Tool
- Many see silent A/B tests of a dev tool's core behavior as unacceptable, especially when that behavior is central to paid workflows.
- Distinction drawn between:
  - UI-level A/B tests (button color, layout) vs.
  - changing the output or behavior of the core tool (e.g., how plan mode works, what code gets written).
- Critics argue this breaks reproducibility, makes debugging and sharing workflows impossible, and should require explicit opt‑in and clear labeling.
- Others reply that A/B testing is standard for cloud software, is allowed by the ToS, and is needed to improve products; they see outrage as unrealistic given modern SaaS norms.
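The UI-vs-behavior distinction can be made concrete with a toy feature-flag sketch. Everything here is hypothetical (the function names and variants are invented for illustration), but the truncated-plan variant mirrors the ~40-line cap discussed in this thread:

```python
# Toy sketch of the distinction; all names and variants are hypothetical.

def render_button(variant: str) -> dict:
    """UI-level experiment: only presentation changes, output is unaffected."""
    return {"color": "blue" if variant == "A" else "green"}

def make_plan(task: str, variant: str) -> list:
    """Behavioral experiment: variant 'B' silently truncates the plan itself,
    changing what the tool actually produces for the same input."""
    steps = ["step %d: %s" % (i, task) for i in range(1, 61)]
    return steps if variant == "A" else steps[:40]
```

The first flag is the kind critics consider routine; the second changes results a user may be relying on, which is why it draws calls for opt-in and labeling.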
Determinism, Reliability, and “Professional” Use of LLMs
- Some claim professional tools must be reliable and replicable; LLM nondeterminism plus hidden prompt changes violate that.
- Counterpoints:
  - LLMs can be made deterministic via seeds (though major vendors don’t expose this consistently).
  - Real‑world professional systems (finance models, networks, sensors) are already noisy; “trust but verify” is the right posture.
- Debate over whether LLMs are “like people” or “just autocomplete.” One side warns against anthropomorphism (Eliza effect); the other notes conversational ability and pair‑programming usefulness.
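The "deterministic via seeds" point is just the usual property of seeded sampling, which a toy sampler illustrates (the weights stand in for model logits; nothing here is a real inference API):

```python
import random

# Minimal sketch: stochastic sampling becomes reproducible once the RNG
# seed is fixed. The weights are toy stand-ins for a model's token logits.
def sample_tokens(seed, weights, n=5):
    rng = random.Random(seed)
    vocab = range(len(weights))
    return [rng.choices(vocab, weights=weights)[0] for _ in range(n)]

run1 = sample_tokens(42, [0.1, 0.3, 0.6])
run2 = sample_tokens(42, [0.1, 0.3, 0.6])
assert run1 == run2  # same seed, identical "completion"
```

In practice the catch is the parenthetical above: hosted vendors rarely expose a seed parameter consistently, so users cannot pin outputs even though the underlying technique is straightforward.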
Claude Code, Plan Mode, and Product Quality
- Multiple reports that plan mode recently degraded: terse, low‑detail plans, odd behavior, more friction.
- Later clarification from an Anthropic employee: an experiment capped plans at ~40 lines to reduce rate‑limit hits; it showed little benefit and was ended.
- Users complain about:
  - a “vibe‑coded” CLI, poor QA, unstable updates, and inconsistent model quality before new releases;
  - a lack of controls to pin behavior, configure system prompts, or opt out of experiments.
- Some share workarounds: custom workflows, external planning documents, open‑source harnesses plus Claude API.
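The "external planning document" workaround amounts to keeping the pinned instructions in a local file the user controls and assembling the prompt themselves. A minimal sketch, assuming a hypothetical `build_prompt` helper and file layout (not any actual Claude Code feature):

```python
from pathlib import Path

def build_prompt(plan_path, user_request):
    """Hypothetical sketch: pin planning instructions in a local file so that
    updates or experiments in the hosted tool cannot silently change them."""
    plan = Path(plan_path).read_text()
    return "Follow this plan exactly:\n%s\n\nTask: %s" % (plan, user_request)
```

The resulting string can then be fed to any harness or API; the point is that the part of the workflow users most want to be stable lives outside the vendor's update channel.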
Ethics, Consent, and User Impact
- Strong view that undisclosed experiments on paying users are unethical, especially when they can disrupt work or cause psychological stress.
- Others argue experimentation is unavoidable in fast‑moving AI products; the real issue is how much user harm is acceptable.
- Suggestions include opt‑in testing with incentives, IRB‑style oversight, and clearer transparency/controls.
Cost, Value, and Lock‑In
- Disagreement on whether $200/month is “cheap” or excessive; heavy users report getting far more than that in implied token value.
- Concern that relying on a single closed vendor for core workflow is classic vendor lock‑in; calls to favor open‑source or self‑hosted alternatives, despite their slower iteration and lack of large‑scale A/B data.
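The "implied token value" claim is simple arithmetic. The prices and volumes below are illustrative assumptions, not actual Anthropic rates or anyone's reported usage:

```python
# All figures are hypothetical, chosen only to illustrate the comparison.
input_price_per_mtok = 3.00    # $ per 1M input tokens (assumed)
output_price_per_mtok = 15.00  # $ per 1M output tokens (assumed)
monthly_input_mtok = 50.0      # assumed heavy-user input volume
monthly_output_mtok = 10.0     # assumed heavy-user output volume

implied_value = (monthly_input_mtok * input_price_per_mtok
                 + monthly_output_mtok * output_price_per_mtok)
# Under these assumptions: 50*3 + 10*15 = $300 of implied API usage,
# versus a $200/month flat fee.
```

Whether the flat fee is "cheap" thus depends entirely on where a given user's volumes fall; light users can easily land below the break-even line.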