Please do not A/B test my workflow

A/B Testing in a Paid, “Professional” Tool

  • Many see silent A/B tests on a dev tool’s core behavior as unacceptable, especially when the tool is central to paid workflows.
  • A distinction is drawn between:
    • UI‑level A/B tests (button color, layout), and
    • changes to the output or behavior of the core tool (e.g., how plan mode works, what code gets written).
  • Critics argue this breaks reproducibility, makes debugging and sharing workflows impossible, and should require explicit opt‑in and clear labeling.
  • Others reply that A/B testing is standard for cloud software, is allowed by the ToS, and is needed to improve products; they see outrage as unrealistic given modern SaaS norms.

Determinism, Reliability, and “Professional” Use of LLMs

  • Some claim professional tools must be reliable and replicable; LLM nondeterminism plus hidden prompt changes violate that.
  • Counterpoints:
    • LLMs can be made largely deterministic via fixed seeds and greedy decoding (though major vendors don’t expose this consistently).
    • Real‑world professional systems (finance models, networks, sensors) are already noisy; “trust but verify” is the right posture.
  • Debate over whether LLMs are “like people” or “just autocomplete.” One side warns against anthropomorphism (Eliza effect); the other notes conversational ability and pair‑programming usefulness.
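The “deterministic via seeds” point above can be illustrated with a toy sampler; this is a sketch, not a real model (the vocabulary and weights are invented), and real deployments would also need fixed batching and kernels to be fully reproducible:

```python
import random

# Toy next-token distribution (invented for illustration).
VOCAB = ["the", "cat", "sat", "on", "mat"]
WEIGHTS = [0.40, 0.25, 0.15, 0.12, 0.08]

def generate(seed: int, n_tokens: int = 8) -> list[str]:
    """Sample a token sequence; a per-call seeded RNG makes runs reproducible."""
    rng = random.Random(seed)  # isolated RNG: global state can't leak in
    return [rng.choices(VOCAB, weights=WEIGHTS)[0] for _ in range(n_tokens)]

# Same seed -> byte-identical output across runs.
assert generate(42) == generate(42)
```

The same idea applied to a hosted LLM requires the vendor to expose the seed and to keep the serving stack (kernels, batching) stable, which is exactly the part the thread notes vendors don’t offer consistently.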

Claude Code, Plan Mode, and Product Quality

  • Multiple reports that plan mode recently degraded: terse, low‑detail plans, odd behavior, more friction.
  • Later clarification from an Anthropic employee: an experiment capped plans at ~40 lines to reduce rate‑limit hits; it showed little benefit and was ended.
  • Users complain about:
    • “Vibe‑coded” CLI, poor QA, unstable updates, inconsistent model quality before new releases.
    • Lack of controls to pin behavior, configure system prompts, or opt out of experiments.
  • Some share workarounds: custom workflows, external planning documents, open‑source harnesses plus Claude API.
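One concrete form of the “pin behavior” workaround above is calling the API directly with a date-pinned model ID and a user-controlled system prompt, rather than inheriting a tool’s bundled (and silently changeable) defaults. A minimal sketch of building such a request payload; the model ID, prompt, and task are illustrative assumptions, and actually sending it would require an API key and client library:

```python
# Sketch: pin model version and system prompt by constructing the API
# request yourself instead of relying on a tool's shifting defaults.
# The model ID and prompt text below are illustrative placeholders.

def build_request(task: str) -> dict:
    """Build a messages-API-style payload with behavior pinned explicitly."""
    return {
        "model": "claude-sonnet-4-20250514",  # date-pinned ID, not a floating alias
        "max_tokens": 1024,
        "system": "You are a planning assistant. Produce a detailed, numbered plan.",
        "messages": [{"role": "user", "content": task}],
        "temperature": 0,  # reduces (but does not eliminate) output variance
    }

req = build_request("Plan the refactor of module X.")
```

Pinning the dated model ID and owning the system prompt removes two of the silent variables the thread complains about; it does not, of course, protect against server-side changes behind the same model ID.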

Ethics, Consent, and User Impact

  • Strong view that undisclosed experiments on paying users are unethical, especially when they can disrupt work or cause psychological stress.
  • Others argue experimentation is unavoidable in fast‑moving AI products; the real issue is how much user harm is acceptable.
  • Suggestions include opt‑in testing with incentives, IRB‑style oversight, and clearer transparency/controls.

Cost, Value, and Lock‑In

  • Disagreement on whether $200/month is “cheap” or excessive; heavy users report their usage would cost far more billed at API token prices.
  • Concern that relying on a single closed vendor for core workflow is classic vendor lock‑in; calls to favor open‑source or self‑hosted alternatives, despite their slower iteration and lack of large‑scale A/B data.