2024-07-04

The sad state of property-based testing libraries

Stateful and Parallel Property-Based Testing

Some argue state-machine frameworks add too much ceremony; they prefer hand-rolled “list of commands + property” tests for stateful systems.
Others highlight benefits of dedicated frameworks: parallel/linearizability checking, automatic management of model state and preconditions, and systematic exploration of interleavings.
There is debate over how practical it is to control scheduling and interleavings on mainstream runtimes, with examples like .NET tooling (e.g., IL rewriting) and specialized concurrency test frameworks.
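
For readers unfamiliar with the hand-rolled "list of commands + property" style, a minimal sketch follows. The `RingBuffer` class, the command names, and the capacity are hypothetical and chosen only for illustration; the sketch checks sequential behavior against a model and does not attempt parallel or linearizability checking.

```python
import random
from collections import deque


class RingBuffer:
    """Hypothetical system under test: a fixed-capacity FIFO queue."""

    def __init__(self, capacity):
        self._items = [None] * capacity
        self._head = 0
        self._size = 0

    def push(self, x):
        if self._size == len(self._items):
            return False  # full: reject the element
        tail = (self._head + self._size) % len(self._items)
        self._items[tail] = x
        self._size += 1
        return True

    def pop(self):
        if self._size == 0:
            return None  # empty
        x = self._items[self._head]
        self._head = (self._head + 1) % len(self._items)
        self._size -= 1
        return x


def gen_commands(rng, max_len=30):
    """Generate a random list of ("push", value) / ("pop", None) commands."""
    cmds = []
    for _ in range(rng.randint(0, max_len)):
        if rng.random() < 0.6:
            cmds.append(("push", rng.randint(0, 9)))
        else:
            cmds.append(("pop", None))
    return cmds


def agrees_with_model(cmds, capacity=4):
    """Property: every response matches a trivially correct deque model."""
    real, model = RingBuffer(capacity), deque()
    for name, arg in cmds:
        if name == "push":
            expected = len(model) < capacity
            if expected:
                model.append(arg)
            if real.push(arg) != expected:
                return False
        else:
            expected = model.popleft() if model else None
            if real.pop() != expected:
                return False
    return True


def run(seed, trials=200):
    rng = random.Random(seed)
    for _ in range(trials):
        cmds = gen_commands(rng)
        assert agrees_with_model(cmds), f"seed={seed} cmds={cmds}"
```

What this version leaves to the author is exactly what the dedicated frameworks automate: tracking model state, expressing preconditions, shrinking the command list, and exploring interleavings.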

Shrinking Strategies and Generators

Commenters repeatedly describe shrinking as the main practical value of property-based testing.
Three main approaches are contrasted:
- Value-level shrinkers (QuickCheck-style) that often break invariants.
- Integrated “rose tree” shrinking (Hedgehog), which respects generators but struggles with monadic bind.
- Internal shrinking (Hypothesis), which shrinks the underlying choice stream; usually “just works” but can misbehave with complex generators.
Applicative and monadic generators are also contrasted: applicative structure lets each component shrink independently, while monadic sequencing makes later draws depend on earlier values, which entangles shrink behavior.
Alternative designs like “internal integrated shrinking” (e.g., falsify) and LazySmallcheck’s laziness/coverage-style exploration are discussed as promising.
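
To make the "shrink the underlying choice stream" idea concrete, here is a toy sketch. It is not Hypothesis's or falsify's actual algorithm; the `Choices` class, the `rerun` and `shrink` helpers, and the even-number generator are all hypothetical. The generator draws bounded integers from a recorded sequence, and the shrinker edits that sequence and re-runs the generator, so any invariant the generator enforces (here, evenness) survives shrinking.

```python
import random


class Overrun(Exception):
    """Raised when a replayed choice sequence is too short for the generator."""


class Choices:
    """Records fresh random choices, or replays a fixed list of them."""

    def __init__(self, rng=None, replay=None):
        self.rng = rng
        self.replay = replay
        self.taken = []

    def draw(self, bound):
        """Return an integer in [0, bound] and record it."""
        if self.replay is None:
            value = self.rng.randint(0, bound)
        elif len(self.taken) < len(self.replay):
            value = min(self.replay[len(self.taken)], bound)  # clamp into range
        else:
            raise Overrun
        self.taken.append(value)
        return value


def gen_even_list(c):
    # Monadic dependency: the first draw decides how many draws follow.
    n = c.draw(8)
    return [2 * c.draw(50) for _ in range(n)]


def rerun(choices, gen):
    """Replay a choice list; return (choices actually used, value) or None."""
    c = Choices(replay=choices)
    try:
        value = gen(c)
    except Overrun:
        return None
    return c.taken, value


def shortlex(choices):
    """Order choice lists by length first, then lexicographically."""
    return (len(choices), choices)


def shrink(choices, gen, fails):
    """Greedily delete, zero, or decrement single choices while the test fails."""
    best = list(choices)
    improved = True
    while improved:
        improved = False
        for i in range(len(best)):
            candidates = [best[:i] + best[i + 1:]]
            if best[i] > 0:
                candidates.append(best[:i] + [0] + best[i + 1:])
                candidates.append(best[:i] + [best[i] - 1] + best[i + 1:])
            for cand in candidates:
                result = rerun(cand, gen)
                if result and fails(result[1]) and shortlex(result[0]) < shortlex(best):
                    best, improved = result[0], True
                    break
            if improved:
                break
    return best
```

Calling `shrink([3, 5, 30, 10], gen_even_list, lambda xs: sum(xs) >= 40)` starts from the list `[10, 60, 20]` and returns a smaller choice sequence whose decoded list is still failing and still all even. Because the length draw and the element draws are coupled, a greedy pass like this can stop at a local minimum rather than the smallest possible counterexample, which is one small-scale illustration of how monadic generators complicate shrinking.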

Hypothesis: Strengths and Frustrations

Fans praise Hypothesis for automatic shrinking, “nasty” edge-case generation (e.g., special floats), and strong DX compared to classic libraries.
Critics report pathological behavior (e.g., only generating trivial cases like zero matrices) and say it can be “far from just working,” especially with large or heavily filtered data.
There is disagreement over whether such failures reflect user misuse (e.g., heavy filtering) or fundamental limitations of Hypothesis’ design.
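
The cost of heavy filtering can be illustrated without any library. The sketch below is a generic, library-independent comparison and does not reproduce Hypothesis internals; the function names, the "sorted list" precondition, and the sizes are assumptions made for the example. It contrasts rejection sampling (generate, then discard) with constructive generation (build a valid value directly).

```python
import random


def gen_list(rng, size=6, hi=99):
    """Unconstrained generator: a list of random integers."""
    return [rng.randint(0, hi) for _ in range(size)]


def is_sorted(xs):
    """The precondition the test needs its inputs to satisfy."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


def filtered(rng, attempts):
    """Rejection sampling: generate freely, keep only inputs that pass."""
    return [xs for xs in (gen_list(rng) for _ in range(attempts)) if is_sorted(xs)]


def constructed(rng, attempts):
    """Constructive generation: every generated input is valid by design."""
    return [sorted(gen_list(rng)) for _ in range(attempts)]


def acceptance_rates(seed=0, attempts=2000):
    """Fraction of attempts that yield a usable input, for each approach."""
    rng = random.Random(seed)
    kept = len(filtered(rng, attempts))
    built = len(constructed(rng, attempts))
    return kept / attempts, built / attempts
```

With six elements drawn from 0..99, only a small fraction of random lists happen to be sorted, so the filtered generator wastes almost all of its budget while the constructive one wastes none. In Hypothesis terms the analogous choice is roughly between a strategy's `.filter(...)` and `.map(...)`; whether hitting this wall counts as misuse or as a design limitation is the disagreement described above.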

Fuzzing vs Property-Based Testing

Several comments argue modern coverage-guided fuzzing (libFuzzer, AFL++, Go fuzzing) has effectively converged with property testing when combined with structured inputs.
Structure-aware fuzzing (e.g., deriving generators from types, command sequences for stateful APIs) can reach deep bugs and high coverage, sometimes more easily than classic PBT.
Others stress that PBT is more about specifying and documenting behavioral properties, while fuzzing is about coverage and finding crashes; boundaries are acknowledged as blurry.
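
One way to see the convergence is a fuzz entry point that decodes raw bytes into a command sequence and then asserts a property. The sketch below is an assumption-laden illustration: `SortedSet`, `decode_commands`, and `fuzz_one` are hypothetical names, and the driver feeds plain random bytes rather than coverage-guided input, which a real fuzzer such as libFuzzer or AFL++ would supply.

```python
import bisect
import random


class SortedSet:
    """Hypothetical system under test: a set stored as a sorted list."""

    def __init__(self):
        self.items = []

    def insert(self, x):
        i = bisect.bisect_left(self.items, x)
        if i == len(self.items) or self.items[i] != x:
            self.items.insert(i, x)

    def remove(self, x):
        i = bisect.bisect_left(self.items, x)
        if i < len(self.items) and self.items[i] == x:
            del self.items[i]

    def clear(self):
        self.items.clear()


def decode_commands(data):
    """Interpret fuzz bytes as commands; every byte string decodes to something."""
    cmds = []
    it = iter(data)
    for op in it:
        if op % 3 == 0:
            cmds.append(("insert", next(it, 0) % 16))
        elif op % 3 == 1:
            cmds.append(("remove", next(it, 0) % 16))
        else:
            cmds.append(("clear", None))
    return cmds


def fuzz_one(data):
    """Fuzz target: the assertion is the property, a crash is the finding."""
    sut, model = SortedSet(), set()
    for name, arg in decode_commands(data):
        if name == "insert":
            sut.insert(arg)
            model.add(arg)
        elif name == "remove":
            sut.remove(arg)
            model.discard(arg)
        else:
            sut.clear()
            model.clear()
        assert sut.items == sorted(model), (name, arg, sut.items)


def drive(seed, runs=300, max_len=40):
    """Stand-in driver using random bytes; a real fuzzer would mutate by coverage."""
    rng = random.Random(seed)
    for _ in range(runs):
        n = rng.randint(0, max_len)
        fuzz_one(bytes(rng.randrange(256) for _ in range(n)))
```

The decoder plays the role of a PBT generator and the assertion plays the role of the property, which is why the boundary between the two techniques is described as blurry.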

Practical Usage and Hand-Rolled Approaches

Many practitioners report success with lightweight, custom property tests:
- Manual generators seeded by PRNGs with logged seeds for replay.
- BFS/“beam search” style progression from simple to complex cases as a rough substitute for shrinking.
- Using slow but simple reference implementations as property oracles for optimized versions.
Some give up on heavy frameworks due to complexity, missing features, or poor experience and rely on manual randomized tests plus hand-written unit tests.
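
A minimal sketch of that lightweight style is shown below, combining a seeded PRNG, a reported seed for replay, and a slow reference oracle. The maximum-subarray functions are a hypothetical stand-in for "optimized implementation vs simple reference", and the size-ordered loop is a simpler substitute for the BFS or beam-search progression mentioned above, not an implementation of it.

```python
import random


def max_subarray_fast(xs):
    """Optimized version (Kadane's algorithm); the empty subarray counts as 0."""
    best = cur = 0
    for x in xs:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best


def max_subarray_slow(xs):
    """Slow but obviously correct oracle: try every contiguous subarray."""
    best = 0
    for i in range(len(xs)):
        total = 0
        for j in range(i, len(xs)):
            total += xs[j]
            best = max(best, total)
    return best


def check(seed, max_size=12, per_size=50):
    """Return None on success, or (seed, input) for the first disagreement.

    Sizes grow from 0 upward, so the first failure found tends to be a
    small one even though nothing is shrunk.
    """
    rng = random.Random(seed)
    for size in range(max_size + 1):
        for _ in range(per_size):
            xs = [rng.randint(-size, size) for _ in range(size)]
            if max_subarray_fast(xs) != max_subarray_slow(xs):
                return seed, xs
    return None
```

Logging the returned seed alongside the failing input is enough to replay a failure deterministically, for example by calling `check` again with the same seed.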

Tooling Ecosystem and Adoption Barriers

Comments mention a wide range of libraries across ecosystems (QuickCheck variants, Hedgehog, Hypothesis, PropEr, clojure.spec, CsCheck, Quviq QuickCheck, AWS Shuttle, Coyote, etc.), but many lack full feature sets (stateful/parallel models, good shrinking).
The complexity of splittable RNGs, advanced shrinking, and sophisticated generators is noted as a reason many libraries remain minimal.
A key barrier is training and organizational cost: advanced PBT requires specialized knowledge, careful generators, and maintenance; teams worry about onboarding, misuse, and flaky or inscrutable tests.
Some argue that, for many real-world systems, especially large stateful ones, the extra complexity and runtime cost are hard to justify over simpler tests and fuzzing.

Reproducibility and Research Practices

There is a side thread about research reproducibility: whether papers that rely on proprietary tools (e.g., commercial QuickCheck variants) should be required to provide open or at least reviewer-accessible artifacts.
Opinions range from “reproducibility is foundational” to “strict requirements would block otherwise valuable papers,” with mention of conferences experimenting with artifact evaluation tracks as a middle ground.
