Why wordfreq will not be updated

Impact of generative AI on language data

  • Many commenters see the open web as increasingly “polluted” by LLM‑generated text, making it infeasible to build corpora that reliably represent human language use after 2021.
  • Commenters worry about feedback loops: models overuse certain words (“delve”, “seamless”, etc.), that usage spreads to humans and then flows back into training data, amplifying quirks and flattening style.
  • Some argue this is just another form of language evolution; others fear “model collapse” where both AI and human language converge into generic, low‑information prose.

Data access and enclosure

  • Formerly key text sources (Twitter/X, Reddit) have locked down or monetized their APIs, often explicitly to sell data for AI training.
  • This dual pressure—AI slop on the open web plus paywalled “clean” data—makes reproducing projects like wordfreq in the old way infeasible.
  • Several commenters liken pre‑AI text to low‑background steel: a scarce, valuable resource from before the contamination began.

Possible responses and alternatives

  • Proposals:
    • Maintain curated whitelists of human‑only sources (not publicly listed).
    • “Vintage” or “handmade” data services; scanning old books, microfiche, pre‑LLM archives.
    • Badges or metadata marking non‑AI content (with skepticism about enforceability).
    • Fork wordfreq anyway to study AI’s impact on language rather than avoid it.
  • Others retreat to smaller or older media: RSS, self‑hosted blogs, niche forums, books printed before ~2020, offline apps and “private internets”.

Detecting AI vs human text

  • Ideas include using word‑frequency fingerprints, perplexity, or GAN‑style discriminator models to spot LLM output.
  • Counter‑arguments: detection is a moving target; models can be tuned to evade tests; statistical tests are noisy and confounded by genre, topic, and natural language change.
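The word‑frequency‑fingerprint idea can be sketched with nothing more than a token counter: measure how often a text uses words that commenters claim LLMs overuse, and flag texts above some cutoff. The marker list and threshold below are illustrative assumptions, not a validated detector, and the sketch also illustrates the counter‑argument: such a test is trivially confounded by topic and easily evaded by changing the word list a model favors.

```python
import re
from collections import Counter

# Hypothetical marker words commenters claim LLMs overuse;
# the set and the threshold are illustrative, not validated.
MARKERS = {"delve", "seamless", "tapestry", "crucial", "landscape"}

def marker_rate(text: str) -> float:
    """Fraction of tokens in `text` that fall in the marker set."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[w] for w in MARKERS) / len(tokens)

def looks_llm_like(text: str, threshold: float = 0.01) -> bool:
    """Flag text whose marker-word rate exceeds an arbitrary cutoff."""
    return marker_rate(text) > threshold
```

A human text about, say, interior design would legitimately use “seamless” and “landscape” often, which is exactly the genre/topic confound raised above.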

Broader concerns and attitudes

  • Strong pessimism about “enshittification” of the web, search, commerce, and social feeds, with AI seen as an accelerant atop long‑standing SEO/content‑farm problems.
  • Others report encountering little obvious AI slop and emphasize LLMs as practical tools that enable tasks (e.g., obscure software configuration) that would otherwise be prohibitively time‑consuming.
  • Debate is polarized: some view anti‑AI sentiment as Luddite or ego‑driven; others see AI as inherently political, reshaping jobs, truth, and control over culture and data.