Why wordfreq will not be updated
Impact of generative AI on language data
- Many see the open web as increasingly “polluted” by LLM‑generated text, making it impossible to build corpora that represent human language use post‑2021.
- Commenters worry about feedback loops: models overuse certain words (“delve”, “seamless”, etc.), that usage spreads to humans, and then flows back into model training, amplifying quirks and flattening style.
- Some argue this is just another form of language evolution; others fear “model collapse” where both AI and human language converge into generic, low‑information prose.
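The feedback-loop worry above can be illustrated with a toy simulation (an illustrative assumption, not a model of any real LLM): if each training "generation" re-estimates word probabilities from text that slightly over-samples already-frequent words, the distribution sharpens and its entropy falls, i.e., style flattens.

```python
import math

def entropy(dist):
    """Shannon entropy (bits) of a word-probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical starting frequencies, chosen only for illustration.
dist = {"delve": 0.4, "seamless": 0.3, "robust": 0.15, "nuance": 0.1, "quirk": 0.05}
initial_entropy = entropy(dist)

for generation in range(5):
    # Rich-get-richer bias: weight each word by p**1.2, mimicking a
    # model that slightly prefers words it has already seen often.
    biased = {w: p ** 1.2 for w, p in dist.items()}
    total = sum(biased.values())
    dist = {w: p / total for w, p in biased.items()}

final_entropy = entropy(dist)
print(f"{initial_entropy:.3f} -> {final_entropy:.3f}")  # entropy shrinks
```

The exponent and starting frequencies are arbitrary; the point is only that a self-reinforcing sampling bias concentrates probability mass on a few words over repeated generations.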
Data access and enclosure
- Past key text sources (Twitter/X, Reddit) have locked down or monetized APIs, often explicitly to sell data for AI training.
- This dual pressure—AI slop on the open web plus paywalled “clean” data—makes reproducing projects like wordfreq in the old way infeasible.
- Several commenters liken this to low‑background steel or pre‑atomic datasets: pre‑AI text becomes a scarce, valuable resource.
Possible responses and alternatives
- Proposals:
  - Maintain curated whitelists of human‑only sources (not publicly listed).
  - “Vintage” or “handmade” data services; scanning old books, microfiche, and pre‑LLM archives.
  - Badges or metadata marking non‑AI content (with skepticism about enforceability).
  - Forking wordfreq anyway to study AI’s impact on language rather than avoiding it.
- Others retreat to smaller or older media: RSS, self‑hosted blogs, niche forums, books printed before ~2020, offline apps and “private internets”.
Detecting AI vs human text
- Ideas include using word‑frequency fingerprints, perplexity, or GAN‑style discriminator models to spot LLM output.
- Counter‑arguments: detection is a moving target; models can be tuned to evade tests; statistical tests are noisy and confounded by genre, topic, and natural language change.
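The word‑frequency‑fingerprint idea can be sketched crudely (a sketch, not a reliable detector; the marker list, baseline rate, and threshold are all assumptions for illustration, and the counter‑arguments above apply in full): compare the rate of alleged LLM "favorite" words in a text against a rough human baseline.

```python
import re
from collections import Counter

# Assumed marker words and baseline rate -- illustrative values only.
MARKERS = {"delve", "seamless", "tapestry", "multifaceted", "crucial"}
BASELINE_RATE = 0.0005  # assumed human rate of marker words per token

def marker_rate(text):
    """Fraction of tokens in `text` that are marker words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    hits = sum(counts[w] for w in MARKERS)
    return hits / max(len(tokens), 1)

def looks_generated(text, ratio=10.0):
    """Flag text whose marker rate exceeds the baseline by `ratio`x."""
    return marker_rate(text) > BASELINE_RATE * ratio
```

Note that this is exactly the kind of test the counter‑arguments target: genre and topic shift the baseline, and a model tuned to avoid the marker list evades it entirely.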
Broader concerns and attitudes
- Strong pessimism about “enshittification” of the web, search, commerce, and social feeds, with AI seen as an accelerant atop long‑standing SEO/content‑farm problems.
- Others report encountering less AI slop than the pessimists describe, and emphasize LLMs as practical tools that make otherwise prohibitively time‑consuming tasks (e.g., obscure software configuration) feasible.
- Debate is polarized: some view anti‑AI sentiment as Luddite or ego‑driven; others see AI as inherently political, reshaping jobs, truth, and control over culture and data.