Why wordfreq will not be updated

Impact of generative AI on language data

  • Many commenters see the open web as increasingly “polluted” by LLM‑generated text, making it infeasible to build corpora that reliably represent human language use after 2021.
  • Commenters worry about feedback loops: models overuse certain words (“delve”, “seamless”, etc.), that usage spreads to humans and then flows back into training data, amplifying quirks and flattening style.
  • Some argue this is just another form of language evolution; others fear “model collapse” where both AI and human language converge into generic, low‑information prose.

Data access and enclosure

  • Formerly key text sources (Twitter/X, Reddit) have locked down or monetized their APIs, often explicitly to sell data for AI training.
  • This dual pressure—AI slop on the open web plus paywalled “clean” data—makes reproducing projects like wordfreq in the old way infeasible.
  • Several commenters liken pre‑AI text to low‑background steel: a scarce, valuable resource from before the contamination began.

Possible responses and alternatives

  • Proposals:
    • Maintain curated whitelists of human‑only sources (not publicly listed).
    • “Vintage” or “handmade” data services; scanning old books, microfiche, pre‑LLM archives.
    • Badges or metadata marking non‑AI content (with skepticism about enforceability).
    • Fork wordfreq anyway to study AI’s impact on language rather than avoid it.
  • Others retreat to smaller or older media: RSS, self‑hosted blogs, niche forums, books printed before ~2020, offline apps and “private internets”.

Detecting AI vs human text

  • Ideas include using word‑frequency fingerprints, perplexity, or GAN‑style discriminator models to spot LLM output.
  • Counter‑arguments: detection is a moving target; models can be tuned to evade tests; statistical tests are noisy and confounded by genre, topic, and natural language change.
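The word‑frequency‑fingerprint idea can be sketched with nothing more than a token counter: measure how often a text uses words that commenters claim LLMs overuse, and flag texts above some cutoff. The marker list and threshold below are illustrative assumptions, not a validated detector, and the sketch also illustrates the counter‑argument: such a test is trivially confounded by topic and easily evaded by changing the word list a model favors.

```python
import re
from collections import Counter

# Hypothetical marker words commenters claim LLMs overuse;
# the set and the threshold are illustrative, not validated.
MARKERS = {"delve", "seamless", "tapestry", "crucial", "landscape"}

def marker_rate(text: str) -> float:
    """Fraction of tokens in `text` that fall in the marker set."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[w] for w in MARKERS) / len(tokens)

def looks_llm_like(text: str, threshold: float = 0.01) -> bool:
    """Flag text whose marker-word rate exceeds an arbitrary cutoff."""
    return marker_rate(text) > threshold
```

A human text about, say, interior design would legitimately use “seamless” and “landscape” often, which is exactly the genre/topic confound raised above.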

Broader concerns and attitudes

  • Strong pessimism about “enshittification” of the web, search, commerce, and social feeds, with AI seen as an accelerant atop long‑standing SEO/content‑farm problems.
  • Others report encountering little obvious AI slop and emphasize LLMs as practical tools that enable tasks (e.g., obscure software configuration) that would otherwise be prohibitively time‑consuming.
  • Debate is polarized: some view anti‑AI sentiment as Luddite or ego‑driven; others see AI as inherently political, reshaping jobs, truth, and control over culture and data.