Wikimedia Enterprise – APIs for LLMs, AI Training, and More

Scope and Purpose of Wikimedia Enterprise

  • The Enterprise offering sits alongside the free database dumps and existing APIs.
  • Main value: real-time, machine-readable content with SLAs, stable contracts, support, and additional inferred data layers (e.g., “breaking news” signals, citation-quality metrics).
  • Intended customers: large-scale consumers like search engines and LLM trainers who need reliability and structured data rather than raw dumps or Parsoid/EventStreams.

Free vs Paid Access and Data Quality

  • Dumps and free APIs exist but are described as painful, incomplete, fragile, and not designed as first-class products (e.g., raw SQL dumps, missing some derived data, mirror issues).
  • Some see a reasonable “pay for convenience and guarantees” model; others fear underinvestment or deliberate degradation of free tooling to upsell Enterprise.
  • Concern that infobox APIs are already paywalled and that this may discourage moving infobox data into free Wikidata.

Licensing, LLM Training, and Attribution

  • Wikipedia content is mainly CC-BY-SA; Wikidata is CC0; Commons varies.
  • Consensus in thread: Enterprise does not change licenses; all license obligations still apply.
  • Creative Commons materials suggest large-scale text/data mining may be fair use in the US, but legal status is unsettled.
  • Attribution for LLMs is unclear: per-article or per-editor attribution is impractical; Enterprise responses include editor/version metadata to help.
  • No revenue sharing with contributors; Creative Commons licenses are not designed for royalties. Some accept this; others feel used.
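
To make the attribution point concrete, here is a minimal sketch of how editor/version metadata could be turned into a per-article attribution line. The record type and all field names (`ArticleVersion`, `revision_id`, `editor`, and so on) are hypothetical and chosen for illustration; they are not the actual Enterprise response schema, and the example values are made up.

```python
from dataclasses import dataclass


@dataclass
class ArticleVersion:
    # Hypothetical record: field names are illustrative only and do not
    # mirror any real Wikimedia Enterprise response format.
    title: str
    url: str
    revision_id: int
    editor: str
    license_name: str
    license_url: str


def attribution_line(v: ArticleVersion) -> str:
    """Build a human-readable attribution string from version metadata."""
    return (
        f'"{v.title}" ({v.url}), revision {v.revision_id}, '
        f"last edited by {v.editor}, "
        f"licensed under {v.license_name} ({v.license_url})"
    )


# Made-up example values, for demonstration only.
example = ArticleVersion(
    title="Example article",
    url="https://en.wikipedia.org/wiki/Example",
    revision_id=123456789,
    editor="ExampleUser",
    license_name="CC BY-SA 4.0",
    license_url="https://creativecommons.org/licenses/by-sa/4.0/",
)
print(attribution_line(example))
```

This works for a single retrieved article. It does not address the harder case raised in the thread, where a trained model's output cannot be traced back to specific articles or editors.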

Funding, WMF Finances, and Incentives

  • Debate over whether WMF genuinely “needs” more money vs. growing bureaucracy.
  • Criticism of fundraising banners that imply Wikipedia might “die” while WMF runs large surpluses, funds grants, and pays substantial salaries.
  • Counterpoints: hosting is only a small cost; most spending is on engineers, legal defense, compliance, and ecosystem support; interest on reserves alone is insufficient.
  • Worry that dependence on Enterprise revenue could create long-term misalignment and “enshittification.”

Value and Reliability of Wikimedia Projects

  • Wikipedia is seen as the primary ML asset; many also regard Commons, Wikidata, and Wiktionary as highly useful, though some projects are described as “ghost towns.”
  • Discussion of Wikipedia’s epistemic model: no original research, reliance on “reliable secondary sources,” consensus processes, and neutrality policies; recognition of strengths and limitations, especially on contentious topics.

AI Use and RAG

  • Suggestions to build LLMs or RAG systems over Wikipedia with proper citations.
  • Acknowledgment that RAG over Wikipedia reduces but does not eliminate hallucinations.
  • Practical difficulty of enforcing CC rules when models occasionally reproduce near-verbatim content.
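
The two ideas above, answering from retrieved passages with a citation and checking output for near-verbatim reuse, can be sketched in a few lines. This is a toy illustration under stated assumptions: the `Passage` type, the term-overlap scorer, the citation format, and the 5-gram overlap measure are all hypothetical choices, not a method proposed in the thread, and a real system would use a proper retriever and a real corpus.

```python
import re
from dataclasses import dataclass


@dataclass
class Passage:
    title: str        # article title, used in the citation
    revision_id: int  # hypothetical revision identifier
    text: str


def tokens(s: str) -> list[str]:
    """Lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", s.lower())


def retrieve(query: str, corpus: list[Passage], k: int = 1) -> list[Passage]:
    """Toy retriever: rank passages by count of distinct shared query terms."""
    q = set(tokens(query))
    ranked = sorted(
        corpus, key=lambda p: len(q & set(tokens(p.text))), reverse=True
    )
    return ranked[:k]


def cite(p: Passage) -> str:
    """Illustrative citation format; the license label is hard-coded here."""
    return f"[{p.title}, rev {p.revision_id}, CC BY-SA]"


def ngram_overlap(output: str, source: str, n: int = 5) -> float:
    """Fraction of the output's word n-grams that also occur in the source."""
    def grams(s: str) -> set[tuple[str, ...]]:
        t = tokens(s)
        return {tuple(t[i:i + n]) for i in range(len(t) - n + 1)}

    out = grams(output)
    if not out:
        return 0.0
    return len(out & grams(source)) / len(out)


# Made-up two-passage corpus for demonstration.
corpus = [
    Passage("Alpha", 1, "Alpha is the first letter of the Greek alphabet."),
    Passage("Beta", 2, "Beta is the second letter of the Greek alphabet."),
]
top = retrieve("What is the second Greek letter?", corpus)[0]
print(f"{top.text} {cite(top)}")
print(ngram_overlap(top.text, top.text))
```

A high overlap score flags text that closely copies a source passage, which is where attribution and share-alike obligations are hardest to ignore. A low score does not show that an output is free of license obligations.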