2024-05-21

Wikimedia Enterprise – APIs for LLMs, AI Training, and More

Scope and Purpose of Wikimedia Enterprise

Enterprise offering sits alongside free database dumps and existing APIs.
Main value: real-time, machine-readable content with SLAs, stable contracts, support, and additional inferred data layers (e.g., “breaking news” signals, citation-quality metrics).
Intended customers: large-scale consumers like search engines and LLM trainers who need reliability and structured data rather than raw dumps or Parsoid/EventStreams.

Free vs Paid Access and Data Quality

Dumps and free APIs exist but are described as painful, incomplete, fragile, and not designed as first-class products (e.g., raw SQL dumps, missing some derived data, mirror issues).
Some see a reasonable “pay for convenience and guarantees” model; others fear underinvestment or deliberate degradation of free tooling to upsell Enterprise.
Concern that infobox APIs are already paywalled and that this may discourage moving infobox data into free Wikidata.

Licensing, LLM Training, and Attribution

Wikipedia content is mainly CC-BY-SA; Wikidata is CC0; Commons varies.
Consensus in thread: Enterprise does not change licenses; all license obligations still apply.
Creative Commons materials suggest large-scale text/data mining may be fair use in the US, but legal status is unsettled.
How LLMs should attribute is unclear: per-article or per-editor attribution is seen as impractical at training scale, though Enterprise responses include editor/version metadata that could help.
No revenue sharing with contributors; Creative Commons is not designed for royalties. Some accept this; others feel used.
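
As a rough illustration of how editor/version metadata could feed an attribution line, here is a minimal sketch. The `ArticleVersion` record and its field names are hypothetical, chosen for this example; they are not the Enterprise response schema, and the title, URL, revision, and editor values are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArticleVersion:
    """Hypothetical metadata record; field names are assumptions for illustration."""
    title: str
    url: str
    revision_id: int
    editor: str
    license_name: str
    license_url: str


def attribution_line(version: ArticleVersion) -> str:
    """Format a human-readable credit naming the title, source, revision, editor, and license."""
    return (
        f'"{version.title}" ({version.url}), revision {version.revision_id}, '
        f"last edited by {version.editor}, "
        f"licensed under {version.license_name} ({version.license_url})"
    )


# Placeholder values only; none of these refer to a real article or editor.
example = ArticleVersion(
    title="Example article",
    url="https://example.org/wiki/Example_article",
    revision_id=12345,
    editor="ExampleEditor",
    license_name="CC BY-SA 4.0",
    license_url="https://creativecommons.org/licenses/by-sa/4.0/",
)
print(attribution_line(example))
```

This only shows the formatting step. Whether a line like this satisfies the license for model training or model output is exactly the open question raised in the thread.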

Funding, WMF Finances, and Incentives

Debate over whether WMF genuinely “needs” more money or is mainly funding a growing bureaucracy.
Criticism of fundraising banners that imply Wikipedia might “die” while WMF runs large surpluses, funds grants, and pays substantial salaries.
Counterpoints: hosting is only a small cost; most spending is on engineers, legal defense, compliance, and ecosystem support; interest on reserves alone would not cover these costs.
Worry that dependence on Enterprise revenue could create long-term misalignment and “enshittification.”

Value and Reliability of Wikimedia Projects

Wikipedia seen as the primary ML asset; Commons, Wikidata, and Wiktionary also regarded as highly useful by many, though some projects are described as “ghost towns.”
Discussion of Wikipedia’s epistemic model: no original research, reliance on “reliable secondary sources,” consensus processes, and neutrality policies; recognition of strengths and limitations, especially on contentious topics.

AI Use and RAG

Suggestions to build LLMs or RAG systems over Wikipedia with proper citations.
Acknowledgment that RAG over Wikipedia improves but does not eliminate hallucinations.
Practical difficulty of enforcing CC rules when models occasionally reproduce near-verbatim content.
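
The points above can be made concrete with a toy sketch of retrieval with citations, plus a crude check for near-verbatim reuse. Everything here is an assumption for illustration: the in-memory `PASSAGES` corpus uses placeholder titles and URLs rather than real article text, retrieval is plain word overlap rather than the embedding search a real RAG system would likely use, and the 5-gram overlap measure is one simple heuristic, not an established test for license compliance.

```python
import re
from collections import Counter

# Toy corpus; titles, URLs, and text are placeholders, not real article excerpts.
PASSAGES = [
    {"title": "Example A", "url": "https://example.org/a",
     "text": "Alpha particles are helium nuclei emitted in radioactive decay."},
    {"title": "Example B", "url": "https://example.org/b",
     "text": "Beta decay converts a neutron into a proton and emits an electron."},
]


def tokens(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, passages, k=1):
    """Return up to k passages ranked by word overlap with the query."""
    query_counts = Counter(tokens(query))
    scored = []
    for passage in passages:
        overlap = sum((query_counts & Counter(tokens(passage["text"]))).values())
        scored.append((overlap, passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for score, passage in scored[:k] if score > 0]


def build_prompt(query, passages, k=1):
    """Assemble a prompt that lists numbered sources so the answer can cite them."""
    hits = retrieve(query, passages, k)
    sources = "\n".join(
        f"[{i}] {p['title']} ({p['url']}): {p['text']}"
        for i, p in enumerate(hits, start=1)
    )
    return (
        "Answer using only the sources below and cite them by number.\n"
        f"{sources}\nQuestion: {query}"
    )


def shared_ngram_fraction(output, source, n=5):
    """Fraction of the output's word n-grams that also appear in the source."""
    out, src = tokens(output), tokens(source)
    out_grams = {tuple(out[i:i + n]) for i in range(len(out) - n + 1)}
    if not out_grams:
        return 0.0
    src_grams = {tuple(src[i:i + n]) for i in range(len(src) - n + 1)}
    return len(out_grams & src_grams) / len(out_grams)


print(build_prompt("What is beta decay?", PASSAGES))
```

Grounding the prompt in numbered sources makes citations checkable, but the model can still misstate what a source says, which matches the thread's caveat that RAG reduces rather than removes hallucinations. A high `shared_ngram_fraction` between a model's answer and a retrieved passage is one cheap signal that the output is close to verbatim.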
