Wikimedia Enterprise – APIs for LLMs, AI Training, and More
Scope and Purpose of Wikimedia Enterprise
- Enterprise offering sits alongside free database dumps and existing APIs.
- Main value: real-time, machine-readable content with SLAs, stable contracts, support, and additional inferred data layers (e.g., “breaking news” signals, citation-quality metrics).
- Intended customers: large-scale consumers like search engines and LLM trainers who need reliability and structured data rather than raw dumps or Parsoid/EventStreams.
Free vs Paid Access and Data Quality
- Dumps and free APIs exist but are described as painful, incomplete, fragile, and not designed as first-class products (e.g., raw SQL dumps, missing some derived data, mirror issues).
- Some see a reasonable “pay for convenience and guarantees” model; others fear underinvestment or deliberate degradation of free tooling to upsell Enterprise.
- Concern that infobox APIs are already paywalled and that this may discourage moving infobox data into free Wikidata.
Licensing, LLM Training, and Attribution
- Wikipedia content is mainly CC-BY-SA; Wikidata is CC0; Commons varies.
- Consensus in thread: Enterprise does not change licenses; all license obligations still apply.
- Creative Commons materials suggest large-scale text/data mining may be fair use in the US, but legal status is unsettled.
- Attribution for LLMs is unclear: per-article or per-editor attribution is impractical; Enterprise responses include editor/version metadata to help.
- No revenue sharing with contributors; Creative Commons is not designed for royalties. Some accept this; others feel used.
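To make the attribution point above more concrete, here is a minimal sketch of turning per-article metadata into a credit line. The field names (`title`, `url`, `revision_id`, `editor`) and the `ArticleMeta` type are hypothetical placeholders, not the actual Enterprise response schema. The default license string only reflects the thread's note that Wikipedia text is mainly CC-BY-SA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArticleMeta:
    """Hypothetical subset of per-article metadata (not the real Enterprise schema)."""
    title: str
    url: str
    revision_id: int
    editor: str
    license_name: str = "CC BY-SA"  # assumption: the article text is under CC BY-SA


def attribution_line(meta: ArticleMeta) -> str:
    """Return a one-line credit naming the source, revision, editor, and license."""
    return (
        f'"{meta.title}" ({meta.url}), revision {meta.revision_id}, '
        f"last edited by {meta.editor}, licensed under {meta.license_name}"
    )


if __name__ == "__main__":
    example = ArticleMeta(
        title="Example article",
        url="https://example.org/wiki/Example_article",  # placeholder URL
        revision_id=123456,
        editor="ExampleUser",
    )
    print(attribution_line(example))
```

Whether a line like this satisfies the license for a given use is a separate question that the thread leaves open. The sketch only shows that editor and version metadata can be carried along mechanically.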
Funding, WMF Finances, and Incentives
- Debate over whether WMF genuinely “needs” more money or is mainly funding a growing bureaucracy.
- Criticism of fundraising banners that imply Wikipedia might “die” while WMF runs large surpluses, funds grants, and pays substantial salaries.
- Counterpoints: hosting is only a small share of costs; most spending goes to engineers, legal defense, compliance, and ecosystem support; interest on reserves alone would not cover these expenses.
- Worry that dependence on Enterprise revenue could create long-term misalignment and “enshittification.”
Value and Reliability of Wikimedia Projects
- Wikipedia seen as the primary ML asset; Commons, Wikidata, and Wiktionary also regarded as highly useful by many, though some projects are described as “ghost towns.”
- Discussion of Wikipedia’s epistemic model: no original research, reliance on “reliable secondary sources,” consensus processes, and neutrality policies; recognition of strengths and limitations, especially on contentious topics.
AI Use and RAG
- Suggestions to build LLMs or RAG systems over Wikipedia with proper citations.
- Acknowledgment that RAG over Wikipedia improves but does not eliminate hallucinations.
- Practical difficulty of enforcing CC rules when models occasionally reproduce near-verbatim content.
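The RAG-with-citations idea above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any production system works: retrieval here is plain word overlap rather than embeddings, the `Passage` type and the corpus are made up for the example, and the URLs are placeholders. The output is a prompt that numbers its sources so a model can cite them; no model is called.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    title: str  # article title, used as the citation label
    url: str    # placeholder URL shown next to the citation
    text: str


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, passage: Passage) -> int:
    """Number of query tokens that also appear in the passage (multiset overlap)."""
    overlap = Counter(tokenize(query)) & Counter(tokenize(passage.text))
    return sum(overlap.values())


def retrieve(query: str, corpus: list[Passage], k: int = 2) -> list[Passage]:
    """Return up to k passages with a nonzero score, best first."""
    ranked = sorted(corpus, key=lambda p: score(query, p), reverse=True)
    return [p for p in ranked[:k] if score(query, p) > 0]


def build_prompt(query: str, passages: list[Passage]) -> str:
    """Assemble a prompt with numbered sources so answers can cite [1], [2], ..."""
    sources = "\n".join(
        f"[{i}] {p.title} ({p.url}): {p.text}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the numbered sources and cite them like [1].\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )


if __name__ == "__main__":
    corpus = [
        Passage("Paris", "https://example.org/Paris", "Paris is the capital of France."),
        Passage("Berlin", "https://example.org/Berlin", "Berlin is the capital of Germany."),
        Passage("Mitochondrion", "https://example.org/Mitochondrion", "The mitochondrion is an organelle."),
    ]
    question = "What is the capital of France?"
    print(build_prompt(question, retrieve(question, corpus)))
```

Grounding the prompt this way gives the model something concrete to cite, but as the thread notes, it reduces hallucinations rather than eliminating them, and it does nothing by itself to stop near-verbatim reproduction of the retrieved text.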