OSI readies controversial open-source AI definition
Scope of the OSI AI Definition
- OSI is proposing an “open source AI” definition where releasing model weights is required; releasing training data is treated as optional but beneficial.
- Some see this as a pragmatic compromise aligned with how models are actually built and used; others see it as watering down “open source” to suit large corporate sponsors.
Is Training Data Part of the “Source”?
- One camp: training data + training code + architecture are the true “source”; weights are just a compiled artifact. Without data, models are akin to binaries without source.
- Opposing camp: training data is like a development input or process log; the artifact being shared is the weights, and those are what people actually modify (via fine‑tuning).
Weights as Source vs Object Code
- Analogies used:
  - Weights as object code; training data as source; trainer as compiler.
  - Weights as ROMs or databases; inference engine as interpreter.
- Counter‑argument: companies themselves prefer to fine‑tune weights rather than retrain, so weights are the “preferred form for modification” and thus function as source.
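The counter-argument above can be made concrete: fine-tuning modifies released weights directly, with no access to the original training data. A minimal sketch, using a toy linear model in numpy (all names, data, and hyperparameters here are illustrative assumptions, not anything from an actual model release):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these weights came from a model release; the original
# training data is NOT available to us.
pretrained_w = rng.normal(size=3)

# A small task-specific dataset of our own (assumed, for illustration).
X = rng.normal(size=(32, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=32)

# Fine-tune: gradient descent on mean squared error, starting from
# the released weights rather than retraining from scratch.
w = pretrained_w.copy()
lr = 0.05
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # MSE gradient
    w -= lr * grad

loss_before = np.mean((X @ pretrained_w - y) ** 2)
loss_after = np.mean((X @ w - y) ** 2)
print(loss_after < loss_before)  # fine-tuned weights fit the new task better
```

This is the sense in which weights function as the "preferred form for modification": the artifact people actually edit is the weight tensor itself, not the upstream corpus.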
Reproducibility and Freedom
- One view: if you can’t reproduce approximately the same model from public materials, it’s not open. Cost and non‑determinism don’t change that.
- Other view: open source has never required full reproducibility of the creative process (e.g., developer thoughts); publishing the primary modifiable artifact under a free license is enough.
- Debate over whether “preferred form” should depend on current training cost; critics say that makes the definition unstable.
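The non-determinism point in the bullets above is easy to demonstrate: even with identical data and architecture, different random seeds yield different final weights. A toy sketch with a one-hidden-layer net in numpy (architecture, data, and hyperparameters are all illustrative assumptions):

```python
import numpy as np

def train(seed, X, y, steps=500, lr=0.1):
    """Train a tiny 1-hidden-layer tanh net; only the seed differs between runs."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(2, 4))
    W2 = rng.normal(scale=0.5, size=(4, 1))
    for _ in range(steps):
        h = np.tanh(X @ W1)
        err = h @ W2 - y
        gW2 = h.T @ err / len(y)                          # output-layer gradient
        gW1 = X.T @ ((err @ W2.T) * (1 - h ** 2)) / len(y)  # backprop through tanh
        W1 -= lr * gW1
        W2 -= lr * gW2
    loss = float(np.mean((np.tanh(X @ W1) @ W2 - y) ** 2))
    return np.concatenate([W1.ravel(), W2.ravel()]), loss

rng = np.random.default_rng(42)
X = rng.normal(size=(64, 2))
y = np.sin(X[:, :1]) + 0.5 * X[:, 1:]

w_a, loss_a = train(seed=1, X=X, y=y)
w_b, loss_b = train(seed=2, X=X, y=y)

print(np.allclose(w_a, w_b))  # False: same data, different weights
print(loss_a, loss_b)         # yet both runs fit the data comparably
```

This is why "reproduce approximately the same model" (functional equivalence) rather than bit-exact weights is the standard at issue in the debate.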
Governance, Branding, and Corporate Influence
- Strong distrust of OSI’s role and sponsors (Meta, Microsoft, Salesforce, etc.); accusations of corporate capture and redefining “open” to protect proprietary data moats.
- Some argue that the community, not OSI, should define “open” and suggest waiting for FSF- or Debian-style policies instead.
- Others respond that language follows common usage and legal definitions; a stricter, less-used definition will simply be ignored.
Regulation and Legal Angles
- The EU AI Act exempts “open source” systems from some burdens; if OSI calls closed‑data models “open”, commenters fear a regulatory loophole for opaque, high-risk systems.
- Disagreement over whether OSI’s definition already matches emerging legal usage, or actively reshapes it.
- Questions raised about liability when users can’t alter training data but only tweak weights.
Ethical, Safety, and Auditability Concerns
- Critics say you can’t meaningfully audit safety, bias, or test contamination without training data and alignment details.
- Others reply that current architectures are barely explainable even with full data, but concede data still matters for spotting bias, illegal content, and benchmark leakage.
- Security worries include undetectable backdoors in models and the impossibility of robustly auditing huge weight blobs.
Does “Open Source” Even Fit AI?
- Some argue the concept doesn’t map: AI has no human-readable “source code” equivalent; weights are opaque; openness might be better framed in terms of “data commons” or Creative Commons–style licensing.
- Others think the Open Source Definition could be extended to data and models with minimal changes, but warn against destabilizing a 25‑year‑old concept.
Proposed Alternatives / Terminology
- Suggestions:
  - Use terms like “open weights” instead of “open source AI” when data isn’t public.
  - Maintain a clear split between “open source” (with data) and weaker labels (without).
  - Add new AI‑specific open licenses, rather than a single grand definition.
- Some foresee a substantive split between “open source” and “free software” for AI, ending the usual F/OSS umbrella.