OSI readies controversial open-source AI definition

Scope of the OSI AI Definition

  • OSI is proposing an “open source AI” definition that requires releasing model weights but treats releasing training data as optional, though beneficial.
  • Some see this as a pragmatic compromise aligned with how models are actually built and used; others see it as watering down “open source” to suit large corporate sponsors.

Is Training Data Part of the “Source”?

  • One camp: training data + training code + architecture are the true “source”; weights are just a compiled artifact. Without data, models are akin to binaries without source (see the toy sketch after this list).
  • Opposing camp: training data is like a development input or process log; the artifact being shared is the weights, and those are what people actually modify (via fine‑tuning).
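
To make the “weights as compiled artifact” framing concrete, here is a toy sketch, with all data, names, and numbers invented for illustration: a tiny dataset plays the role of source, a gradient-descent loop plays the compiler, and the fitted weights are the shipped binary. Publishing only the final numbers reveals neither the data nor the recipe.

```python
# Toy illustration of the "compiled artifact" analogy:
# data + training code -> weights, much like source + compiler -> binary.
# Hypothetical sketch, not anyone's real pipeline.
import numpy as np

rng = np.random.default_rng(0)

# "Source": training data (noisy samples of y = 3x + 1).
X = rng.uniform(-1, 1, size=(256, 1))
y = 3 * X[:, 0] + 1 + rng.normal(0, 0.05, size=256)

# "Compiler": the training procedure (plain gradient descent on MSE).
w, b = 0.0, 0.0
for _ in range(2000):
    pred = w * X[:, 0] + b
    w -= 0.1 * 2 * np.mean((pred - y) * X[:, 0])
    b -= 0.1 * 2 * np.mean(pred - y)

# "Binary": the resulting weights. Releasing (w, b) alone discloses
# neither the samples nor the procedure that produced them.
print(f"released weights: w={w:.3f}, b={b:.3f}")
```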

Weights as Source vs Object Code

  • Analogies used:
    • Weights as object code; training data as source; trainer as compiler.
    • Weights as ROMs or databases; inference engine as interpreter.
    • Counter‑argument: companies themselves prefer to fine‑tune weights rather than retrain, so weights are the “preferred form for modification” and thus function as source (sketched below).
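
The counter-argument can be sketched the same way. Continuing the hypothetical NumPy toy above: starting from someone’s published weights, you can fine-tune on your own data without ever seeing the original training set.

```python
# Toy counterpoint: fine-tuning published weights requires no access to the
# original training data. A hypothetical sketch, not a real workflow.
import numpy as np

rng = np.random.default_rng(1)

# "Published weights" from the previous toy run (no training data attached).
w, b = 3.0, 1.0

# Your own small dataset, drawn from a slightly different target (y = 3x + 2).
X = rng.uniform(-1, 1, size=(32, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.05, size=32)

# Fine-tuning: a few gradient steps starting from the released weights.
for _ in range(200):
    pred = w * X[:, 0] + b
    w -= 0.1 * 2 * np.mean((pred - y) * X[:, 0])
    b -= 0.1 * 2 * np.mean(pred - y)

print(f"fine-tuned weights: w={w:.3f}, b={b:.3f}")  # b drifts toward 2
```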

Reproducibility and Freedom

  • One view: if you can’t reproduce approximately the same model from public materials, it’s not open. Cost and non‑determinism don’t change that (see the sketch after this list).
  • Other view: open source has never required full reproducibility of the creative process (e.g., developer thoughts); publishing the primary modifiable artifact under a free license is enough.
  • Debate over whether “preferred form” should depend on current training cost; critics say that makes the definition unstable.
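
A rough illustration of the “approximately the same model” point: two toy training runs that differ only in their random seed end with slightly different weights but near-identical behavior. This is a hedged sketch; real LLM training adds many more sources of non-determinism (GPU kernel scheduling, data ordering, distributed reduction order).

```python
# Two runs, different seeds: different weights, nearly the same model.
import numpy as np

def train(seed):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, size=(256, 1))            # fresh data sample per run
    y = 3 * X[:, 0] + 1 + rng.normal(0, 0.05, 256)
    w, b = rng.normal(), rng.normal()                # random initialization
    for _ in range(2000):
        pred = w * X[:, 0] + b
        w -= 0.1 * 2 * np.mean((pred - y) * X[:, 0])
        b -= 0.1 * 2 * np.mean(pred - y)
    return w, b

(w1, b1), (w2, b2) = train(seed=0), train(seed=1)
print(f"run 1: w={w1:.5f}, b={b1:.5f}")
print(f"run 2: w={w2:.5f}, b={b2:.5f}")  # close, but not bit-identical
```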

Governance, Branding, and Corporate Influence

  • Strong distrust of OSI’s role and sponsors (Meta, Microsoft, Salesforce, etc.); accusations of corporate capture and redefining “open” to protect proprietary data moats.
  • Some argue the community, not OSI, should define “open”, and suggest waiting for FSF or Debian-style policies instead.
  • Others respond that language follows common usage and legal definitions; a stricter, less-used definition will simply be ignored.

Regulation and Legal Angles

  • The EU AI Act exempts “open source” systems from some burdens; if OSI calls closed‑data models “open”, commenters fear a regulatory loophole for opaque, high-risk systems.
  • Disagreement over whether OSI’s definition already matches emerging legal usage, or actively reshapes it.
  • Questions raised about liability when users can’t alter training data but only tweak weights.

Ethical, Safety, and Auditability Concerns

  • Critics say you can’t meaningfully audit safety, bias, or test contamination without training data and alignment details (see the contamination sketch after this list).
  • Others reply that current architectures are barely explainable even with full data, but concede data still matters for spotting bias, illegal content, and benchmark leakage.
  • Security worries include undetectable backdoors in models and the impossibility of robustly auditing huge weight blobs.
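
As a concrete example of the auditability point: even a crude contamination check presupposes corpus access. The sketch below (hypothetical helper names, toy-sized corpus; real audits use far larger n-gram indexes) flags a benchmark item when it shares a word n-gram with a training document. Without the training data, there is nothing to intersect against.

```python
# Crude benchmark-contamination check via word n-gram overlap.
# Thresholds and corpus access are assumptions for illustration only.
def word_ngrams(text, n=5):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(benchmark_item, training_docs, n=5):
    item_grams = word_ngrams(benchmark_item, n)
    return any(item_grams & word_ngrams(doc, n) for doc in training_docs)

corpus = [
    "the quick brown fox jumps over the lazy dog near the river",
    "an unrelated document about training large language models",
]
question = "which animal jumps over the lazy dog near the river bank"
print(looks_contaminated(question, corpus))  # True: a 5-gram is shared
```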

Does “Open Source” Even Fit AI?

  • Some argue the concept doesn’t map: AI has no human-readable “source code” equivalent; weights are opaque; openness might be better framed in terms of “data commons” or Creative Commons–style licensing.
  • Others think the Open Source Definition could be extended to data and models with minimal changes, but warn against destabilizing a 25‑year‑old concept.

Proposed Alternatives / Terminology

  • Suggestions:
    • Use terms like “open weights” instead of “open source AI” when data isn’t public.
    • Maintain a clear split between “open source” (with data) and weaker labels (without).
    • Add new AI‑specific open licenses, rather than a single grand definition.
  • Some foresee a substantive split between “open source” and “free software” for AI, ending the usual F/OSS umbrella.