OSI readies controversial open-source AI definition

Scope of the OSI AI Definition

  • OSI is proposing an “open source AI” definition that requires releasing model weights but treats releasing training data as optional, though beneficial.
  • Some see this as a pragmatic compromise aligned with how models are actually built and used; others see it as watering down “open source” to suit large corporate sponsors.

Is Training Data Part of the “Source”?

  • One camp: training data + training code + architecture are the true “source”; weights are just a compiled artifact. Without data, models are akin to binaries without source (see the toy sketch after this list).
  • Opposing camp: training data is like a development input or process log; the artifact being shared is the weights, and those are what people actually modify (via fine‑tuning).
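
To make the “weights as compiled artifact” framing concrete, here is a toy sketch, with all data, names, and numbers invented for illustration: a tiny dataset plays the role of source, a gradient-descent loop plays the compiler, and the fitted weights are the shipped binary. Publishing only the final numbers reveals neither the data nor the recipe.

```python
# Toy illustration of the "compiled artifact" analogy:
# data + training code -> weights, much like source + compiler -> binary.
# Hypothetical sketch, not anyone's real pipeline.
import numpy as np

rng = np.random.default_rng(0)

# "Source": training data (noisy samples of y = 3x + 1).
X = rng.uniform(-1, 1, size=(256, 1))
y = 3 * X[:, 0] + 1 + rng.normal(0, 0.05, size=256)

# "Compiler": the training procedure (plain gradient descent on MSE).
w, b = 0.0, 0.0
for _ in range(2000):
    pred = w * X[:, 0] + b
    w -= 0.1 * 2 * np.mean((pred - y) * X[:, 0])
    b -= 0.1 * 2 * np.mean(pred - y)

# "Binary": the resulting weights. Releasing (w, b) alone discloses
# neither the samples nor the procedure that produced them.
print(f"released weights: w={w:.3f}, b={b:.3f}")
```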

Weights as Source vs Object Code

  • Analogies used:
    • Weights as object code; training data as source; trainer as compiler.
    • Weights as ROMs or databases; inference engine as interpreter.
    • Counter‑argument: companies themselves prefer to fine‑tune weights rather than retrain, so weights are the “preferred form for modification” and thus function as source (sketched below).
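
The counter-argument can be sketched the same way. Continuing the hypothetical NumPy toy above: starting from someone’s published weights, you can fine-tune on your own data without ever seeing the original training set.

```python
# Toy counterpoint: fine-tuning published weights requires no access to the
# original training data. A hypothetical sketch, not a real workflow.
import numpy as np

rng = np.random.default_rng(1)

# "Published weights" from the previous toy run (no training data attached).
w, b = 3.0, 1.0

# Your own small dataset, drawn from a slightly different target (y = 3x + 2).
X = rng.uniform(-1, 1, size=(32, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.05, size=32)

# Fine-tuning: a few gradient steps starting from the released weights.
for _ in range(200):
    pred = w * X[:, 0] + b
    w -= 0.1 * 2 * np.mean((pred - y) * X[:, 0])
    b -= 0.1 * 2 * np.mean(pred - y)

print(f"fine-tuned weights: w={w:.3f}, b={b:.3f}")  # b drifts toward 2
```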

Reproducibility and Freedom

  • One view: if you can’t reproduce approximately the same model from public materials, it’s not open. Cost and non‑determinism don’t change that (see the sketch after this list).
  • Other view: open source has never required full reproducibility of the creative process (e.g., developer thoughts); publishing the primary modifiable artifact under a free license is enough.
  • Debate over whether “preferred form” should depend on current training cost; critics say that makes the definition unstable.
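
A rough illustration of the “approximately the same model” point: two toy training runs that differ only in their random seed end with slightly different weights but near-identical behavior. This is a hedged sketch; real LLM training adds many more sources of non-determinism (GPU kernel scheduling, data ordering, distributed reduction order).

```python
# Two runs, different seeds: different weights, nearly the same model.
import numpy as np

def train(seed):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, size=(256, 1))            # fresh data sample per run
    y = 3 * X[:, 0] + 1 + rng.normal(0, 0.05, 256)
    w, b = rng.normal(), rng.normal()                # random initialization
    for _ in range(2000):
        pred = w * X[:, 0] + b
        w -= 0.1 * 2 * np.mean((pred - y) * X[:, 0])
        b -= 0.1 * 2 * np.mean(pred - y)
    return w, b

(w1, b1), (w2, b2) = train(seed=0), train(seed=1)
print(f"run 1: w={w1:.5f}, b={b1:.5f}")
print(f"run 2: w={w2:.5f}, b={b2:.5f}")  # close, but not bit-identical
```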

Governance, Branding, and Corporate Influence

  • Strong distrust of OSI’s role and sponsors (Meta, Microsoft, Salesforce, etc.); accusations of corporate capture and redefining “open” to protect proprietary data moats.
  • Some argue the community, not OSI, should define “open”, and suggest waiting for FSF or Debian-style policies instead.
  • Others respond that language follows common usage and legal definitions; a stricter, less-used definition will simply be ignored.

Regulation and Legal Angles

  • The EU AI Act exempts “open source” systems from some burdens; if OSI calls closed‑data models “open”, commenters fear a regulatory loophole for opaque, high-risk systems.
  • Disagreement over whether OSI’s definition already matches emerging legal usage, or actively reshapes it.
  • Questions raised about liability when users can’t alter training data but only tweak weights.

Ethical, Safety, and Auditability Concerns

  • Critics say you can’t meaningfully audit safety, bias, or test contamination without training data and alignment details (see the contamination sketch after this list).
  • Others reply that current architectures are barely explainable even with full data, but concede data still matters for spotting bias, illegal content, and benchmark leakage.
  • Security worries include undetectable backdoors in models and the impossibility of robustly auditing huge weight blobs.
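
As a concrete example of the auditability point: even a crude contamination check presupposes corpus access. The sketch below (hypothetical helper names, toy-sized corpus; real audits use far larger n-gram indexes) flags a benchmark item when it shares a word n-gram with a training document. Without the training data, there is nothing to intersect against.

```python
# Crude benchmark-contamination check via word n-gram overlap.
# Thresholds and corpus access are assumptions for illustration only.
def word_ngrams(text, n=5):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(benchmark_item, training_docs, n=5):
    item_grams = word_ngrams(benchmark_item, n)
    return any(item_grams & word_ngrams(doc, n) for doc in training_docs)

corpus = [
    "the quick brown fox jumps over the lazy dog near the river",
    "an unrelated document about training large language models",
]
question = "which animal jumps over the lazy dog near the river bank"
print(looks_contaminated(question, corpus))  # True: a 5-gram is shared
```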

Does “Open Source” Even Fit AI?

  • Some argue the concept doesn’t map: AI has no human-readable “source code” equivalent; weights are opaque; openness might be better framed in terms of “data commons” or Creative Commons–style licensing.
  • Others think the Open Source Definition could be extended to data and models with minimal changes, but warn against destabilizing a 25‑year‑old concept.

Proposed Alternatives / Terminology

  • Suggestions:
    • Use terms like “open weights” instead of “open source AI” when data isn’t public.
    • Maintain a clear split between “open source” (with data) and weaker labels (without).
    • Add new AI‑specific open licenses, rather than a single grand definition.
  • Some foresee a substantive split between “open source” and “free software” for AI, ending the usual F/OSS umbrella.