Open source AI

A brief-ish explainer of open source artificial intelligence.

Posted: January 27, 2025

Estimated read time: 11 minutes

What is it? ✨

Open source artificial intelligence means an AI system is fully accessible and freely available to review and re-purpose.

This includes:

Data
Frameworks
Language models
Development tools

Why it matters 🚨

As AI becomes more integrated into our everyday lives, it’s important to understand how these systems are trained and the algorithms used to make decisions or generate information.

It’s also important to make AI accessible to build and adopt for more than just those who have the financial capacity to do so.

This is especially important for governments, who may use these systems to make decisions that impact the general public.

Pros 🌟

Open source AI supports:

Safety/security
Accessibility
Transparency
Innovation
Collaboration
Cost savings

Cons ⚠️

Those against open source AI argue:

It’s less safe.
There’s little to no support.
Weaker quality controls mean poor models get used.

License 🪪

Traditional open source licenses apply to software.

But because AI has many moving parts (code, data, models), it requires a deeper definition.

Open Source Initiative:

“AI and specifically machine learning systems are not simply software programs but they blend boundaries with data, configuration options, documentation and new artifacts, like weights and biases.”

The definition 📖

As with open source software, there are many interpretations of open source AI, which leads to confusion (and open washing).

OSI is leading an effort to co-design an open source AI definition (aka OSAID) based on the foundational open source definition. Version 1.0:

An AI system made available under terms and in a way that grant the freedoms to:

Use the system for any purpose and without having to ask for permission.
Study how the system works and inspect its components.
Modify the system for any purpose, including to change its output.
Share the system for others to use with or without modifications, for any purpose.

This includes:

Data information (a detailed description of the data used to train the system)
Code (used to run the system)
Parameters (weights, configurations)
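
As a rough illustration, the checklist above can be sketched in Python. The names used here (Release, FREEDOMS, COMPONENTS, missing_for_osaid) are hypothetical and invented for this example; they are not part of OSAID or any OSI tooling. A real assessment depends on reading the actual license terms, not on boolean flags.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the four freedoms and three components listed above.
FREEDOMS = ("use", "study", "modify", "share")
COMPONENTS = ("data", "code", "parameters")


@dataclass
class Release:
    """A simplified, illustrative description of an AI system release."""
    name: str
    freedoms: set = field(default_factory=set)    # freedoms the terms grant
    components: set = field(default_factory=set)  # components made available


def missing_for_osaid(release: Release) -> list:
    """Return the freedoms and components the release lacks, in list order."""
    missing = [f"freedom:{f}" for f in FREEDOMS if f not in release.freedoms]
    missing += [f"component:{c}" for c in COMPONENTS if c not in release.components]
    return missing


# Example (made up): a weights-only release that restricts some uses
# and does not document its training data.
weights_only = Release(
    name="example-model",
    freedoms={"study", "modify", "share"},
    components={"code", "parameters"},
)
print(missing_for_osaid(weights_only))  # ['freedom:use', 'component:data']
```

An empty list would mean the release ticks every box in this simplified model; anything else names what is missing.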

Weights 🏋️

Not fully satisfied with OSAID, the Open Source Alliance unveiled a draft Open Weight Definition (aka OWD) to address access to model weights.

Sam Johnston, Open Source Alliance:

“We opted to base the draft Open Weight Definition on the tried and tested Open Source Definition because it addresses vendors who aspire to label their products as Open Source but are not yet willing or able to deliver the data — and for AI, the data is the source.”

It’s complicated 🙀

The authors of “Open (For Business): Big Tech, Concentrated Power, and the Political Economy of Open AI” sum up the confusion around open source AI:

“We find that the terms ‘open’ and ‘open source’ are used in confusing and diverse ways, often constituting more aspiration or marketing than technical descriptor, and frequently blending concepts from both open source software and open science.”
“This complicates an already complex landscape, in which there is currently no agreed on definition of ‘open’ in the context of AI, and as such the term is being applied to widely divergent offerings with little reference to a stable descriptor.”

For example, Meta’s Llama is released under a Meta-created license that Meta describes as open source.

Meta:

“We agree with our partner the OSI on many things, but we, like others across the industry, disagree with their new definition.”
“There is no single open source AI definition, and defining it is a challenge because previous open source definitions do not encompass the complexities of today’s rapidly advancing AI models.”
“We make Llama free and openly available, and our license and acceptable use Policy help keep people safe by having some restrictions in place.”

Mozilla:

“We acknowledge that some may disagree with aspects of OSI’s definition, such as its treatment of training data, and that the definition will need refinement over time.”
“However, we believe that the OSI’s community-driven process — which involved over a year of stakeholder engagement — has established a crucial reference point for discussions on open source AI.”

Plus ethics 😇

The Digital Public Goods Alliance and UNICEF created a community of practice (CoP) “to examine the intersection of ethical AI and open source.”

“Through collaboration and knowledge sharing, this group will delve into the complex considerations associated with AI development, its democratisation (i.e open-sourcing underlying data and models), and identify strategies for how responsible open-source AI systems can be digital public goods.”
“It is crucial to take action in order to establish a framework that guides developers, funders, and implementers in their work on open-source AI, harnesses its potential for positive impact, and ensures that AI digital public goods contribute to a more inclusive and sustainable future for the whole world.”
“The CoP needs to explore measures for responsible AI, including how to provide incentives and motivation to instill responsible practices and norms for use in open-source AI communities to prevent harm and how to adopt novel technical approaches to monitor downstream use.”

Free AI 🦅

The Free Software Foundation is working on freedom in machine learning applications.

FSF:

“The FSF’s conclusion is focused mainly on what must be distributed to users of an ML application so that they are able to control their own computing.”
“Such an ML application could be called a free (or libre) machine learning application.”
“All software included in a free ML application has to offer every user the four freedoms that define free software.”
“Given our current understanding of ML applications, we believe that we cannot say a ML application is free unless all its training data and the related scripts for processing it respect all users, following the four freedoms.”
“ML applications that do not offer the four freedoms to all users are, by definition, nonfree, even if their software components are free.”

And the FSF says Meta’s Llama isn’t free.

Open source AI models

The Linux Foundation AI + Data project has developed an open source AI models definition to ensure “consistent understanding and collaboration across our community and the open source ecosystem.”

Deeper: The Model Openness Framework

Open washing AI 🧼

As with open source software, not all open source AI is created equal. Projects that market themselves as open source may not meet the full criteria of the OSI definition.

So, like greenwashing – marketing a product as ‘green’ when it’s not – there will be inevitable claims of AI open washing.

Pilot 🛫

Led by the U.S. National Science Foundation, the National AI Research Resource (NAIRR) Pilot is a “vision for a shared national research infrastructure for responsible discovery and innovation in AI.”

One focus area – NAIRR Open – supports “open AI research by providing access to diverse AI resources via the NAIRR Pilot Portal and coordinated allocations.”

The AI Now Institute is skeptical:

“While the NAIRR claims to ‘democratize’ access to AI resources as a way to contend with tech companies’ concentrated power, in reality it would work to expand and entrench the power and control these companies.”

Open training data 📊

OSI cites four classes of data required for open source AI:

Open
Public
Obtainable
Unshareable non-public

Data information must include:

the complete description of all data used for training, how the data was obtained and selected, the labeling procedures, and data processing and filtering methodologies
a listing of all publicly available training data and where to obtain it
a listing of all training data obtainable from third parties and where to obtain it, including for fee
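
One way to picture these requirements is as a simple record per dataset. The sketch below is an illustration only: the Dataset fields, the documentation_gaps helper, the sample dataset names and the example.org URL are all made up for this example, and the rule that every class except unshareable data needs a location is an assumption, not OSI wording.

```python
from dataclasses import dataclass
from enum import Enum


class DataClass(Enum):
    """The four classes of training data named above."""
    OPEN = "open"
    PUBLIC = "public"
    OBTAINABLE = "obtainable"
    UNSHAREABLE = "unshareable non-public"


@dataclass
class Dataset:
    """Hypothetical record describing one training dataset."""
    name: str
    data_class: DataClass
    description: str = ""  # how it was obtained, selected, labeled, filtered
    location: str = ""     # where to obtain it (may be a paid source)


def documentation_gaps(datasets):
    """List datasets whose records omit something the list above asks for."""
    gaps = []
    for d in datasets:
        if not d.description:
            gaps.append(f"{d.name}: missing description")
        # Assumption: every class except unshareable data needs a location.
        if d.data_class is not DataClass.UNSHAREABLE and not d.location:
            gaps.append(f"{d.name}: missing location")
    return gaps


# Made-up datasets for illustration.
datasets = [
    Dataset("crawl-sample", DataClass.PUBLIC, "Filtered web text",
            "https://example.org/crawl-sample"),
    Dataset("licensed-news", DataClass.OBTAINABLE, "Licensed articles, deduplicated"),
    Dataset("internal-logs", DataClass.UNSHAREABLE),
]
print(documentation_gaps(datasets))
# ['licensed-news: missing location', 'internal-logs: missing description']
```

Note that even the unshareable dataset still needs a description: the data itself can stay private, but how it was gathered and processed cannot.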

Code 🧑‍💻

Open source programming languages used to build AI:

Data 📈

Open source AI datasets:

Frameworks 🖼️

Open source frameworks:

Community 🤗

Hugging Face, considered the “GitHub for AI,” is a platform for hosting datasets, models and applications.

“We are on a mission to democratize good machine learning, one commit at a time.”
“We build open-source resources to empower you to easily integrate AI into your products and workflows.”
“We are convinced that AI can be accessible, optimized, and responsible.”

Government agencies/projects on Hugging Face:

Kaggle is a community of “21M+ machine learners” that hosts competitions, datasets, code and models.

And, of course, GitHub and GitLab host repositories for open source AI development.

Beyond open: Public AI 🛣️

Comparing AI to public infrastructure and utilities – roads, electricity – the authors of the paper “An Alternative to Regulation: The Case for Public AI” advocate for “publicly accessible AI models funded, provisioned, and governed by governments or other public bodies.”

“Public AI projects share a goal of creating AI systems that are built, governed, and operated in accordance with shared values and for the public’s benefit.”
“They also provide a mechanism to ensure public funding for AI research results into public benefit rather than private capture.”
“Open code and open weights do not guarantee transparent access to model outputs, but publicly provisioned models can.”

Note:

This is not a licensing model, but more of a collaborative way of building AI between the public and private sectors.
“Ultimately, Public AI is an approach, a vision, an ethos.”

What they say 💬

IBM + NASA:

“IBM and NASA ❤️ Open Source AI”
“By embracing the principles of open AI and open science, IBM and NASA are actively contributing to the global mission of promoting knowledge sharing and accelerating innovations.”

OSI:

“For AI, society needs at least the same essential freedoms of Open Source to enable AI developers, deployers and end users to enjoy those same benefits: autonomy, transparency, frictionless reuse and collaborative improvement.”

Mark Zuckerberg, Meta:

“I believe that open source is necessary for a positive AI future.”
“Open source will ensure that more people around the world have access to the benefits and opportunities of AI, that power isn’t concentrated in the hands of a small number of companies, and that the technology can be deployed more evenly and safely across society.”

Jason Perlow, ZDNET:

“Supporting open-source AI communities will be essential for promoting ethical and innovative AI developments, benefiting individual projects, and advancing technology responsibly.”

This just in 🚨

DeepSeek’s open source approach and claims of a “significant breakthrough in inference speed” have put the tech industry (and the market) on high alert.

More to come on this front, but it’s another indicator that we’re just at the beginning of open source AI.

To be continued 💭