Open source AI: Why the tech industry is arguing about what it even is


“Open source” seems to be the new buzzword in AI. Companies like Meta and Google feel obliged to release open-source language models, while Elon Musk is suing OpenAI for refusing to release GPT-4 and its successors. At the same time, a growing number of start-ups and prominent figures in the AI scene are positioning themselves as open source advocates. The fundamental problem: nobody can agree on what “open source AI” even means, and the answer could prove crucial for the future of the industry, and possibly for all of humanity.


At first glance, open source AI promises a future in which everyone can participate in the development of cutting-edge technology. This could accelerate innovation, increase transparency and give users more control over systems that could soon change many aspects of our lives. But what makes an AI model open source, and what doesn’t? Until the tech industry agrees on a definition, powerful corporations can easily bend the concept to suit their own needs, and it could even become a tool that consolidates rather than limits the dominance of today’s leading players.

The Open Source Initiative (OSI) acts as a kind of referee here and is considered the guardian of the open source idea. The nonprofit organization, founded in 1998, has established a widely accepted set of rules that determine whether software can be considered open source or not. The group recently brought together a 70-person team of researchers, lawyers, policymakers, activists and representatives from major technology companies such as Meta, Google and Amazon. Together, they aim to develop a working definition of open source AI.

The open source community is very diverse, encompassing everyone from small hacktivists to Fortune 500 companies. While there is broad agreement on the overarching principles, says Stefano Maffulli, executive director of the OSI, it is becoming increasingly clear that “the devil is in the details.” With so many competing interests at play, it is no easy task to find a solution that satisfies everyone while guaranteeing that the largest companies play fair. Meanwhile, the lack of a clear definition has hardly prevented corporations from adopting and stretching the term.


In July last year, for example, Meta made its Llama 2 model, which the company itself describes as open source, freely accessible, and it has since published several other AI tools in the same way. “We support OSI’s efforts to define open source AI,” said Jonathan Torres, Meta’s legal vice president for AI, open source and licensing, adding that the company looks forward to continuing to participate in the process “for the benefit of the open source community around the world.” This stands in clear contrast to its competitor OpenAI, which has revealed fewer and fewer technical details about its leading models over the years, always citing security concerns. “We only release powerful AI models after we have carefully weighed the benefits and risks,” said a spokesperson, referring to the potential for misuse and the effects on society.

Other leading AI companies such as Stability AI and the German firm Aleph Alpha have also released models described as open source, and Hugging Face hosts a large library of freely available AI models. While Google keeps its most powerful models, such as Gemini and PaLM 2, closed, it has now made Gemma freely accessible. Gemma is designed to compete with Meta’s Llama 2. However, Google does not call it “open source”; according to the internet giant, the model is merely “open.”

There is considerable disagreement about what “open” means here. For one thing, both Llama 2 and Gemma come with licenses that restrict what users can do with the models. That fundamentally contradicts open source principles: one of the key clauses of the OSI definition prohibits restrictions based on use cases. And even for models that are not tied to such conditions, the criteria remain vague. The concept of open source was originally developed to ensure that developers can use, study, modify and distribute software without restriction. AI systems, however, work fundamentally differently, so key concepts from the open source world cannot simply be transferred to artificial intelligence, says Maffulli.

One of the biggest hurdles is the sheer number of technical components that make up today’s AI models. To tinker with conventional software, all you need is the underlying source code. But depending on the objective, working on an AI model can require access to the pre-trained model, its training data, the source code for pre-processing that data, the code for the training process itself, the architecture underlying the model, and a variety of other, more subtle details. “Which components you need in order to sensibly understand and change models is left to interpretation. But we have established which basic freedoms or rights we want to exercise,” says Maffulli. How to implement them, however, remains unclear.
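To make that asymmetry concrete, here is a minimal sketch, assuming the Hugging Face transformers library; the model identifier is a hypothetical placeholder, not a real checkpoint. It illustrates what a typical weights-only release lets you do, and which components stay out of reach.

```python
# Minimal sketch: what a typical weights-only "open" release gives you.
# Assumes the Hugging Face "transformers" library; the model identifier
# below is a hypothetical placeholder, not a real checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "example-org/open-model-7b"  # hypothetical released checkpoint

# These calls fetch only what the publisher chose to release:
# the architecture definition (config), the weights, and the tokenizer.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# With weights and architecture alone, you can run inference ...
inputs = tokenizer("Open source AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# ... but nothing here exposes the training data, the pre-processing
# code, or the training run itself -- the very components the OSI
# definition debate is about.
```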


Resolving this debate will be crucial if the AI community wants to reap the same benefits that developers have gained from “normal” open source software, says the OSI chief, because those benefits rest on a broad consensus about what the term means. “A definition that is respected and accepted by a large part of the industry creates clarity,” he says. And clarity means lower costs for complying with open source rules, less friction and a shared understanding of the technology. The problem: that alone will probably not be enough. “The biggest sticking point by far is the data. All major AI companies have simply released pre-trained models without the datasets they were trained on.” For those who advocate a stricter definition of open source AI, this significantly limits its usefulness; some even say such models are no longer open source at all.

Other members of the community argue that a simple description of the data is often enough to understand a model, and that you don’t necessarily have to train it from scratch to make changes. Finished models are already routinely adjusted through a process known as fine-tuning, in which they are additionally trained on a smaller, often application-specific dataset. Meta’s Llama 2 is a good example of this, says Roman Shaposhnik, CEO of the open source AI company Ainekko and vice president of legal affairs at the Apache Software Foundation, which is involved in the OSI process. Although Meta only released a pre-trained model, a thriving community of developers has downloaded it, customized it and shared its changes with others. “People use it in all sorts of projects. There’s a whole ecosystem around Llama 2,” he says. “So we have to redefine it. Is it ‘half open’ perhaps?”
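For illustration, here is a minimal sketch of how such fine-tuning commonly looks in practice, assuming the Hugging Face transformers, datasets and peft libraries; the base checkpoint and the training file are hypothetical placeholders. The point: a developer can adapt a released model on their own data without ever seeing the original training corpus.

```python
# Fine-tuning sketch: adapting a released checkpoint on your own data,
# without access to the corpus it was pre-trained on. Assumes the
# Hugging Face "transformers", "datasets" and "peft" libraries; the
# model ID and training file are hypothetical placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_ID = "example-org/open-model-7b"  # hypothetical released checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # common for causal LMs

model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Attach small trainable LoRA adapters; the released weights stay frozen.
model = get_peft_model(
    model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# Tokenize a small, application-specific dataset of your own.
data = load_dataset("json", data_files="my_domain_texts.jsonl")["train"]
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned",
                           num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Share only the small set of adapter weights with others.
model.save_pretrained("finetuned-adapter")
```

Adapter-based approaches like LoRA are popular here precisely because they train and share only a small set of added parameters, which is one reason an ecosystem of modified Llama 2 variants could grow without anyone needing Meta’s training data.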


It may be technically possible to adapt a model without the original training data. But withholding access to such an important component of the software is not in the spirit of open source, says Zuzanna Warso, research director at the non-profit organization Open Future, which is also working on the OSI definition. It is also questionable whether one can truly exercise the freedom to study a model without knowing what information it was built on. “It’s a crucial part of the whole process,” she says. “If we care about openness, we should also care about the openness of the training data.”
