Wednesday, August 27, 2025

Mandatory open-sourcing

A thought experiment: What if every sufficiently expensive machine learning model was required to immediately be open-sourced? This would mean that weights, code for running the model, and comprehensive details about the training procedure would be made available to everyone. Perhaps also the training data. Sufficiently expensive could mean a model that cost a million dollars or more to train.

AI safety people should love this idea, because it removes the race dynamic. OpenAI, Anthropic, Google, and their ilk would no longer be locked in a race to develop the biggest and best model, because there would be no obvious economic benefit to pushing the frontier when everyone would immediately have access to your shiny new model. Yes, curiosity-based research would continue (as it should), but there would be no economic sense in investing billions in it. So foundation model development would slow down. From my reading of the room, a great many people would think this a good idea. Even most of the people doing the foundation model training.


Mandatory open-sourcing should also improve safety and security generally. It is not a coincidence that most cybersecurity stacks build on open source software. When everyone has access to the software and can probe it in their own ways, security problems are easier to find. The same should reasonably be true for foundation models. The current situation, where the companies who develop a foundation model retain exclusive access to the weights, does not guarantee safety or security in any way. The foundation model developers do not have all the relevant expertise in the various ways a model could pose safety problems, and they do not have aligned incentives.


Of course, researchers of all stripes would love an open source mandate. We love to take things apart, poke at them from unexpected directions, and find things we weren't sure we were looking for. Lots of good ideas come from this kind of poking around, and lots of understanding as well.


The most important argument for mandatory open-sourcing, however, is the moral argument. Large language models and other foundation models derive their power from what they were trained on, and what they were trained on is most of humanity’s cultural output. So their power comes from us. The leading LLMs have almost certainly learned from something you’ve written, unless you are a pure lurker who never posts anywhere. So you should co-own these models with me and billions of other people. They were made from humanity and belong to humanity.


Is this communism? No, it’s a butterfly. Seriously though, I think this is eminently compatible with a capitalist system. By making a key infrastructure layer (foundation models) open to all, we unleash complete freedom in the application layer. Anyone can host these models, tune them, and modify them any way they like–and make money on the products they build on top of the models. You could therefore see mandatory open-sourcing as a pro-competition policy.


What if someone uses an open-sourced model to help develop a new virus or bomb or something? That would be bad. But the situation would not be markedly different from today, when the best open-sourced models are approximately three to six months behind the best closed-sourced models, capability-wise. And remember, there is no actual new knowledge in these models. If the model knows about something, that information is available somewhere else as well. Typically in the scientific literature.


An open source mandate would ideally need an international agreement to back it up. But that really only requires the USA to start by implementing this mandate unilaterally. The Chinese frontier model developers open-source their best models anyway, and have less training hardware, so China should be happy to sign an agreement if the US does. And no other country currently hosts frontier model developers. For the international agreement to be successful, you don’t even need all developed countries to sign up, you just need the vast majority of the world’s GDP represented. Not being able to sell access to your closed-source models in most countries would make development of large closed-source models a waste of money.


Now, enforcing a mandate that expensive models are open-sourced might seem very hard. What’s stopping a rich company from training a giant and expensive model and simply not telling anyone about it? Economics, mostly, at least to the extent that your business model relies on selling access to the model, however indirectly.


Which brings us to an alternative, or perhaps rather complementary, means of achieving essentially the same goal: legal liability. There are a number of ongoing court cases regarding the liability of model developers and access providers in cases of copyright infringement and other types of damages or injuries, such as misinformation or even incitement to suicide. One option would be tougher liability requirements for closed-source models. Another would be to place all the liability with the model developer for closed-source models, but leave it with the entity that sells access to the end user in the case of open-source models. In either case, the effect would be to make it severely economically unappetizing to develop a frontier model and not open-source it.


Alas, I am under no illusions that an open source mandate will actually happen. Too many billions have been invested in closed source model developers, and a dominant stream of AI safety thinking has convinced much of the field that safety through obscurity is the way to go. So I'm really just posting this here as a thought experiment. Your token usage may vary.
