Sure. But that's not going to be released. The term open source AI cannot be expected to cover it because it's not practical.
The synthetic part of the training data could be released.
Who? It's not their data.
Open training data is hard to the point of impracticality. It requires excluding private and proprietary data.
Meanwhile, the term "open source" is massively popular. So it will get used. The question is how.
Meta et al would love for the choice to be between, on one hand, open weights only, and, on the other hand, open training data, because the latter is impractical. That dichotomy guarantees that when someone says open source AI they'll mean open weights. (The way open source software, today, generally means source available, not FOSS.)
Here's the source of the disagreement. You're justifying the use of the term "open source" by saying it's logical for Meta to want to use it because of its popularity and the layman's (incorrect) understanding of it.
The other person is saying it doesn't matter how convenient it is or how much Meta wants to use it: the term "open source" is misleading for a product where the "source" is the training data and the final product has onerous restrictions on use.
This would be like Adobe giving Photoshop away for free, but for personal use only and not for making ads for Adobe's competitors. Sure, Adobe likes it and most users may be fine with it, but it isn't open source.
>The way open source software, today, generally means source available, not FOSS.
I don't agree with that. When a company says "open source" but it's not free, the tech community is quick to call it "source available" or "open core".
Right, so the onus is on Facebook/Meta to get that right; then they could call something Open Source. Until then, they should find another name that doesn't already have a specific meaning.
> (The way open source software, today, generally means source available, not FOSS.)
No, but it's heading that way. Open Source, today, still means that the things you need to build a project are publicly available for you to download and run on your own machine, provided you have the means to do so. What you're thinking of is literally called "Source Available", which is very different from "Open Source".
The intent of Open Source is for people to be able to reproduce the work themselves, with modifications if they want to. Is that something you can do today with the various Llama models? No, because one core part of the project's "source code" (what you need to reproduce it from scratch), the training data, is being held back and kept private.
you are playing very loosely with terms that have specific, widely accepted definitions (e.g. https://opensource.org/osd )
I don't get why you think it would be useful to call LLMs with published weights "open source"