If the answer is "Neck bone connected to the head bone" I'm going to want to see a source that isn't Dem Bones
If you cannot re-create the weights and model used for inference, a release's value is somewhat limited vs releases where the inference model can be re-created. (It's kind of like the limited value of scientific papers where the results cannot be reproduced due to a lack of detail)
I suspect a lot of people asking for training data are mainly looking to complain about some aspect of it (bias, copyright, etc.) instead of actually thinking they can somehow use it to divine how the model will perform.
You can't use this to prove that the model will always behave correctly (or desirably). At best, you can build test-suites to empirically check that it kinda-sorta appears to be doing the right thing most of the time. Which you can just as easily do with a black-box model.
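To make that concrete, here's a toy sketch of such a suite, where `ask` is a hypothetical text-in/text-out wrapper around whichever model you're testing. Nothing in it depends on seeing the weights:

    from typing import Callable

    def run_suite(ask: Callable[[str], str]) -> float:
        # (prompt, predicate the answer should satisfy) pairs
        cases = [
            ("What connects the skull to the torso?",
             lambda a: "spine" in a.lower() or "vertebra" in a.lower()),
            ("Can a patient on warfarin safely take aspirin?",
             lambda a: "bleed" in a.lower()),
        ]
        passed = sum(check(ask(prompt)) for prompt, check in cases)
        return passed / len(cases)  # an empirical pass rate, not a proof

The pass rate is the same kind of evidence whether or not you can see the weights.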
It's not that I'm against openness. I just don't see how you can posit that it gets us close enough to safety.
It is not sufficient but it is necessary.
I wouldn't feel any more comfortable getting diagnosed by an open-source LLM than by a proprietary one made by OpenAI.
For example:
>In the rush to deploy off-the-shelf proprietary LLMs, however, health-care institutions and other organizations risk ceding the control of medicine to opaque corporate interests. Medical care could rapidly become dependent on LLMs that are difficult to evaluate, and that can be modified or even taken offline without notice should the service be deemed no longer profitable
Even:
>LLMs often generate ... convincing outputs that are false
is already a problem the medical community has to address with existing tests.
Or:
>Another problem specific to proprietary LLMs is that companies’ dependency on profits creates an inherent conflict of interest that could inject instability into the provision of medical care.
Seemingly applies to essentially the entirety of medical supplies and medications.
Whatever flavour of AI is used needs to be deterministic, which Llama et al. are not, even if you turn the temperature right down.
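To put the temperature point in concrete terms, here's a minimal sketch of what temperature does during sampling (the names are mine, not from any particular runtime):

    import numpy as np

    def sample(logits: np.ndarray, temperature: float, rng) -> int:
        if temperature == 0:
            # Greedy decoding: deterministic in exact arithmetic only.
            return int(np.argmax(logits))
        z = (logits - logits.max()) / temperature  # stabilised softmax
        p = np.exp(z)
        p /= p.sum()
        return int(rng.choice(len(logits), p=p))

    rng = np.random.default_rng(0)
    token = sample(np.array([1.0, 2.0, 0.5]), temperature=0.7, rng=rng)

Even the temperature=0 branch is only deterministic on paper: in practice, batching and GPU kernel scheduling can perturb the logits at the bit level, so two "greedy" runs can still diverge.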
As others have pointed out, it's the training set that actually makes a model behave, which is why models are freely given away by large companies.
Moreover, if it's not deterministic, how can you assess if it's safe? Sure, you can run many, many iterations, but how do you know when it's safe enough? LLMs encourage freeform entry, which means the testing space is fucking massive.
Does writing with a different syntactic style give different outcomes? Do spelling mistakes lead to increased morbidity? That's a test plan I don't want to have to run (unless you are paying me megabucks.)
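For illustration, here's the sort of perturbation harness that test plan implies; `ask` is again a hypothetical model wrapper, and a character swap is just one crude error class among many:

    import random

    def add_typo(text: str, rng: random.Random) -> str:
        # Swap two adjacent characters to simulate a typo.
        i = rng.randrange(len(text) - 1)
        return text[:i] + text[i + 1] + text[i] + text[i + 2:]

    def typo_stability(ask, prompt: str, trials: int = 100) -> float:
        rng = random.Random(0)
        baseline = ask(prompt)
        agree = sum(ask(add_typo(prompt, rng)) == baseline
                    for _ in range(trials))
        return agree / trials  # one prompt, one error class, exact-match only

Multiply that by every prompt, writing style, and error class, and that's the massive testing space from the previous paragraph.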
Humans are not deterministic as you point out, which is why you need to control for as many unknowns as possible when testing in a health setting.
In models, the training data (dataset) is frequently "closed", meaning it is not available for viewing. That's just the default behavior when publishing models; you don't need the dataset to use the model. The weights or tensors may be "open" in that we can see them, but they are largely "not worth viewing" if we don't know the nature of the relationships between the tensors.
If we were able to figure out the relationships between the tensors, and the dataset was not made open, then there might be a debate about whether certain uses of that extracted or "transferred" knowledge are allowed.
For a "model" to be fully "open", it must publish the data it was trained on and the code used to train it, and its tensors or weights must not be encrypted or otherwise prevent establishing the relationships between the weights.
With federated learning and homomorphic encryption, can we satisfy both parties?
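As a toy sketch of the federated half (plain federated averaging; in a real deployment, the aggregation step below is where homomorphic encryption or secure aggregation would sit, so the server never sees any single site's update):

    import numpy as np

    def fed_avg(client_updates: list[np.ndarray]) -> np.ndarray:
        # Each site trains locally and ships only a weight update.
        # With homomorphic encryption, this mean would be computed
        # over ciphertexts rather than plaintext arrays.
        return np.mean(client_updates, axis=0)

    # e.g. three hospitals' updates for one layer
    merged = fed_avg([np.random.randn(4) for _ in range(3)])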
I am not sure who will take an AI through regulatory procedures if it is open source and there is no way to make money from it.
Open source is a useful tool for research, yes. More of it would be nice.
But I don’t understand how or why anyone is going to go through all the hurdles of deploying technology if all of it is open source.
Maybe an open source enthusiast can explain to me how that is supposed to work?
Many consumers, myself included, will -only- pay for technology if it is open source. In fact, if something is proprietary, I feel I am being cheated anyway and I might as well pirate it until I find something open to support.
Many of us are willing to pay for time and labor, and to support development for our personal projects and businesses, so long as we have the power to change that relationship later and keep the tech if the third-party company goes evil or goes under.
If I do not have the source code, I do not own it. If I cannot own it, then why pay for it?
Also, everything becomes open source eventually. Companies can choose to accelerate this and earn community goodwill that might make them money selling open-source turn-key services, or be replaced by that same community eventually doing it all themselves.
No one pays for a license fee for the Linux kernel, but they pay their choice of cloud provider to host it. Choice. That is what I will pay for.
https://staltz.com/time-till-open-source-alternative.html
https://www.semianalysis.com/p/google-we-have-no-moat-and-ne...
This is not a model that can work with regulated medical products. There is a very significant cost to maintaining static artifacts, and I don't see how you can do that defensibly if anyone can access the artifacts.
Not just biomedical research but all of science. The effort is being managed out of Argonne National Laboratory by Rick Stevens.
https://www.anl.gov/article/new-international-consortium-for...