The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”
IANAL, but this basically sounds like LLaMA was trained on illegally obtained books by Meta's own admission. It's an exciting development that Meta is releasing a commercial-use version of the model, but I wonder if this is going to cause issues down the road. It's not like Meta can remove these books from the training set without retraining from scratch (or at least from the last checkpoint before they were used).
And they will have much better knowledge, answers, etc. than the Western, lawyer-approved models.
Sometimes knowledge needs to be set free I guess.
At this point with the quality of current web content and the collapse of journalism as an industry I think we can say online ads have utterly failed as a replacement income stream.
Unless you want all LLMs to say “I’m sorry the data I was trained on ends in 2023” you still need a content funding model. Maybe not copyright, but sure as hell not ads either.
By some definition of "worked". If we define "worked" as "made money for", who it worked mostly for are the middlemen and a minority of writers... a minority that with the advent of LLMs is likely to shrink even further.
It was probably intended that way, but the reality is that the power has been with the publisher since the beginning, and they've absolutely been screwing over the authors as well. Only the most successful authors have gotten decent deals.
I don't have an answer to this either though, I just wanted to point out that copyright has arguably never been successful at getting money to the content creators proportional to the value the publisher extracted from the work either.
Journals get their content for free. Actually often they charge the authors for it.
Research is mainly funded by governments and taxes.
Put differently, we consider -- but don't think a whole lot about -- Wikipedia's "funding," because that's NOT the most important part/innovation of that model.
We would do better to answer: what is?
Can you give some examples of new knowledge that was copyrighted? Generally copyright is used to protect art, software and textbooks. People who produce new knowledge generally are not paid by copyright. The knowledge is either kept secret or published in a journal from which the author receives no compensation.
Since the company is obtaining + providing these models with 100% of their input data, it could be argued they have some responsibility to verify the legality of their procurement of the data.
it's in a weird place imo, with Japan ruling that anything goes for AI data, other countries are put under pressure to allow the same
ie,
you're allowed to scrape the web
you're allowed to take what you scrape and put it in a database
you're allowed to use your database to inform on decisions you might make, or content you might create
but once you put an AI model in the mix, all of a sudden there are problems. despite the fact that making the model is 10000% harder than doing all of the points mentioned above, using someone else's work somehow becomes a problem when it never was before
and if truly free and open source LLMs come into the game, then might the corporate ones become crippled from copyright? that's bad for business
They probably can:
https://github.com/zjunlp/EasyEdit
> I wonder if this is going to cause issues down the road.
There are some popular Stable Diffusion models, being run in small businesses, that I am certain have CSAM in them because they have a particular 4chan model in their merging lineage.
... And yet, it hasn't blown up yet? I have no explanation, but running "illegal" weights seems more sustainable than I would expect.
Not that I am disagreeing with you. What I find particularly disturbing are the paid services for this.
Also, I have seen 2 separate OnlyFans pimps ask for help in a text generation chatroom. Something about automating "private" texting from their "girls."
No, actually they probably can’t. There is no verifiable way to remove the data from the model apart from completely removing all instances of information from the training data. The project you linked only describes a selective finetuning approach.
... But yeah, fundamentally the only way to throw out the books is to throw out the weights.
Virtually every discussion in the LLM space right now is almost immediately bifurcated by the "can I use this commercially?" question which has a somewhat chilling effect on innovation. The best performing open source LLMs we have today are llama-based, particularly the WizardLM variants, so giving them more actual industry exposure will hopefully be a force multiplier.
The network architecture itself is not source code, but a rough specification constraining the optimizer, which searches for possible program descriptions that, within the specified constraints, minimize some loss function with respect to the data.
Neither data nor network architecture are the actual source; they are better seen as recipes which, if followed (albeit at great expense), allow finding behaviorally similar programs. As you can see, the standard ideas of open source don’t quite carry over because the actual "source code" is not human-interpretable.
That's totally fair. And you're correct in that I was making an argument for positive outcomes being orthogonal to the semantic distinction.
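To make the "architecture as a constraint on the optimizer" idea concrete, here is a toy sketch, not how any real LLM is trained. The "architecture" is assumed to be just y = w * x + b, the data is synthetic, and the function name `fit_linear` is made up for illustration. The point is only that nobody writes the final values of w and b by hand; the optimizer finds them from the data.

```python
# Toy illustration: the "architecture" fixes only the shape of the program
# (here y = w * x + b); gradient descent searches for the weights.

def fit_linear(data, lr=0.05, steps=2000):
    """Minimize mean squared error for y = w * x + b by gradient descent."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic "training set" generated by a hidden rule y = 3x + 1.
data = [(x, 3.0 * x + 1.0) for x in range(-3, 4)]
w, b = fit_linear(data)
print(w, b)  # the recovered weights approximate the hidden rule
```

The resulting (w, b) pair is the analogue of released weights: usable and tweakable, but produced by the search rather than authored directly.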
> I also believe that actual open source models have the near-term opportunity to make an impact and shape the future landscape, with red pajamas and others in the works. The distinction could be very important in the near term, at the rate this field is developing at.
I think Falcon and MPT support your point as well, but those are still models that were trained on very small budgets relative to llama or gpt-3/4. There's a clear quality delta, albeit that gap is closing. Through that lens, I think having a large, well-funded org doing the pre-training work for the OSS community and releasing the weights permissively is a net positive.
In your scenario, despite the unrealistic coding process, the machine code is the source code, because that's what everyone is working on.
In the development of LLMs, the weights are in no way the preferred form for development. Programmers don't work on weights. They work on data, infrastructure, the model, the code for training, etc. The point of machine learning is not to work on weights.
Unless you anthropomorphize optimizers, in which case the weights are indeed the preferred form of editing, but I had never seen anyone---even the most forward AGI supporters---argue that optimizers are intelligent agents.
You release your weights, others can build on top of that, fine tune it in different ways, produce new weights they can share with others. Seems very OSS-y.
I feel like there is some semantic nitpicky point being made here that is completely going over my head.
In a rough way, a NN is just a compiler designed to translate a boatload of simple data into a useful program that operates on similar data.
Content is a complement to a social network: the cheaper it is to create content, the more content is available, the easier it is to optimize a feed, the larger the time people spend in the platform, the higher the revenue. GenAI is just a method to drive the cost of content creation to zero.
It’s really the ultimate nightmare with the internet becoming just TV 3.0 in which content is controlled and curated … you just consume mindlessly.
Any attempts to create a Reddit clone.. or system in which people freely communicate is now “regulated” for “hate” speech or “terrorism”. The days of open discourse … appear to be numbered. Even email will be analyzed by AI to look for “trends” or “optimize” employee efficiency.
It really is time for a new internet.
> ... Infinite Jest, also called "the Entertainment" or "the samizdat". The film is so compelling that its viewers lose all interest in anything other than repeatedly viewing it, and thus eventually die.
https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/
From the FT article: '“The goal is to diminish the current dominance of OpenAI,” said one person with knowledge of high-level strategy at Meta.'
This is not charity, this is a shrewd business move.
My guess is still the latter because that's what I've heard the rumors about, but this article is pretty unclear on this fact.
I don't think any business would run such a "licensed" model over MPT 30B or Falcon 40B, unless it's way better than LLaMA 65B.
How can I play with open source LLMs locally?
You can leverage those big CPUs while still loading both GPUs with a 65B model.
... If you are feeling extra nice, you should set that up as an AI horde worker whenever you run koboldcpp to play with models. It will run API requests for others in the background whenever it's not crunching your own requests, in return allowing you priority access to models other hosts are running: https://aihorde.net/
https://huggingface.co/ycros/airoboros-65b-gpt4-1.4.1-PI-819...
Check the prompting syntax here, it has a huge effect on the output:
It works for 7B/13B/30B/65B LLaMA and Alpaca (fine-tuned LLaMA which definitely works better). The smaller models at least should run on pretty much any computer.
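For a rough feel of which sizes fit on which machine, here is a back-of-the-envelope sketch. It assumes the nominal parameter counts from the model names (the real counts differ a little) and counts only the weights; the KV cache, activations, and per-block quantization overhead are ignored, so real usage will be higher. The function name is made up for illustration.

```python
def weight_memory_gib(n_params_billion, bits_per_weight):
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

# Nominal sizes from the model names; 16-bit is unquantized half precision.
for size in (7, 13, 30, 65):
    row = ", ".join(
        f"{bits}-bit: {weight_memory_gib(size, bits):.1f} GiB"
        for bits in (16, 8, 4)
    )
    print(f"{size}B -> {row}")
```

By this estimate a 4-bit 7B model needs only a few GiB for weights, while a 4-bit 65B model needs around 30 GiB, which is why the bigger ones get split across GPUs and system RAM.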
Also, it has no "1 click" exe release like kobold.
I originally had 2 2080ti's to experiment also with virtio/proxmox (you need 1 for the host and 1 for any VM you run). I never got that running successfully at the time, but then Proton got really good (I mainly just wanted to run windows games fast in a VM, but that circumvented that). Later on I upgraded one of them to a 3080ti.
It's a System76 machine, they make good stuff
Well now there is a commercial release. I guess it wasn't some corporate plot after all!
Some people just can't admit when a corporation does a good thing.
(In this case, the good thing is being done to obsolete their competitors, but it is good nonetheless that a commercial LLM is available for people to use for free.)
Still waiting for the 'Meta is dying' and 'Fire Mark Zuckerberg' calls from last year. A year later, where are they now?
Does it mean that any blog posts I wrote from my own insights will automatically be used to train the model… without my permission?
As an author, it feels like it’s stealing the knowledge and insight without appropriate attribution.
It seems like the existing large platforms of today (Microsoft’s enterprise moat, Google’s ads and internet services, Meta’s social networks, Apple’s consumer and mobile products) will remain the primary platforms of the future. So having models that can operate exclusively on those platforms via integration with their key products and data will only continue this trend. If you’re an outsider with an AI model, you’ll have a harder time getting access to critical data and your standalone AI product (e.g., ChatGPT) won’t be as useful.
More broadly speaking, I believe the days when the top X largest companies in the stock market would be displaced by newer companies every decade or so are over. The FAANGs just control so many major platforms in so many aspects of our lives.
It also helps that they buy or otherwise cooperate to destroy their competition in questionable ways while heavily lobbying the gov to favor them over others in a quid-pro-quo that benefits politicians and not their constituents.
I disagree: I think big tech is hard to disrupt ATM because the companies are still young and nimble. In the last cycle, the companies being displaced were ancient (by tech standards). When Google and Facebook are 30 years old, their DNA will get in the way of adapting to a new paradigm that will change the world. A paradigm that may be to the Metaverse what the smartphone was to the Apple Newton.
Maybe that's Meta's play here? Maybe the idea is that the ecosystem around a model could be as valuable or more valuable than the model itself too, so an OSS model could benefit Meta a lot more by gaining more of the ecosystem mind share?
Or Maybe Yann LeCun is just a hippie that dreams of free love, hard drugs and open-source models?
They might have done well to make gg an offer he couldn't refuse and take on ggml and llama.cpp as an open source project.
Facebook benefits heavily from the open source development done on LLaMA. There was a report I saw that Facebook has started using llama.cpp internally for inference. Updates to the licensing will cement Facebook as the go-to choice for open source language models.
My hypothesis based on the context of Mark discussing the release is that it's going to be completely open source and can be licensed for commercial use. Not that Meta is going to add a whole new revenue side of business to compete with OpenAI. i.e. "Here is model, with commercially permissive licensing" not "Here is model that you can use commercially but must pay me"
https://www.youtube.com/watch?v=Ff4fRgnuFgQ&ab_channel=LexFr...
They can even write it as 'good will' on their financial statements.
It kind of is working.
It might feel like "brand rehab" or "good will" as a consumer, but a lot of this work was put in motion a while ago.
It seems they will release the weights under some license that allows commercial usage.
How they monetise it (which I assume they will try and do?) is an interesting question.
Maybe some variant of paying a licensing fee?
There doesn't necessarily have to be one. Facebook's goal may be to help commoditize its complements. https://gwern.net/complement
hardware is the only moat
If you want to live the good life before you are exquisitely extinguished, spend every other day figuring out how to buy more NVDA, the other days exercising outside, being human.
QLoRA is the most cost-effective method so far. Some people also do finetuning on Google TPUs.
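A rough sketch of where the savings come from, counting only trainable parameters. LoRA-style methods freeze the original weight matrix and train two small low-rank matrices next to it; QLoRA additionally keeps the frozen base model in 4-bit. The 4096 x 4096 layer and rank 8 below are illustrative assumptions, not settings taken from any paper or library, and the function names are made up.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when updating a whole d_out x d_in matrix."""
    return d_out * d_in

def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters for a low-rank update B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_out + d_in)

# Illustrative layer size and rank (assumptions, see the text above).
d_out, d_in, rank = 4096, 4096, 8
full = full_finetune_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

With these numbers the adapter trains well under 1% of the parameters in that layer, so gradients and optimizer state shrink accordingly.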
Open-source commercial?
Free as in beer vs. free as in speech and the whole thing.
If you listen to the definition the Open Source Initiative would have applied to the term open source had they succeeded in acquiring rights to the term, then commercial is redundant with open source, not the opposite of it.