Meta to release open-source commercial AI model

(zdnet.com)

177 points by maskil 2y ago | 159 comments

foob 2y ago

From the recent story about the Sarah Silverman lawsuit [1]:

The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”

IANAL, but this basically sounds like LLaMA was trained on illegally obtained books by Meta's own admission. It's an exciting development that Meta is releasing a commercial-use version of the model, but I wonder if this is going to cause issues down the road. It's not like Meta can remove these books from the training set without retraining from scratch (or at least the last checkpoint before they were used).

[1] https://news.ycombinator.com/item?id=36657540

ramshanker 2y ago

Sometimes I wonder: what if someone in country XYZ downloads the whole of Z-Library/Libgen, all the books ever printed, all the papers ever published, all the newspapers and so on, trains a model on it, and releases that model as open source? There are jurisdictions with lax rules.

Such models would have much better knowledge, answers, etc. than the Western, lawyer-approved models.

Sometimes knowledge needs to be set free I guess.

TX81Z 2y ago

The production of knowledge needs to be funded as it isn’t “free”. Copyright and licensing is one model that has worked for a long time. It has flaws, but it has produced good things.

At this point with the quality of current web content and the collapse of journalism as an industry I think we can say online ads have utterly failed as a replacement income stream.

Unless you want all LLMs to say “I’m sorry, the data I was trained on ends in 2023”, you still need a content funding model. Maybe not copyright, but sure as hell not ads either.

pmoriarty 2y ago

"Copyright and licensing is one model that has worked for a long time. It has flaws, but it has produced good things."

By some definition of "worked". If we define "worked" as "made money for", then it mostly worked for the middlemen and a minority of writers... a minority that, with the advent of LLMs, is likely to shrink even further.

411111111111111 2y ago

You state this as a fact, but it's actually much less certain whether it has ever been net-positive.

It was probably intended that way, but the reality is that the power has been with the publisher since the beginning, and they've absolutely been screwing over the authors as well. Only the most successful authors have gotten decent deals.

I don't have an answer to this either though, I just wanted to point out that copyright has arguably never been successful at getting money to the content creators in proportion to the value the publisher extracted from the work.

IshKebab 2y ago

The production of knowledge (I assume you're mainly talking about scientific research here) is absolutely not funded by copyright royalties or anything like that.

Journals get their content for free. Actually often they charge the authors for it.

Research is mainly funded by governments and taxes.

jrm4 2y ago

But again, "funding" is merely common and/or one step in the process. It's not always necessary and is definitely never sufficient, and I think when you bring it up, the mental model that people have is of the incorrect scale?

Put differently, we consider -- but don't think a whole lot about -- Wikipedia's "funding," because that's NOT the most important part/innovation of that model.

We should do a better job of answering: what is?

Vt71fcAqt7 2y ago

>The production of knowledge needs to be funded as it isn’t “free”. Copyright and licensing is one model that has worked for a long time.

Can you give some examples of new knowledge that was copyrighted? Generally copyright is used to protect art, software and textbooks. People who produce new knowledge generally are not paid by copyright. The knowledge is either kept secret or published in a journal from which the author receives no compensation.

dogma1138 2y ago

Training and copyright are going to be interesting. People can be trained on “illegally obtained” books too, yet you’ll probably be hard pressed to argue that any employee who downloaded a book or a paper from a “libre library” could be the basis for a fruit-of-the-poisonous-tree argument down the line.

l33t233372 2y ago

If the company supplied the employee with the “illegally obtained” books, that could be reason to view the situation differently than an employee acting on their own.

Since the company is obtaining + providing these models with 100% of their input data, it could be argued they have some responsibility to verify the legality of their procurement of the data.

stainablesteel 2y ago

it's not deemed illegal yet

it's in a weird place imo: with Japan ruling that anything goes for AI data, other countries are put under pressure to allow the same

i.e.,

you're allowed to scrape the web

you're allowed to take what you scrape and put it in a database

you're allowed to use your database to inform on decisions you might make, or content you might create

but once you put an AI model in the mix, all of a sudden there are problems: even though making the model is 10000% harder than everything mentioned above, using someone else's work somehow becomes a problem when it never was before

and if truly free and open source LLMs come into the game, might the corporate ones end up crippled by copyright? that's bad for business

brucethemoose2 2y ago

> It's not like Meta can remove these books from the training set without retraining from scratch (or at least the last checkpoint before they were used).

They probably can:

https://github.com/zjunlp/EasyEdit

> I wonder if this is going to cause issues down the road.

There are some popular Stable Diffusion models, being run in small businesses, that I am certain have CSAM in them because they have a particular 4chan model in their merging lineage.

... And yet, it hasn't blown up yet? I have no explanation, but running "illegal" weights seems more sustainable than I would expect.

Der_Einzige 2y ago

I’ve been wondering when the landmark moral panic would start against Civit.AI and the coomer crowd. People have no idea just how much porn is being produced by this stuff. One of the top textual inversions right now is an… age slider… (https://civitai.com/models/65214/age-slider) ewww. It’s also extremely well rated and reviewed on there. I’m terrified of the impending backlash, because depending on what happens, the party going on in AI could end.

brucethemoose2 2y ago

People have been saying this about underage hand-drawn hentai forever, but it's still around.

Not that I am disagreeing with you. What I find particularly disturbing are the paid services for this.

Also, I have seen 2 separate OnlyFans pimps ask for help in a text generation chatroom. Something about automating "private" texting from their "girls."

whimsicalism 2y ago

That is not at all the same thing as removing the books.

twayt 2y ago

> They probably can:

No, actually they probably can’t. There is no verifiable way to remove the data from the model apart from completely removing all instances of information from the training data. The project you linked only describes a selective finetuning approach.

xnx 2y ago

It's an area of active research: https://ai.googleblog.com/2023/06/announcing-first-machine-u...

brucethemoose2 2y ago

They can probably prevent LLaMA from spitting out verbatim quotes from the books well enough to make proof difficult.

... But yeah, fundamentally the only way to throw out the books is to throw out the weights.

potsandpans 2y ago

that is quite the spicy claim

wongarsu 2y ago

If we accept the argument that you can train an ML model on data scraped from the internet because the model is sufficiently transformative and thus isn't impacted by the copyright of that data, then how does that change simply because somebody else distributed the data illegally? Either the ML model breaks the copyright chain or it doesn't. Or is the argument that using data that was provided to you in violation of copyright is illegal in general?

_Algernon_ 2y ago

How is it different from training on random blogs, Stack Overflow, or "The Internet" in general?

schleck8 2y ago

Really, really bad look for Eleuther if this is true. I did not expect them to do something like this and not even see the issue with it.

Ancalagon 2y ago

Move fast and break the law.

andybak 2y ago

It's far from certain at this stage whether this does break the law.

innagadadavida 2y ago

Copyright laws should be amended to allow this scenario. If I read a book and write about it in a blog, it is considered review. Why shouldn’t we allow companies to do the same to train their models? Overall it will benefit society more than it hurts some rich authors.

TX81Z 2y ago

It worked for Uber!

Der_Einzige 2y ago

Most large datasets are full of copyrighted content. They aren’t unique.

cameldrv 2y ago

It seems difficult to argue that Meta can copy every ebook in existence to train a model, but then other people cannot copy the resulting model.

zargon 2y ago

It's not open source, it's freeware or something like that. Weights aren't the source code of LLMs, they're the binaries.

spmurrayzzz 2y ago

Maybe this is just semantics, but I don't know if the OSS-vs-freemium distinction matters all that much (I'd have to think about the potential downsides a bit more tbh).

Virtually every discussion in the LLM space right now is almost immediately bifurcated by the "can I use this commercially?" question which has a somewhat chilling effect on innovation. The best performing open source LLMs we have today are llama-based, particularly the WizardLM variants, so giving them more actual industry exposure will hopefully be a force multiplier.

zargon 2y ago

Llama isn't open source either. But if I understand your point correctly, you're saying that the commercial use axis is what is important to people, and it's orthogonal to freeware vs open source. In the present environment, I agree. But I don't think we should let companies get away with poisoning the term open source for things which are not. I also believe that actual open source models have the near-term opportunity to make an impact and shape the future landscape, with red pajamas and others in the works. The distinction could be important in the near term, at the rate this field is developing at.

Vetch 2y ago

Neural network weights are better viewed as source code because they specify what function the network computes. As we're operating purely on feed-forward networks, there are no loops. Therefore, weights fully describe everything relevant for executing their represented function on inputs. Weights can be seen as a sort of intermediate language (with lots of stored data and partially computed states) interpretable by some deep learning library.

The network architecture itself is not source code, but a rough specification constraining the optimizer, which searches for possible program descriptions that, within the specified constraints, minimize some loss function with respect to the data.

Neither the data nor the network architecture is the actual source; they are better seen as recipes which, if followed (albeit at great expense), allow finding behaviorally similar programs. As you can see, the standard ideas of open source don’t quite carry over because the actual "source-code" is not human interpretable.
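
A minimal sketch of the point above, assuming a toy two-layer threshold network (the names `forward`, `XOR_WEIGHTS`, and `AND_WEIGHTS` are made up for illustration and come from no library): the architecture code stays fixed, and swapping the weights alone changes which function gets computed.

```python
# Toy feed-forward network. The architecture (forward) is fixed; the weights
# alone decide which function it computes. All names are illustrative.

def step(x):
    """Threshold activation: 1 if the input is positive, else 0."""
    return 1 if x > 0 else 0

def forward(weights, inputs):
    """Run a fixed 2-input, 2-hidden-unit, 1-output network with the given weights."""
    hidden = [
        step(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights["w1"], weights["b1"])
    ]
    return step(sum(w * h for w, h in zip(weights["w2"], hidden)) + weights["b2"])

# Hand-picked weights (an assumption for the demo, not trained values).
XOR_WEIGHTS = {
    "w1": [[1, 1], [1, 1]],  # hidden unit 0 behaves like OR, unit 1 like AND
    "b1": [-0.5, -1.5],
    "w2": [1, -1],           # output fires for "OR and not AND"
    "b2": -0.5,
}
AND_WEIGHTS = {
    "w1": [[1, 1], [0, 0]],  # hidden unit 0 behaves like AND, unit 1 is unused
    "b1": [-1.5, -1.0],
    "w2": [1, 0],
    "b2": -0.5,
}

def truth_table(weights):
    """Outputs for inputs (0,0), (0,1), (1,0), (1,1), in that order."""
    return [forward(weights, [a, b]) for a in (0, 1) for b in (0, 1)]
```

`truth_table(XOR_WEIGHTS)` and `truth_table(AND_WEIGHTS)` run the identical `forward` code and differ only in the numbers passed in, which is the sense in which the weights act as the program while the architecture only constrains which programs are expressible.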

spmurrayzzz 2y ago

> But I don't think we should let companies get away with poisoning the term open source for things which are not.

That's totally fair. And you're correct in that I was making an argument for positive outcomes being orthogonal to the semantic distinction.

> I also believe that actual open source models have the near-term opportunity to make an impact and shape the future landscape, with red pajamas and others in the works. The distinction could be very important in the near term, at the rate this field is developing at.

I think Falcon and MPT support your point as well, but those are still models that were trained on very small budgets relative to llama or gpt-3/4. There's a clear quality delta, albeit that gap is closing. Through that lens, I think having a large, well-funded org doing the pre-training work for the OSS community and releasing the weights permissively is a net positive.

axus 2y ago

Sen. Marsha Blackburn said “fair use” protections have become a “fairly useful way to steal” intellectual property. Some people would like to use this situation to get rid of "fair use".

slimebot80 2y ago

Forgive my ignorance, but might it matter if a country was hoping to limit another country's advancement into weaponising AI?

whimsicalism 2y ago

Strong disagree - I think OSS is a fine framing of this. Weights are a third category: you can 'fork' them in a way that you can't with standard binaries.

l33t233372 2y ago

You can add hooks to functions and “fork” binaries, which is a pretty similar effort to adding training data to given model weights.

IshKebab 2y ago

Nobody does that because if you only have binaries you probably don't have permission to do that. Plus it's impractical to make any significant changes that way.

williamstein 2y ago

Maybe there is no source code? I imagine an LLM is like the output of the following process. There's a huge room full of programmers that can directly edit machine code. You give them a random binary, which they then hack on for a while and publish the result. You then inspect it and tell them it isn't quite optimal in some way and ask them for a new version. Iterate on this process a bazillion times. At the end you get a binary that you're reasonably happy with. Nobody ever has the source code.

powersnail 2y ago

Source code is the preferred form for development.

In your scenario, despite the unrealistic coding process, the machine code is the source code, because that's what everyone is working on.

In the development of an LLM, the weights are in no way the preferred form for development. Programmers don't work on weights. They work on data, infrastructure, the model, the code for training, etc. The point of machine learning is not to work on weights.

Unless you anthropomorphize optimizers, in which case the weights are indeed the preferred form of editing, but I have never seen anyone---even the most forward AGI supporters---argue that optimizers are intelligent agents.

whimsicalism 2y ago

What? You work on the weights - you just do it using tools like the optimizers, etc.

You release your weights, others can build on top of that, fine tune it in different ways, produce new weights they can share with others. Seems very OSS-y.

I feel like there is some semantic nitpicky point being made here that is completely going over my head.

ssd532 2y ago

I read it in all such discussions. What does it mean? I just have a very high level understanding of AI models. No idea how things work under the hood or what knobs can be tweaked.

doctoboggan 2y ago

The source code is all the supporting code needed to run inference on the weights. This is usually Python, and in the case of llama it's already open source. Usually the source code is referred to as the "model". You can kind of think of the weights as a settings file in a normal desktop application. The desktop app has its own source code and loads in the settings file at runtime. It can load different settings files for different behaviors.

briancleland 2y ago

This is almost completely wrong. When people who work in AI refer to the "model", they are generally referring to the weights. It is the weights which are the most important determinant of how the model performs, and it is the weights that require the most resources to develop. Associated code and other assets are also important, but they are not the core asset. The intuitive sense of open sourcing a model therefore typically means releasing the weights under an open licence (ideally along with the training and inference code, data, training info, etc).

zargon 2y ago

Open source is about freedom to modify the product. So in the context of an LLM, the source code is the data and the code that processes the data during *training* (not only inference), as that is what generates the weights.
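
A rough sketch of that pipeline, assuming a tiny linear model in place of an LLM (the function `train` and the dataset `DATA` are hypothetical): the data and the training code are the inputs, and the weights are the generated artifact.

```python
# "Source" in this framing = the data + the training code.
# The weights are what falls out of running them together.

def train(data, steps=2000, lr=0.05):
    """Fit y = w*x + b by plain gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return {"w": w, "b": b}

# Illustrative training data sampled from y = 2x + 1.
DATA = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

weights = train(DATA)  # the "binary" that would get released
```

Anyone holding only `weights` can run the model, but changing what it learned in a principled way means editing `DATA` or `train` and regenerating, which is the freedom being argued about here.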

ssd532 2y ago

I thought the model is the output of training. It's a binary file, a black box. That's what I had read somewhere.

hoofedear 2y ago

Thank you for succinctly explaining the difference, I learned something today

StackOverlord 2y ago

Compiling source code doesn't cost millions of dollars though

ynniv 2y ago

That doesn't change the meaning of Open Source. These are "free as in beer", not "free as in [modify the sources and rebuild it]". There are LLMs for which that is true, which include a specific list of training data. If you wanted to "uncensor" one of those, you could curate the source data and rebuild it, instead of trying to get it to unlearn what it was taught.

pphysch 2y ago

If you had petabytes of highly interconnected source code, it could.

In a rough way, a NN is just a compiler designed to translate a boatload of simple data into a useful program that operates on similar data.

chejazi 2y ago

Yea the weights are the secret sauce that OpenAI and competitors generally protect.

greatpostman 2y ago

Meta is going to ruin OpenAI's moat on purpose. Great business strategy and good for everyone but Meta's competitors.

jonnat 2y ago

Quite the opposite, this is great for Meta's competitors. Meta is not trying to get market share with this strategy, it's trying to commoditize their complements (https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/)

Content is a complement to a social network: the cheaper it is to create content, the more content is available, the easier it is to optimize a feed, the more time people spend on the platform, the higher the revenue. GenAI is just a method to drive the cost of content creation to zero.

strikelaserclaw 2y ago

kind of a dystopian nightmare world in which large corporations utilize AI to create low cost, infinite content that humans engage with (mostly content catering to the human tendency for tribalism, prestige, sexual desires etc...), sounds like we are creating a world similar to the Matrix.

roody15 2y ago

I think we may have already entered it. Infinite-scroll feeds like TikTok, Instagram, and Threads (and possibly Reddit these days) … just an AI algorithm deciding what you should find “entertaining” or “important”.

It’s really the ultimate nightmare with the internet becoming just TV 3.0 in which content is controlled and curated … you just consume mindlessly.

Any attempts to create a Reddit clone.. or system in which people freely communicate is now “regulated” for “hate” speech or “terrorism”. The days of open discourse … appear to be numbered. Even email will be analyzed by AI to look for “trends” or “optimize” employee efficiency.

It really is time for a new internet.

nomel 2y ago

https://en.wikipedia.org/wiki/Infinite_Jest

> ... Infinite Jest, also called "the Entertainment" or "the samizdat". The film is so compelling that its viewers lose all interest in anything other than repeatedly viewing it, and thus eventually die.

_Algernon_ 2y ago

Infinite flame wars. Can't wait!

freedomben 2y ago

I wish I could upvote this a dozen times. This is a very insightful comment. Read the link above first if you aren't sure what "commoditize their complements" means.

herodoturtle 2y ago

Reminds me of Joel Spolsky’s essay on “Commoditize your complement”:

https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/

ekojs 2y ago

Seems that the source is an FT article that was discussed yesterday: https://news.ycombinator.com/item?id=36712168

From the FT article: '“The goal is to diminish the current dominance of OpenAI,” said one person with knowledge of high-level strategy at Meta.'

forgingahead 2y ago

Zuck is a total killer. What better way to fight Google and Microsoft than to effectively spawn thousands of potential competitors to their AI businesses with this (and other) releases. There will be a mad scramble over the released weights to develop new tech, these startups will raise tons of money, and then fight the larger incumbents.

This is not charity, this is a shrewd business move.

amelius 2y ago

"Commoditize your complement"

whimsicalism 2y ago

If you read past the title, this article is not at all clear about whether they are referring to a commercial offering (i.e. license our model for $$) or an open-source license with commercial usage (Apache, etc.).

My guess is still the latter because that's what I've heard the rumors about, but this article is pretty unclear on this fact.

brucethemoose2 2y ago

Falcon 40B was released under a "free with royalties above a certain amount of profit" license, and got roasted for it. It was so bad that they changed the license to Apache.

I don't think any business would run such a "licensed" model over MPT 30B or Falcon 40B, unless it's way better than LLaMA 65B.

whimsicalism 2y ago

I think it is supposed to be better than LLaMA 65B. Plenty of businesses are paying for OAI API access.

pmarreck 2y ago

I have a 128 core Threadripper, a 2080 Ti and a 3080 Ti.

How can I play with open source LLMs locally?

brucethemoose2 2y ago

Kobold.cpp is your best bet.

You can leverage those big CPUs while still loading both GPUs with a 65B model.

... If you are feeling extra nice, you should set that up as an AI horde worker whenever you run koboldcpp to play with models. It will run API requests for others in the background whenever it's not crunching your own requests, in return allowing you priority access to models other hosts are running: https://aihorde.net/
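
For a back-of-the-envelope check on why a 65B model needs both GPUs plus CPU offload, here is a sketch that counts weight memory only. It assumes 11 GB of VRAM on the 2080 Ti and 12 GB on the 3080 Ti, and it ignores the KV cache, activations, and runtime overhead, so real requirements are higher. `weight_memory_gb` is a hypothetical helper, not part of koboldcpp.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 65e9          # LLaMA 65B, parameter count taken at face value
GPU_VRAM_GB = 11 + 12    # assumed: 2080 Ti (11 GB) + 3080 Ti (12 GB)

for bits in (16, 8, 4):
    total = weight_memory_gb(N_PARAMS, bits)
    # Whatever does not fit in combined VRAM has to live in system RAM.
    on_cpu = max(0.0, total - GPU_VRAM_GB)
    print(f"{bits}-bit: {total:.1f} GB of weights, ~{on_cpu:.1f} GB offloaded to CPU RAM")
```

Under these assumptions, even at 4 bits per weight the model is about 32.5 GB, so roughly 9.5 GB of layers would still sit in system RAM and run on the CPU.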

pmarreck 2y ago

oooh, this is a great idea

brucethemoose2 2y ago

Also, I would suggest this model as one to play with:

https://huggingface.co/ycros/airoboros-65b-gpt4-1.4.1-PI-819...

Check the prompting syntax here, it has a huge effect on the output:

https://huggingface.co/jondurbin/airoboros-65b-gpt4-1.4

estreeper 2y ago

If you're just looking to play with something locally for the first time, this is the simplest project I've found and has a simple web UI: https://github.com/cocktailpeanut/dalai

It works for 7B/13B/30B/65B LLaMA and Alpaca (fine-tuned LLaMA which definitely works better). The smaller models at least should run on pretty much any computer.

brucethemoose2 2y ago

That project seems unmaintained, which is a problem because llama.cpp is changing extremely rapidly.

Also, it has no "1 click" exe release like kobold.

freedomben 2y ago

May I ask why you have such an amazing machine, and two nice graphics cards? Feel free to tell me it's none of my business, it's just very interesting to me :-)

pmarreck 2y ago

Career dev who had the cash and wanted to experiment with anything that can be done concurrently, such as in my language of choice lately, which features high concurrency (https://elixir-lang.org/) or these LLMs, or anything else that can be done in massively parallel fashion (which is, perhaps surprisingly, only a minority of possible computer work, but it still means I can run many apps without much slowdown!)

I originally had two 2080 Tis to experiment also with virtio/Proxmox (you need one for the host and one for any VM you run). I never got that running successfully at the time, but then Proton got really good (I mainly just wanted to run Windows games fast in a VM, and Proton made that unnecessary). Later on I upgraded one of them to a 3080 Ti.

It's a System76 machine, they make good stuff.

nickthegreek 2y ago

Check out r/LocalLlama for a bunch of resources.

loufe 2y ago

I'm surprised nobody here has brought up the censorship in this model. Listening to Mark Zuckerberg talk about it on Lex Fridman's podcast, it sounds like the model will be significantly blunted vs its "research" version release.

stale2002 2y ago

I remember arguing with people who honest to god thought that LLaMA was some sort of secret ploy to trick startups into using it, so that Meta could sue them for using it commercially.

Well now there is a commercial release. I guess it wasn't some corporate plot after all!

Some people just can't admit when a corporation does a good thing.

(In this case, the good thing is being done to obsolete their competitors, but it is good nonetheless that a commercial LLM is available for people to use for free.)

obblekk 2y ago

Maybe they've solved the fingerprinting problem and can identify text generated from their model, and this is a way of discovering the market they can sell more advanced models to directly. B2B leadgen...

vlovich123 2y ago

I don't think so because I believe you can train AI models against other AI models. I believe you can fingerprint a family of models, but that's not going to tell whether you just used the general approach outlined in the academic papers.

sva_ 2y ago

I mean you could probably just train it on some sequence s.t. the model identifies itself, would be hard to detect that

sebzim4500 2y ago

That would probably work to detect if e.g. OpenAI or Anthropic start using their weights directly. It wouldn't detect whether e.g. a blog was generated with their model or not.

0cf8612b2e1e 2y ago

From my quick skim I could not find a date. Any idea when this might happen?

rvz 2y ago

See. They don't care about the LLaMA model leak. It turns out that it was OpenAI that cares because it ruins their moat. It costs Meta nothing to release a better open-source or freely available version of LLaMA again.

Still waiting for the 'Meta is dying' and 'Fire Mark Zuckerberg' calls from last year. A year later, where are they now?

sebzim4500 2y ago

To be fair to those commenters in the past, I don't think anyone could have foreseen that Zuckerberg would turn out to be the "good guy".

TheBengaluruGuy 2y ago

This conversation triggers a thought.

Does it mean that any blog posts I wrote from my own insights will automatically be used to train the model… without my permission?

As an author, it feels like it’s stealing the knowledge and insight without appropriate attribution.

anaganisk 2y ago

I think we are at a point where we just have to let go, unless we are Disney with an army of lawyers. Maybe it's time for a change in thinking. Having said that, attribution allows a person to trace the source; it's not a success marker anymore. If enough negative statements generated by AI get popular, that could piss off countries or people. For example, if some LLM recognizes Taiwan as an independent country, you can bet China will push for attribution to sources. We already have bills pending in multiple countries that want access to personal or encrypted messages to trace the source.

Jeff_Brown 2y ago

What's the monetization model here? Is this a closed-source version of their open-source model? (That's suggested by the phrase in the article, "a commercial version of LLaMA, its open-source large language model".)

TechnicolorByte 2y ago

Like others said it’s probably to commoditize their competition. The models don’t matter so much as ownership of the platform and critical data. Which is why OpenAI is in a tricky position (although I guess they’re partnered with Microsoft).

It seems like the existing large platforms of today (Microsoft’s enterprise moat, Google’s ads and internet services, Meta’s social networks, Apple’s consumer and mobile products) will remain the primary platforms of the future. So having models that can operate exclusively on those platforms via integration with their key products and data will only continue this trend. If you’re an outsider with an AI model, you’ll have a harder time getting access to critical data and your standalone AI product (e.g., ChatGPT) won’t be as useful.

More broadly speaking, I believe the days where the top X largest companies in the stock market would be displaced by newer companies every decade or so are over. The FAANGs just control so many major platforms in so many aspects of our lives.

isaacremuant 2y ago

> More broadly speaking, I believe the days where the top X largest companies in the stock market would be displaced by newer companies every decade or so are over. The FAANGs just control so many major platforms in so many aspects of our lives.

It also helps that they buy or otherwise cooperate to destroy their competition in questionable ways while heavily lobbying the gov to favor them over others in a quid-pro-quo that benefits politicians and not their constituents.

sangnoir 2y ago

> More broadly speaking, I believe the days where the top X largest companies in the stock market would be displaced by newer companies every decade or so are over.

I disagree: I think big tech is hard to disrupt ATM because the companies are still young and nimble. In the last cycle, the companies being displaced were ancient (by tech standards). When Google and Facebook are 30 years old, their DNA will get in the way of adapting to a new paradigm that will change the world. A paradigm that may be to the Metaverse what the smartphone was to the Apple Newton.

dundun 2y ago

Google open-sourced TensorFlow because they believed it would help with the hiring process: if researchers could use the same framework to do their PhDs as Google used in their production systems, that was seen as an advantage.

Maybe that's Meta's play here? Maybe the idea is that the ecosystem around a model could be as valuable or more valuable than the model itself too, so an OSS model could benefit Meta a lot more by gaining more of the ecosystem mind share?

Or maybe Yann LeCun is just a hippie that dreams of free love, hard drugs and open-source models?

andy99 2y ago

LLaMA is already on its way to becoming an industry standard (in my opinion; look at llama.cpp plus everything built on LLaMA). There are benefits to being able to set direction like that. Same as PyTorch, for example: it's not just about direct revenue, it's about everyone building on and contributing to your platform.

They might have done well to make gg an offer he couldn't refuse and take on ggml and llama.cpp as an open source project.

valine 2y ago

LLaMA isn’t licensed for commercial use. It’s probably an update to the licensing.

Facebook benefits heavily from the open source development done on LLaMA. There was a report I saw that Facebook has started using llama.cpp internally for inference. Updates to the licensing will cement Facebook as the go-to choice for open source language models.

jerrygenser 2y ago

Based on the podcast with Lex Fridman and Mark Zuckerberg, see ~minute 30.

My hypothesis, based on the context of Mark discussing the release, is that it's going to be completely open source and can be licensed for commercial use. Not that Meta is going to add a whole new revenue side of the business to compete with OpenAI, i.e. "Here is model, with commercially permissive licensing", not "Here is model that you can use commercially but must pay me".

https://www.youtube.com/watch?v=Ff4fRgnuFgQ&ab_channel=LexFr...

hospitalJail 2y ago

Another hypothesis is that they are trying to rehab their brand.

They can even write it as 'good will' on their financial statements.

It kind of is working.

justapassenger 2y ago

Meta has been one of the major open source contributors for about a decade now. They open source/contribute to a lot of tech, as their business isn’t about tech, but products.

smoldesu2y ago

This isn't some recent revelation or anything. Facebook's AI team (FAIR) open sourced their major technology in 2017 with PyTorch. In 2018 they published PyText, at a time when most people didn't even know what a Large Language Model was. Seeing LLaMA get made should not be a surprise to anyone who is familiar with the history of AI research. It's like hearing people call CUDA an "unfair advantage" while ignoring the billions of Nvidia R&D dollars spent in the AI sector over the course of a decade.

It might feel like "brand rehab" or "good will" as a consumer, but a lot of this work was put in motion a while ago.

dpflan2y ago

I don’t know, maybe they don’t need to monetize their model? They need their models to be the best and to support their core business of ads; anything that keeps users on their platform, for any reason, is the goal. They need their models to be an industry standard and one upon which other things are built.

pueblito2y ago

I think the strategy is more to prevent competitors from monetizing

fullshark2y ago

That's a huge reason to do it also, but it also makes sense if you have researchers + developers improving the engine of something that powers your product. The moat / competitive advantage at FB is their network, not so much the proprietary underlying tech.

mtillman2y ago

People often say this but having interviewed ~200 facebook engineers over the years, their scaling tech around both software and hardware is pretty impressive.

1 more reply

treprinum2y ago

You still need to build real-time serving infrastructure on top of LLaMA/Vicuna/Alpaca in order to compete with ChatGPT/OpenAI so it's not going to be done by that many companies and OpenAI already has a mindshare/first mover advantage.

staticman22y ago

When you use ChatGPT you are leasing their GPU infrastructure and their proprietary model, this opens the possibility of leasing GPU infrastructure from another company and using an open source model. You don't necessarily need to do the hard parts yourself, you can hire it out to competing companies.

1 more reply

dpflan2y ago

Yes, commoditize the competition.

Roark662y ago

Well, if they really released it as open source, I guess, depending on the exact license, a company that modifies (fine-tunes) it and wants to make money on that modified version would have to distribute the weights and/or disclose the details of how they fine-tuned it: on what data, etc. With a commercial license, the buyer can do anything they want.

fnands2y ago

There is an open source version available, with weights that were leaked, but licensed as "for academic purposes only".

It seems they will release the weights under some license that allows commercial usage.

How they monetise it (which I assume they will try and do?) is an interesting question.

Maybe some variant of paying a licencing fee?

CharlesW2y ago

> What's the monetization model here?

There doesn't necessarily have to be one. Facebook's goal may be to help commoditize its complements. https://gwern.net/complement

zpeti2y ago

Free AI models mean more free content, which is exactly what drives Facebook's moat.

thatguymike2y ago

Commoditize Your Complement, it’s a strategic play since Meta is behind on LLMs.

elorant2y ago

You could pay to customize it and/or retrain it for your use case. Or you pay a subscription and every few months you receive updated weights.

RobotToaster2y ago

Maybe they just want the de facto standard LLM to be one that only says nice things about facebook and Zuckerberg?

discmonkey2y ago

Meta is a company that makes money off of users endlessly browsing content. It would follow that making it easier/faster to generate content would benefit Meta.

sagebird2y ago

repeat after me:

hardware is the only moat

If you want to live the good life before you are exquisitely extinguished, spend every other day figuring out how to buy more NVDA, the other days exercising outside, being human.

sifar2y ago

until better algorithms or newer paradigm obviate the need for large memory/computations

bilsbie2y ago

Is it possible to do further training on the weights they release?

brucethemoose22y ago

Yes, and there is a sea of finetunes. See: https://huggingface.co/models?sort=modified&search=Ggml

QLoRA is the most cost-effective method so far. Some people also do finetuning on Google TPUs.
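
To give a feel for why the LoRA-style adapters used by QLoRA are cheap, here is a minimal pure-Python sketch. The idea is that the base weight matrix W stays frozen (in QLoRA it is additionally stored in 4-bit form) and only two small low-rank factors, A and B, are trained. The function names, the hidden size of 8192, and the rank of 64 are illustrative assumptions, not taken from any particular library or finetune.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W is frozen, only A and B are trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


def lora_trainable_params(d_in, d_out, rank):
    """Parameter count of A (rank x d_in) plus B (d_out x rank)."""
    return rank * d_in + d_out * rank


# Tiny example: 2x2 identity base weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]        # rank x d_in
B = [[1.0], [2.0]]      # d_out x rank
y = lora_forward(W, A, B, [1.0, 2.0], alpha=2.0, rank=1)

# Size comparison for one square projection (hidden size is an assumption).
d = 8192
full_params = d * d
lora_params = lora_trainable_params(d, d, rank=64)
print(f"full: {full_params:,}  adapter: {lora_params:,}  ratio: {lora_params / full_params:.2%}")
```

With these assumed sizes the adapter is under 2% of the parameters of the matrix it modifies, which is roughly why this kind of finetuning fits on consumer hardware.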

40yearoldman2y ago

Is the title an oxymoron?

Open-source commercial?

RobotToaster2y ago

If anything it's a tautology, open source by definition allows commercial use.

satvikpendem2y ago

No, you can sell open source software commercially. That being said, I'm wondering if the license will truly be open source or more like Stable Diffusion's license which is not really open source.

schleck82y ago

Because deep learning weights aren't source code.

https://huggingface.co/blog/open_rail

isaacremuant2y ago

I think you could've googled that one and found years of discussion on it.

Free as in beer Vs free as in speech and the whole thing.

gpm2y ago

Commercial presumably as opposed to non-commercial licensing (e.g. the CC BY-NC license, or the weird situation LLaMa is in).

If you listen to the definition the Open Source Initiative would have applied to the term open source had they succeeded in acquiring rights to the term, then commercial is redundant with open source, not the opposite of it.


pmoriarty2y ago

"Copyright and licensing is one model that has worked for a long time. It has flaws, but it has produced good things."

1 more reply

4111111111111112y ago

You state this as a fact, but it's actually much less certain whether it's ever been net-positive.

1 more reply

IshKebab2y ago

The production of knowledge (I assume you're mainly talking about scientific research here) is absolutely not funded by copyright royalties or anything like that.

Journals get their content for free. Actually often they charge the authors for it.

Research is mainly funded by governments and taxes.

1 more reply

jrm42y ago

Put differently, we consider -- but don't think a whole lot about -- Wikipedia's "funding," because that's NOT the most important part/innovation of that model.

We should do a better job of answering what is.

Vt71fcAqt72y ago

>The production of knowledge needs to be funded as it isn’t “free”. Copyright and licensing is one model that has worked for a long time.

dogma11382y ago

l33t2333722y ago

If the company supplied the employee with the “illegally obtained” books, that could be reason to view the situation differently than an employee acting on their own.

Since the company is obtaining + providing these models with 100% of their input data, it could be argued they have some responsibility to verify the legality of their procurement of the data.

stainablesteel2y ago

it's not deemed illegal yet

it's in a weird place imo, with Japan ruling that anything goes for AI data, other countries are put under pressure to allow the same

ie,

you're allowed to scrape the web

you're allowed to take what you scrape and put it in a database

you're allowed to use your database to inform on decisions you might make, or content you might create

and if truly free and open source LLMs come into the game, then might the corporate ones become crippled from copyright? that's bad for business

brucethemoose22y ago

> It's not like Meta can remove these books from the training set without retraining from scratch (or at least the last checkpoint before they were used).

They probably can:

https://github.com/zjunlp/EasyEdit

> I wonder if this is going to cause issues down the road.

There are some popular Stable Diffusion models, being run in small businesses, that I am certain have CSAM in them because they have a particular 4chan model in their merging lineage.

... And yet, it hasn't blown up yet? I have no explanation, but running "illegal" weights seems more sustainable than I would expect.

Der_Einzige2y ago

brucethemoose22y ago

People have been saying this about underage hand drawn hentai forever, but it's still around.

Not that I am disagreeing with you. What I find particularly disturbing are the paid services for this.

Also, I have seen 2 separate OnlyFans pimps ask for help in a text generation chatroom. Something about automating "private" texting from their "girls."

1 more reply

whimsicalism2y ago

That is not at all the same thing as removing the books.

twayt2y ago

> They probably can:

xnx2y ago

It's an area of active research: https://ai.googleblog.com/2023/06/announcing-first-machine-u...

2 more replies

brucethemoose22y ago

They can probably prevent LLaMA from spitting out verbatim quotes from the books well enough to make proof difficult.

... But yeah, fundamentally the only way to throw out the books is to throw out the weights.

potsandpans2y ago

that is quite the spicy claim

wongarsu2y ago

_Algernon_2y ago

How is it different than training from random blogs, or stack overflow or in general "The Internet"?

schleck82y ago

Really, really bad look for Eleuther if this is true. I did not expect them to do something like this and not even see the issue with it.

Ancalagon2y ago

Move fast and break the law.

andybak2y ago

It's far from certain at this stage whether this does break the law.

1 more reply

innagadadavida2y ago

5 more replies

TX81Z2y ago

It worked for Uber!

1 more reply

Der_Einzige2y ago

Most large datasets are full of copyrighted content. They aren’t unique.

cameldrv2y ago

It seems difficult to argue that Meta can copy every ebook in existence to train a model, but then other people cannot copy the resulting model.

zargon2y ago

It's not open source, it's freeware or something like that. Weights aren't the source code of LLMs, they're the binaries.

spmurrayzzz2y ago

Maybe this is just semantics, but I don't know if the OSS-vs-freemium distinction matters all that much (I'd have to think about the potential downsides a bit more tbh).

zargon2y ago

Vetch2y ago

1 more reply

spmurrayzzz2y ago

> But I don't think we should let companies get away with poisoning the term open source for things which are not.

That's totally fair. And you're correct in that I was making an argument for positive outcomes being orthogonal to the semantic distinction.

axus2y ago

Sen. Marsha Blackburn said “fair use” protections have become a “fairly useful way to steal” intellectual property. Some people would like to use this situation to get rid of "fair use".

slimebot802y ago

Forgive my ignorance, but might it matter if a country was hoping to limit another country's advancement into weaponising AI?

whimsicalism2y ago

Strong disagree - I think OSS is a fine framing of this. Weights are a third category; you can 'fork' them in a way that you can't with standard binaries.

l33t2333722y ago

You can add hooks to functions and “fork” binaries, which is a pretty similar effort to adding training data to given model weights.

IshKebab2y ago

Nobody does that because if you only have binaries you probably don't have permission to do that. Plus it's impractical to make any significant changes that way.

1 more reply

williamstein2y ago

powersnail2y ago

Source code is the preferred form for development.

In your scenario, despite the unrealistic coding process, the machine code is the source code, because that's what everyone is working on.

whimsicalism2y ago

What? You work on the weights - you just do it using tools like the optimizers, etc.

You release your weights, others can build on top of that, fine tune it in different ways, produce new weights they can share with others. Seems very OSS-y.

I feel like there is some semantic nitpicky point being made here that is completely going over my head.

2 more replies

ssd5322y ago

I read it in all such discussions. What does it mean? I just have a very high level understanding of AI models. No idea how things work under the hood or what knobs can be tweaked.

doctoboggan2y ago

briancleland2y ago

1 more reply

zargon2y ago

ssd5322y ago

I thought model is the output of training. It's a binary file black box. That's what I had read somewhere.

1 more reply

hoofedear2y ago

Thank you for succinctly explaining the difference, I learned something today

StackOverlord2y ago

Compiling source code doesn't cost millions of dollars though

ynniv2y ago

pphysch2y ago

If you had petabytes of highly interconnected source code, it could.

In a rough way, a NN is just a compiler designed to translate a boatload of simple data into a useful program that operates on similar data.

chejazi2y ago

Yea the weights are the secret sauce that OpenAI and competitors generally protect.

greatpostman2y ago

Meta is going to ruin OpenAI's moat on purpose. Great business strategy, and good for everyone but Meta's competitors.

jonnat2y ago

strikelaserclaw2y ago

roody152y ago

It’s really the ultimate nightmare with the internet becoming just TV 3.0 in which content is controlled and curated … you just consume mindlessly.

It really is time for a new internet.

2 more replies

nomel2y ago

https://en.wikipedia.org/wiki/Infinite_Jest

_Algernon_2y ago

Infinite flame wars. Can't wait!

freedomben2y ago

I wish I could upvote this a dozen times. This is a very insightful comment. Read the link above first if you aren't sure what "commoditize their complements" means.

herodoturtle2y ago

Reminds me of Joel Spolsky’s essay on “Commoditize your complement”:

https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/

ekojs2y ago

Seems that the source is an FT article that was discussed yesterday: https://news.ycombinator.com/item?id=36712168

From the FT article: '“The goal is to diminish the current dominance of OpenAI,” said one person with knowledge of high-level strategy at Meta.'

forgingahead2y ago

This is not charity, this is a shrewd business move.

amelius2y ago

"Commoditize your complement"

whimsicalism2y ago

My guess is still the latter because that's what I've heard the rumors about, but this article is pretty unclear on this fact.

brucethemoose22y ago

Falcon 40B was released under a "free with royalties above a certain amount of profit" license, and got roasted for it. It was so bad that they changed the license to Apache.

I don't think any business would run such a "licensed" model over MPT 30B or Falcon 40B, unless it's way better than LLaMA 65B.

whimsicalism2y ago

I think it is supposed to be better than LLaMA 65B. Plenty of businesses are paying for OAI API access.

pmarreck2y ago

I have a 128 core Threadripper, a 2080 Ti and a 3080 Ti.

How can I play with open source LLMs locally?

brucethemoose22y ago

Kobold.cpp is your best bet.

You can leverage those big CPUs while still loading both GPUs with a 65B model.
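
A rough back-of-envelope for why the CPU/GPU split is needed, as a sketch only. It assumes about 65 billion weights at roughly 4.5 bits each (typical of a 4-bit quantized format), 80 transformer layers, and 11 GB plus 12 GB of VRAM on the 2080 Ti and 3080 Ti. The function name and the 1.5 GB of per-GPU headroom for context and scratch buffers are my own assumptions, and this is not Kobold.cpp's actual allocation logic.

```python
def estimate_gpu_layers(n_params, bits_per_weight, n_layers, vram_gb, headroom_gb=1.5):
    """Estimate model size in GB and how many layers fit in VRAM; the rest stay on the CPU."""
    model_gb = n_params * bits_per_weight / 8 / 1e9
    # Crude simplification: treat the weights as evenly spread across layers.
    per_layer_gb = model_gb / n_layers
    # Reserve some memory on each card (assumed value) before counting usable VRAM.
    usable_gb = sum(max(v - headroom_gb, 0.0) for v in vram_gb)
    gpu_layers = min(n_layers, int(usable_gb // per_layer_gb))
    return model_gb, gpu_layers


model_gb, gpu_layers = estimate_gpu_layers(65e9, 4.5, 80, vram_gb=[11, 12])
print(f"model ~{model_gb:.1f} GB, roughly {gpu_layers} of 80 layers fit on the GPUs")
```

Under these assumptions the quantized model is larger than both cards combined, so only about half the layers can be offloaded and the CPU runs the rest from system RAM.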

pmarreck2y ago

oooh, this is a great idea

brucethemoose22y ago

Also, I would suggest this model as one to play with:

https://huggingface.co/ycros/airoboros-65b-gpt4-1.4.1-PI-819...

Check the prompting syntax here, it has a huge effect on the output:

https://huggingface.co/jondurbin/airoboros-65b-gpt4-1.4

estreeper2y ago

If you're just looking to play with something locally for the first time, this is the simplest project I've found and has a simple web UI: https://github.com/cocktailpeanut/dalai

It works for 7B/13B/30B/65B LLaMA and Alpaca (fine-tuned LLaMA which definitely works better). The smaller models at least should run on pretty much any computer.

brucethemoose22y ago

That project seems unmaintained, which is a problem because llama.cpp is changing extremely rapidly.

Also, it has no "1 click" exe release like kobold.

freedomben2y ago

May I ask why you have such an amazing machine, and two nice graphics cards? Feel free to tell me it's none of my business, it's just very interesting to me :-)

pmarreck2y ago

It's a System76 machine, they make good stuff

nickthegreek2y ago

Check out r/LocalLlama for a bunch of resources.

loufe2y ago

stale20022y ago

I remember arguing with people who honest to god thought that LLaMA was some sort of secret ploy to trick startups into using it, so that Meta could sue them for using it commercially.

Well, now there is a commercial release. I guess it wasn't some corporate plot after all!

Some people just can't admit when a corporation does a good thing.

(In this case, the good thing is being done to obsolete their competitors, but it is good nonetheless that a commercial LLM is available for people to use for free.)

obblekk2y ago

vlovich1232y ago

sva_2y ago

I mean you could probably just train it on some sequence s.t. the model identifies itself, would be hard to detect that

sebzim45002y ago

That would probably work to detect if e.g. OpenAI or Anthropic start using their weights directly. It wouldn't detect whether e.g. a blog was generated with their model or not.

0cf8612b2e1e2y ago

From my quick skim I could not find a date. Any idea when this might happen?

rvz2y ago

Still waiting for the 'Meta is dying' and 'Fire Mark Zuckerberg' calls from last year. A year later, where are they now?

sebzim45002y ago

To be fair to those commenters in the past, I don't think anyone could have foreseen that Zuckerberg would turn out to be "the good guy".

TheBengaluruGuy2y ago

This conversation triggers a thought.

Does it mean that any blogs that I wrote from my own insights will automatically be used to train the model… without my permission?

As an author, it feels like it’s stealing the knowledge and insight without appropriate attribution.

anaganisk2y ago

Jeff_Brown2y ago

TechnicolorByte2y ago

isaacremuant2y ago

sangnoir2y ago

> More broadly speaking, I believe the days where the top X largest companies in the stock company would be displaced by newer companies every decade or so is over.

dundun2y ago
