I'm not sure what happened but I just sent an email out internally to ask people not to do this. The team might have gotten overly excited by this because they were all part of the creation of the dataset and the model.
I was hoping to see good discussion around it too. And it would have happened had Databricks employees or PR people not created a hundred accounts to comment on this and the previous Dolly post.
This announcement does not provide any benchmarks, so it is impossible to tell how useful the model is.
I brought up the issue of the "dirty" model in their last announcement thread, very cool to see them take that to heart and quickly address the issue. Impressive marketing and engineering.
All datasets are biased, including this specific one. However, we believe it's still very valuable to open source, for a few reasons:
- This dataset is primarily used to train instruction reasoning, not knowledge. (Keep in mind that Dolly and the other well-known models have not been specifically trained for knowledge; they are all just demonstrating instruction reasoning.) The lack of a truly open source instruction dataset (available for both research and commercial use) is the primary blocker to making these LLMs available for commercial use.
- We hope this will lead to not just open source innovations in models, but also future training datasets.
- Given the international makeup of our employee base, it's likely more diverse than datasets created by a small number of human labelers. And it is easier to identify, discuss, and debate dataset bias in the open.
The only other one I've seen that's actually open source is OpenAssistant, which is also based on the Pythia models, I believe.
From https://huggingface.co/databricks/dolly-v2-12b#benchmark-met..., it seems like dolly-v2-12b's benchmark results are actually slightly worse than dolly-v1-6b.
A commercially viable instruction-tuned LLM is still a huge deal.
>> How do I build a campfire?
> Safety should always come first when starting a campfire.
Hold up: should I touch the fire? It doesn't say.
OK, there's perfectly legitimate advice in the output, like "have water nearby," but give me a break already. They're finetuning for commercial application. If I'm building business tools, I'm not putting kid gloves on it. I don't have time for a lecture every time I need an answer.
You can put a safety model in front of an unencumbered model if you want. We don't need to conflate the two.
But I love the thought here. I didn't realize the instruction tuning for GPT came from only 40 people. It really puts into perspective how easily a motivated large organization could bring its employees to bear on something like this, and I'm grateful that Databricks has done it and is sharing it here.
I wish I understood how LLMs work a little better. This is a neat piece of the puzzle I wasn't fully aware of. But now my mental model is that LLMs work with kind of "three layers" of inputs:
* The base many-billion or even trillion parameter model, trained on a huge corpus of text, which basically is how it learns to use language as I/O.
* The instruction tuning, on just tens of thousands of inputs, to give the raw model some further guidance. This is a sort of transfer learning, maybe? Doing further training on top of a big model?
* The prompt itself can provide further inputs and context to tweak how the response should look.
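As a rough illustration of how those three layers meet at inference time: layers one and two are baked into the model's weights, and layer three is just a template filled in per request. Here's a sketch; the template wording below mimics the Alpaca/Dolly style but the exact text is an assumption, not quoted from any official source:

```python
# Sketch: assembling the prompt (layer 3) for an instruction-tuned model.
# Layers 1 and 2 (base training + instruction tuning) live in the weights;
# this template is all the caller controls at request time.
# Template wording is an assumption modeled on the Alpaca/Dolly style.

INSTRUCTION_TEMPLATE = """Below is an instruction that describes a task. \
Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
"""

def build_prompt(instruction: str) -> str:
    """Fill the instruction into the template the model was tuned on."""
    return INSTRUCTION_TEMPLATE.format(instruction=instruction)

print(build_prompt("How do I build a campfire?"))
```

The key point is that instruction tuning teaches the model to expect and complete this kind of scaffold, which is why the same base model behaves so differently before and after tuning.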
I had been thinking of LLMs in terms of just the first layer (the base model) and the last layer (the prompt), and figured you could get progressively more sophisticated with the prompt "context" to get LLMs tailor-made for your particular use case.
But actually, there's a decent chunk of space to explore in the instruction tuning? Like, say you wanted an LLM to help lawyers with case law or something, to keep it from hallucinating as much and to make it more detailed and useful. Is that something that would fit in the middle layer? Could a "legal AI startup" tackle that problem by starting with a big open source base model, proprietarily tuning it with tens of thousands of legal questions and answers, and then sharing that model with law firms, with maybe a customer support rep at the firm doing the final tweaking via the prompt context? Is that how this all fits together?
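If that middle-layer route works the way the released dataset suggests, the data-preparation step would look roughly like this: collect domain Q&A pairs in the same shape as an open instruction dataset. The field names below follow the databricks-dolly-15k schema (instruction / context / response / category); the legal example itself is entirely hypothetical:

```python
import json

# Hypothetical instruction-tuning records for a legal-domain fine-tune.
# Field names follow the databricks-dolly-15k schema; the content is
# made up for illustration only.
examples = [
    {
        "instruction": "Summarize the holding of this case.",
        "context": "Smith v. Jones (hypothetical): the court held that ...",
        "response": "The court held that ...",
        "category": "summarization",
    },
    {
        "instruction": "Is this clause enforceable under the facts given?",
        "context": "The contract contains a non-compete covering ...",
        "response": "Likely not, because ...",
        "category": "closed_qa",
    },
]

def to_jsonl(records):
    """Serialize records as JSONL, a common input format for fine-tuning."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

print(to_jsonl(examples))
```

A few tens of thousands of records like these, fed to a supervised fine-tuning run on an open base model, is essentially what the middle layer is.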
I found the examples here of digesting Databricks info and customer support tickets really interesting. How exactly would large companies like Databricks tailor LLMs to their particular use cases and data?
@dang, any chance we can just ban all these accounts? Seems to be pretty cut and dry here.
It also shows how to build and train these things on Databricks, so maybe more people will use them to make custom-trained LLMs.