I'm not sure what happened but I just sent an email out internally to ask people not to do this. The team might have gotten overly excited by this because they were all part of the creation of the dataset and the model.
I was hoping to see good discussion around it too. And it would have happened had Databricks employees or PR people not created a hundred accounts to comment on this and the previous Dolly post.
This announcement does not provide any benchmarks, so it is impossible to tell how useful the model is.
I brought up the issue of the "dirty" model in their last announcement thread, very cool to see them take that to heart and quickly address the issue. Impressive marketing and engineering.
All datasets are biased, including this specific one. However, we believe it's still very valuable to open source, for a few reasons:
- This dataset is primarily used to train instruction reasoning, not knowledge. (Keep in mind that Dolly and the other well-known models have not been specifically trained for knowledge; they are all just demonstrating instruction reasoning.) The lack of a truly open source instruction dataset (available for both research and commercial use) is the primary blocker to making these LLMs available for commercial use.
- We hope this will lead to not just open source innovations in models, but also future training datasets.
- Given the international makeup of our employee base, it's likely more diverse than datasets created by a small number of human labelers. And it is easier to identify, discuss, and debate dataset bias in the open.
The only other one I've seen that's actually open source is OpenAssistant, which is also based on the Pythia models, I believe.
From https://huggingface.co/databricks/dolly-v2-12b#benchmark-met..., it seems like dolly-v2-12b's benchmark results are actually slightly worse than dolly-v1-6b.
A commercially viable instruction-tuned LLM is still a huge deal.
>> How do I build a campfire?
> Safety should always come first when starting a campfire.
Hold up: should I touch the fire? It doesn't say.
OK, there's perfectly legitimate advice in the output, like "have water nearby," but give me a break already. They're finetuning for commercial application. If I'm building business tools, I'm not putting kid gloves on it. I don't have time for a lecture every time I need an answer.
You can put a safety model in front of an unencumbered model if you want. We don't need to conflate the two.
But I love the thought here. I didn't realize the instruction tuning for GPT came from only 40 people. It really puts into perspective how easily a motivated large organization could bring its employees to bear on something like this, and I'm grateful that Databricks has done it and is sharing it here.
I wish I understood how LLMs work a little better. This is a neat piece of the puzzle I wasn't fully aware of. But now my mental model is that LLMs work with kind of "three layers" of inputs:
* The base many-billion or even trillion parameter model, trained on a huge corpus of text, which basically is how it learns to use language as I/O.
* The instruction tuning, on just tens of thousands of inputs, to give the raw model some further guidance. This is a sort of transfer learning, maybe? Doing further training on top of a big model?
* The prompt itself can provide further inputs and context to tweak how the response should look.
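As a rough illustration of how those three layers meet at inference time: layers one and two are baked into the model's weights, and layer three is just a template filled in per request. Here's a sketch; the template wording below mimics the Alpaca/Dolly style but the exact text is an assumption, not quoted from any official source:

```python
# Sketch: assembling the prompt (layer 3) for an instruction-tuned model.
# Layers 1 and 2 (base training + instruction tuning) live in the weights;
# this template is all the caller controls at request time.
# Template wording is an assumption modeled on the Alpaca/Dolly style.

INSTRUCTION_TEMPLATE = """Below is an instruction that describes a task. \
Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
"""

def build_prompt(instruction: str) -> str:
    """Fill the instruction into the template the model was tuned on."""
    return INSTRUCTION_TEMPLATE.format(instruction=instruction)

print(build_prompt("How do I build a campfire?"))
```

The key point is that instruction tuning teaches the model to expect and complete this kind of scaffold, which is why the same base model behaves so differently before and after tuning.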
I had been thinking of LLMs in terms of just the first layer (the base model) and the last layer (the prompt), and figured you could get progressively more sophisticated with the prompt "context" to get LLMs tailor-made for your particular use case.
But actually, there's a decent chunk of space to explore in the instruction tuning? Like, say you wanted an LLM to help lawyers with case law or something, to keep it from hallucinating as much and to make it more detailed and useful. Is that something that would fit in the middle layer? Could a "legal AI startup" tackle that problem by starting with a big open source base model, proprietarily tuning it with tens of thousands of legal questions and answers, and then sharing that model with law firms, with maybe a customer support rep at the firm doing the final tweaking via the prompt context? Is that how this all fits together?
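If that middle-layer route works the way the released dataset suggests, the data-preparation step would look roughly like this: collect domain Q&A pairs in the same shape as an open instruction dataset. The field names below follow the databricks-dolly-15k schema (instruction / context / response / category); the legal example itself is entirely hypothetical:

```python
import json

# Hypothetical instruction-tuning records for a legal-domain fine-tune.
# Field names follow the databricks-dolly-15k schema; the content is
# made up for illustration only.
examples = [
    {
        "instruction": "Summarize the holding of this case.",
        "context": "Smith v. Jones (hypothetical): the court held that ...",
        "response": "The court held that ...",
        "category": "summarization",
    },
    {
        "instruction": "Is this clause enforceable under the facts given?",
        "context": "The contract contains a non-compete covering ...",
        "response": "Likely not, because ...",
        "category": "closed_qa",
    },
]

def to_jsonl(records):
    """Serialize records as JSONL, a common input format for fine-tuning."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

print(to_jsonl(examples))
```

A few tens of thousands of records like these, fed to a supervised fine-tuning run on an open base model, is essentially what the middle layer is.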
I found the examples here of digesting Databricks info and customer support tickets really interesting. How exactly would large companies like Databricks tailor LLMs to their particular use cases and data?
@dang, any chance we can just ban all these accounts? Seems to be pretty cut and dry here.
It also shows how to build and train these things on Databricks, so maybe more people will use them to make custom-trained LLMs.