But they did move down and that's what's important.
There should probably be more aggressive learning-rate annealing for models trying to be Chinchilla-optimal, instead of just cosine-with-warmup like nearly every other model nowadays.
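For reference, a minimal sketch of the cosine-with-warmup schedule mentioned above (function name and default values are illustrative, not from any particular codebase): a linear ramp up to the peak learning rate, followed by cosine decay toward a floor.

```python
import math

def lr_cosine_warmup(step, max_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Cosine-with-warmup: linear ramp to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear warmup phase.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay phase: progress goes from 0 at end of warmup to 1 at max_steps.
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

More aggressive annealing would mean decaying faster or to a lower floor than this default shape.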
My current understanding of the story is, to recap:
- First the game was increase model size massively
- For example GPT-3 had 175B parameters, but was trained on less than 0.5T tokens of data
- Then Chinchilla showed for a given compute budget we can scale better by increasing training data
- Now we have models like this, and Phi, that have over 1T trained tokens
For any model, a falling loss curve could mean it's learning, or could mean it's overfitting; we don't know which without looking at validation loss, i.e. the loss on a held-out set of data the model never sees during training.
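The distinction above can be made mechanical. This is a toy heuristic (names and thresholds are my own, not from any framework): flag overfitting when training loss is still falling but validation loss has risen for several consecutive evaluations.

```python
def is_overfitting(train_losses, val_losses, patience=3):
    """Heuristic overfitting check: training loss still falling while
    validation loss has risen for `patience` consecutive evaluations."""
    if len(val_losses) <= patience:
        return False  # not enough history to decide
    recent = val_losses[-(patience + 1):]
    val_rising = all(b > a for a, b in zip(recent, recent[1:]))
    train_falling = train_losses[-1] < train_losses[-(patience + 1)]
    return val_rising and train_falling
```

If both curves fall together (as in the TinyLlama data being discussed), this returns False: the model is still genuinely learning.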
So getting back to your comment, I thought there were actually a multitude of indicators that should be used, not just validation loss, to determine what would be gained from more training.
You want to look at validation accuracy.
The experiment was fixed at 3 epochs on 1T tokens, they didn't decide to "stop" at a given criterion.
> we don’t know which without looking at validation loss, which is like a second set of test data the model hasn’t seen before.
The data I linked shows the validation loss, which has the same behavior as the training loss.
If the learning rate is too high at a given point in training, it can result in either a) the model stopping learning or b) exploding gradients, which is very bad.
It's not clear to me if this is applicable to LLMs though.
> we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs
I knew the computational power required to train LLMs was absurd, but with the figures for larger networks (which are just too large to grasp intuitively) it never really registered. With this one I could actually picture the 16 machines with A100 GPUs sitting in a server room running at full blast for 90 days, so it was more tangible... And now thinking about the larger ones is kinda scary.
Edit: Did the math and just the GPUs (at 250W each) consumed around 8.64 MWh, which is in the same ballpark as the yearly power consumption of the average US home (10.5 MWh).
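The arithmetic behind that figure, spelled out (250 W per GPU is an assumed average draw, not a measured one):

```python
gpus = 16
watts_per_gpu = 250   # assumed average draw per A100-40G under load
days = 90

# watts * hours -> watt-hours, then divide by 1e6 for MWh
energy_mwh = gpus * watts_per_gpu * days * 24 / 1e6
print(energy_mwh)  # 8.64
```

Note this counts GPUs only; CPUs, cooling, and PSU losses would push the real total higher.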
Of course, we’re probably both simplifying things too much, but if these numbers are good enough it’s an interesting perspective.
At these sorts of costs and a final size of 2.2GB, each MB cost a few dollars to produce.
https://github.com/rbitr/llm.f90/tree/optimize16/purefortran
https://github.com/99991/SimpleTinyLlama
The new checkpoints did not seem much better and they changed the chat format for some reason, so I did not port the new checkpoints yet. Perhaps I'll get to it this weekend.
Edit: you have some rare knowledge, I'm curious if you have any thoughts on small models good enough for RAG. Mistral 7B does well in my testing, but it's laughably slow, and 7B is just too much for mobile; both iOS and Android get crashy (4 tkns/s on Pixel Fold, similar on iOS). Similar problems on web from a good-enough 2-year-old i7.
I'd try Phi-2 but I want to charge for my app and the non-commercial usage license bars that. (all these hours building ain't free! And I can't responsibly give search away, scraping locally is too risky for the user, and the free search API I know of has laudable goals, but ultimately, is "trust me bro" as far as privacy goes)
I'm starting to think we might not get an open, RAG-capable model sub-7B without a concerted open source effort. Stability's distracted and spread thin, MS is all in on AI PCs(tm), and it's too commercially valuable for the big boys to give away.
My best guess (and if I had a concrete answer I'd be out building it) is that, absent a breakthrough, smaller models will mostly be for downstream tasks, like classifiers, that aren't generative, or fine-tuned as specialized generative models that only know one domain. I don't know how well this works for real use cases, but certainly way smaller models can generate Shakespeare-like text, for example; I don't actually know why you'd do that, though.
What's RAG?
RAG is retrieval-augmented generation: you retrieve relevant documents and feed them to the model as context. The retrieval part is way more important.
I've used the original 13B instruction-tuned Llama 2, quantized, and found it gives coherent answers about the context provided, i.e. the bottleneck was mostly getting good context.
When I played with long context models (like 16k tokens, and this was a few months ago, maybe they improved) they sucked.
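The "retrieval is the bottleneck" point above can be sketched with a toy pipeline (word-overlap scoring stands in for the embedding search a real system would use; all names here are illustrative):

```python
def retrieve(query, docs, k=2):
    """Toy retrieval: rank docs by word overlap with the query, keep top-k.
    Real systems use embeddings plus a vector index, but the shape is the same."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(query, docs, k=2):
    """Assemble the retrieved context into a prompt for the generator model."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

If `retrieve` returns junk, even a strong generator answers badly; that's why a modest quantized model can do fine once the context is good.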
Hopefully @xenova will make a copy with it soon.
I recall you were looking to sell it at some point. Was wondering what that process looked like, and why you ended up holding on to the site.
To answer your question: an earlier version of the site focused on surfacing AI news, but that space is super competitive and I don't think Emergent Mind did a better job than the other resources out there. I tried selling it instead of just shutting it down, but ultimately decided to keep it. I recently decided to pivot to covering arXiv papers, which is a much better fit than AI news. I think there's an opportunity with it to not only help surface trending papers, but help educate people about them too using AI (the GPT-4 summaries are just a start). A lot of the future work will be focused in that direction, but I'd also love any feedback folks have on what I could add to make it more useful.