But they did move down and that's what's important.
There should probably be more aggressive learning-rate annealing for models trying to be Chinchilla-optimal, instead of just cosine-with-warmup like nearly every other model nowadays.
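For reference, a minimal sketch of the cosine-with-warmup schedule mentioned above (function name and default values are illustrative, not from any particular codebase): a linear ramp up to the peak learning rate, followed by cosine decay toward a floor.

```python
import math

def lr_cosine_warmup(step, max_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Cosine-with-warmup: linear ramp to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear warmup phase.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay phase: progress goes from 0 at end of warmup to 1 at max_steps.
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

More aggressive annealing would mean decaying faster or to a lower floor than this default shape.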
My current understanding of the story is, to recap:
- First the game was increase model size massively
- For example GPT-3 had 175B parameters, but was trained on less than 0.5T tokens of data
- Then Chinchilla showed for a given compute budget we can scale better by increasing training data
- Now we have models like this, and Phi, that have over 1T trained tokens
For any model, a falling loss curve could mean it's learning, or could mean it's overfitting; we don't know which without looking at validation loss, i.e. the loss on a held-out set of data the model never sees during training.
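The distinction above can be made mechanical. This is a toy heuristic (names and thresholds are my own, not from any framework): flag overfitting when training loss is still falling but validation loss has risen for several consecutive evaluations.

```python
def is_overfitting(train_losses, val_losses, patience=3):
    """Heuristic overfitting check: training loss still falling while
    validation loss has risen for `patience` consecutive evaluations."""
    if len(val_losses) <= patience:
        return False  # not enough history to decide
    recent = val_losses[-(patience + 1):]
    val_rising = all(b > a for a, b in zip(recent, recent[1:]))
    train_falling = train_losses[-1] < train_losses[-(patience + 1)]
    return val_rising and train_falling
```

If both curves fall together (as in the TinyLlama data being discussed), this returns False: the model is still genuinely learning.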
So getting back to your comment, I thought there were actually a multitude of indicators that should be used, not just validation loss, to determine what would be gained from more training.
You want to look at validation accuracy.
The experiment was fixed at 3 epochs on 1T tokens, they didn't decide to "stop" at a given criterion.
> we don’t know which without looking at validation loss, which is like a second set of test data the model hasn’t seen before.
The data I linked shows the validation loss, which has the same behavior as the training loss.
If the learning rate is too high at a given point in training, it can result in either a) the model stopping learning or b) exploding gradients, which is very bad.
It's not clear to me if this is applicable to LLMs though.
> we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs
I knew the computational power required to train LLMs was absurd, but with the figures for larger networks (which are just too large to grasp intuitively) it never really registered. With this one I could actually picture the 16 machines with A100 GPUs sitting in a server room running at full blast for 90 days, so it was more tangible... And now thinking about the larger ones is kinda scary.
Edit: Did the math and just the GPUs (at 250W each) consumed around 8.64 MWh, which is in the same ballpark as the yearly power consumption of the average US home (10.5 MWh).
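The arithmetic behind that figure, spelled out (250 W per GPU is an assumed average draw, not a measured one):

```python
gpus = 16
watts_per_gpu = 250   # assumed average draw per A100-40G under load
days = 90

# watts * hours -> watt-hours, then divide by 1e6 for MWh
energy_mwh = gpus * watts_per_gpu * days * 24 / 1e6
print(energy_mwh)  # 8.64
```

Note this counts GPUs only; CPUs, cooling, and PSU losses would push the real total higher.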
Of course, we’re probably both simplifying things too much, but if these numbers are good enough it’s an interesting perspective.
At these sorts of costs and a final size of 2.2GB, each MB cost a few dollars to produce.
https://github.com/rbitr/llm.f90/tree/optimize16/purefortran
https://github.com/99991/SimpleTinyLlama
The new checkpoints did not seem much better and they changed the chat format for some reason, so I did not port the new checkpoints yet. Perhaps I'll get to it this weekend.
Edit: you have some rare knowledge, I'm curious if you have any thoughts on small models good enough for RAG. Mistral 7B does well in my testing, but it's laughably slow, and 7B is just too much for mobile; both iOS and Android get crashy (4 tkns/s on Pixel Fold, similar on iOS). Similar problems on web from a good-enough 2-year-old i7.
I'd try Phi-2 but I want to charge for my app and the non-commercial usage license bars that. (all these hours building ain't free! And I can't responsibly give search away, scraping locally is too risky for the user, and the free search API I know of has laudable goals, but ultimately, is "trust me bro" as far as privacy goes)
I'm starting to think we might not get an open, RAG-capable model sub-7B without a concerted open source effort. Stability's distracted and spread thin, MS is all in on AI PCs(tm), and it's too commercially valuable for the big boys to give away.
My best guess (and if I had a concrete answer I'd be out building it) is that, absent a breakthrough, smaller models will mostly be for downstream tasks, like classifiers, that aren't generative, or fine-tuned as specialized generative models that only know one domain. I don't know how well this works for real use cases, but certainly way smaller models can generate Shakespeare-like text, for example; I don't actually know why you'd do that, though.
What's RAG?
RAG is retrieval-augmented generation: you retrieve relevant documents and feed them to the model as context. The retrieval part is way more important.
I've used the original 13B instruction-tuned Llama 2, quantized, and found it gives coherent answers about the context provided, i.e. the bottleneck was mostly getting good context.
When I played with long context models (like 16k tokens, and this was a few months ago, maybe they improved) they sucked.
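The "retrieval is the bottleneck" point above can be sketched with a toy pipeline (word-overlap scoring stands in for the embedding search a real system would use; all names here are illustrative):

```python
def retrieve(query, docs, k=2):
    """Toy retrieval: rank docs by word overlap with the query, keep top-k.
    Real systems use embeddings plus a vector index, but the shape is the same."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(query, docs, k=2):
    """Assemble the retrieved context into a prompt for the generator model."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

If `retrieve` returns junk, even a strong generator answers badly; that's why a modest quantized model can do fine once the context is good.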
Hopefully @xenova will make a copy with it soon.
I recall you were looking to sell it at some point. Was wondering what that process looked like, and why you ended up holding on to the site.
To answer your question: an earlier version of the site focused on surfacing AI news, but that space is super competitive and I don't think Emergent Mind did a better job than the other resources out there. I tried selling it instead of just shutting it down, but ultimately decided to keep it. I recently decided to pivot to covering arXiv papers, which is a much better fit than AI news. I think there's an opportunity with it to not only help surface trending papers, but help educate people about them too using AI (the GPT-4 summaries are just a start). A lot of the future work will be focused in that direction, but I'd also love any feedback folks have on what I could add to make it more useful.