If you are running a commercial service that uses AI, you buy a few dozen A100s, spend half a million dollars, and you are good for a while.
If you are running a commercial inferencing service, you spend tens of millions or get a cloud sponsor.
I have done experiments with 8B Llama 3 Q8 models on an M3 MBP. They run faster than I can read, and only occasionally fall off the rails.
The 3.8B Phi-3 mini is almost instantaneous for simple responses on my MBP.
When I want longer context windows, I use a hosted service somewhere else, but if I only need 8,000 tokens (99% of the time that is MORE than I need), any of my computers from the last 3 years handles it just fine.
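A back-of-envelope sketch of why this works on a laptop: at Q8, weights take roughly one byte per parameter, and the KV cache for an 8K context adds about a gibibyte. The model-shape numbers below are Llama 3 8B's published configuration (32 layers, 8 KV heads via GQA, head dim 128); treat this as an estimate, not an exact allocator trace.

```python
# Rough memory footprint for a quantized LLM running locally.

def weight_bytes(n_params: float, bits_per_weight: int) -> float:
    """Bytes needed to hold the quantized weights."""
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_val: int = 2) -> int:
    """Bytes for the K and V caches at full context (fp16 values)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_val

GIB = 1024 ** 3

weights = weight_bytes(8.0e9, 8)       # 8B params at 8 bits each
kv = kv_cache_bytes(32, 8, 128, 8192)  # Llama 3 8B shape, 8K context

print(f"weights:  {weights / GIB:.2f} GiB")  # ~7.45 GiB
print(f"kv cache: {kv / GIB:.2f} GiB")       # ~1.00 GiB
print(f"total:    {(weights + kv) / GIB:.2f} GiB")
```

Call it roughly 8.5 GiB total, which is why this fits comfortably in the unified memory of any recent MBP while a 70B model at the same quantization would not.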
But also check out the 8B and 70B Llama-3.1 models, which show improved benchmarks over the Llama-3 models released in April.