
refulgentis 2y ago · 0 points

This is extremely misleading. Source: I've been working on local LLMs for the past 10 months. I've got my Mac laptop too, and I'm bullish too, but we shouldn't breezily dismiss those concerns out of hand. In practice, it's single-digit tokens per second on a $4500 laptop for a model with weights half this size (Llama 2 70B Q2 GGUF => 29 GB, Q8 => 36 GB).


8 comments · 2 top-level

MacsHeadroom 2y ago · 2 in thread

Mixtral 8x7b only needs 12B of weights in RAM per generation.

2B for the attention head and 5B from each of 2 experts.

It should be able to run slightly faster than a 13B dense model, in as little as 16GB of RAM with room to spare.
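To make the arithmetic above concrete, here is a rough sketch of the weight memory for a mixture-of-experts model. The parameter counts are the approximate figures quoted in this comment (2B shared, 5B per expert, 2 of 8 experts active), not official specs, and the function names and bit widths are just illustrative assumptions. It only counts weights, not KV cache or runtime overhead.

```python
# Back-of-the-envelope weight memory for a mixture-of-experts model.
# Parameter counts are the rough figures quoted in this thread, not official specs.
SHARED_PARAMS = 2e9        # attention/shared weights (thread's figure)
PARAMS_PER_EXPERT = 5e9    # weights per expert (thread's figure)
NUM_EXPERTS = 8
ACTIVE_EXPERTS = 2         # experts used for any single token


def weight_gb(params, bits_per_weight):
    """Gigabytes needed to hold `params` weights at the given precision."""
    return params * bits_per_weight / 8 / 1e9


def moe_footprint(bits_per_weight):
    """Return (GB touched per token, GB for all experts) at a given precision."""
    active = SHARED_PARAMS + ACTIVE_EXPERTS * PARAMS_PER_EXPERT
    total = SHARED_PARAMS + NUM_EXPERTS * PARAMS_PER_EXPERT
    return weight_gb(active, bits_per_weight), weight_gb(total, bits_per_weight)


if __name__ == "__main__":
    for bits in (16, 8, 4):
        active_gb, total_gb = moe_footprint(bits)
        print(f"{bits:>2}-bit: per-token {active_gb:.1f} GB, all experts {total_gb:.1f} GB")
```

Under these assumptions the weights touched per token fit in 16GB at 8-bit or below, but the full set of experts does not fit in 16GB even at 4-bit.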

filterfiber 2y ago

> in as little as 16GB of RAM with room to spare.

I don't think that's the case; for full speed you still need (5B*8)/2 + 2~few B of overhead.

I think the experts are chosen per token? That means that yes, you technically only need two experts plus the router/overhead in VRAM per token, but you'll have to constantly load different experts unless you can fit them all, which would still be terrible for performance.

So you'll still be PCIe/RAM speed limited unless you can fit all of the experts into memory (or get really lucky and only need two experts).
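A rough way to see why swapping experts hurts: if every token forces new experts to be copied in, the copy time alone caps throughput. The sketch below is only an estimate under stated assumptions. The 5B-per-expert figure comes from this thread; the 4-bit precision and the 32 GB/s link speed are hypothetical placeholders, and the function name is made up for illustration.

```python
def tokens_per_second_when_swapping(expert_params, experts_swapped,
                                    bits_per_weight, link_gb_per_s):
    """Upper bound on tokens/s if `experts_swapped` experts must be copied
    over the link for every token. Ignores compute time and any caching."""
    bytes_moved = expert_params * experts_swapped * bits_per_weight / 8
    if bytes_moved == 0:
        return float("inf")  # nothing to copy, so the link is not the limit
    seconds_per_token = bytes_moved / (link_gb_per_s * 1e9)
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    # Worst case: both active experts change on every token.
    # 32 GB/s is an assumed link speed, not a measured number.
    rate = tokens_per_second_when_swapping(5e9, 2, 4, 32)
    print(f"at most {rate:.1f} tokens/s when swapping both experts per token")
```

With those placeholder numbers the ceiling is in the single digits of tokens per second before any compute happens, which is the point being made above. Real behavior depends on how often consecutive tokens reuse the same experts.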

dkarras 2y ago

No, it doesn't work that way. Experts can change per token, so for interactive speeds you need them all in memory unless you want to wait for model swaps between tokens.

coolspot 2y ago · 4 in thread

> $4500

Which is more than the price of an RTX A6000 48GB ($4k used on eBay)

brucethemoose2 2y ago

Which is outrageously priced, in case that's not clear. It's a 2020 RTX 3090 with doubled-up memory ICs, which is not much extra BoM.

baq 2y ago

Clearly it’s worth what people are willing to pay for it. At least it isn’t being used to compute hashes of virtual gold.


CamperBob2 2y ago

How fast does it run on that?

refulgentis (OP) 2y ago

quantization makes it hard to have exactly one answer -- I'd make a q0 joke, except that's real now -- i.e. reduce the 3.4 * 10^38 range of float 32 to 2, a boolean.

it's not very good, at all, but now we can claim some pretty massive speedups.

I can't find anything for Llama 2 70B on a 4090 after 10 minutes of poking around; 13B is about 30 tokens/s. It looks like people generally don't run 70B unless they have multiple 4090s.
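For anyone curious what reducing a float to a boolean can look like, here is a minimal sketch of one simple scheme: keep only the sign of each weight plus a single shared scale. This is an assumption chosen for illustration, not a description of how any particular quantization format works, and the function names are made up.

```python
def binarize(weights):
    """Quantize floats to 1 bit each: keep the sign, share one scale.
    The scale is the mean absolute value of the original weights."""
    scale = sum(abs(w) for w in weights) / len(weights)
    bits = [w >= 0 for w in weights]  # True means +scale, False means -scale
    return bits, scale


def dequantize(bits, scale):
    """Rebuild approximate floats from the sign bits and the shared scale."""
    return [scale if b else -scale for b in bits]


if __name__ == "__main__":
    original = [0.5, -1.5, 0.25, -0.75]
    bits, scale = binarize(original)
    restored = dequantize(bits, scale)
    for w, r in zip(original, restored):
        print(f"{w:+.2f} -> {r:+.2f}")
```

Every weight collapses to the same magnitude, so most of the information is gone, which fits the "not very good, at all" assessment above.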
