The original "numbers every programmer should know" is a profound piece of pedagogy, aimed at helping programmers get better at their craft.
This one is just an excerpt from a pitch deck.
A100 specs:
- 312e12 BF16 FLOPS
- 1555 GB/s (1555e9 B/s) HBM bandwidth
H100:
- 1000e12/2000e12 BF16/INT8 FLOPS
(apply a ~0.7 FLOPS efficiency multiplier, because H100s power-throttle extremely quickly)
- 3000 GB/s HBM bandwidth
---
For a 13B model on an A100, this nets:
13e9 * 2 bytes per param = 26 GB HBM required (at bf16)
26e9/1555e9 = 17ms / token small-batch latency (~60 tokens / second)
What about large batches?
Compute latency for some batch size B is 13e9 * 2 FLOPs per param * B / 312e12.
We want B such that we're just about no longer HBM-bound: 26e9/312e12 * B = 17ms
<=> B = 17e-3 / (26e9/312e12),
giving a batch size of ~204.
At that batch size (and all larger batch sizes), the A100 delivers a throughput of B / 17ms ≈ 12,000 tokens/second.
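
If you want to sanity-check the arithmetic, here's a quick back-of-envelope sketch in Python (the spec numbers are just the ones quoted above, nothing more authoritative):

    # 13B model on one A100, bf16 weights, specs as quoted above
    params = 13e9
    bytes_per_param = 2               # bf16
    hbm_bw = 1555e9                   # bytes/s
    flops = 312e12                    # BF16 peak

    weights = params * bytes_per_param           # 26 GB
    mem_latency = weights / hbm_bw               # ~0.017 s/token, memory-bound

    # Compute time for batch size B: 2 FLOPs per param per token
    compute_latency = lambda B: 2 * params * B / flops

    # Crossover: smallest B where compute time matches the memory-bound time
    B = mem_latency / compute_latency(1)         # ~201 (rounding to 17 ms gives 204)
    throughput = B / mem_latency                 # ~12,000 tokens/s

    print(round(B), round(throughput))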
---
KV caching, multi-GPU and multi-node comms, and matmul efficiencies are left as an exercise for the reader :)
If you had the same initial thought as me, this is an excellent article: https://blog.eleuther.ai/transformer-math/
I’m just hacking away and presenting the LLM with some JSON data from our metrics database and making it answer user questions as a completion.
Is this embedding thing relevant for what I’m doing? Where should I start reading?
How to chunk those documents into semantically coherent pieces for context retrieval, though, that is the real challenge.
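
A simple baseline, if it helps anyone (just a sketch: split on blank lines and greedily pack whole paragraphs up to a rough size budget; real pipelines usually count tokens and add overlap):

    def chunk_paragraphs(text, max_chars=2000):
        # Greedy packing of whole paragraphs into ~max_chars chunks
        # (character count as a crude stand-in for a token budget).
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks, current = [], ""
        for p in paragraphs:
            if current and len(current) + len(p) + 2 > max_chars:
                chunks.append(current)
                current = p
            else:
                current = current + "\n\n" + p if current else p
        if current:
            chunks.append(current)
        return chunks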
Another cost-saving tip: on the API, do combo calls where possible to dual-use the input tokens, e.g.
"""You are an AI assistant that summarizes text given.
After the summarized text, add the word END.
After that, answer the following questions with Yes or No:
Is the text about Donald Trump?
Is the text about Space? """
The downside is that you now need code to parse the output pieces, plus error handling around that.
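
The parsing can be as simple as splitting on the marker (a sketch; it assumes the model actually emits END as instructed and puts one answer per line):

    def parse_combo(text):
        # Split the combined summarize-plus-classify output on the END marker.
        summary, sep, rest = text.partition("END")
        if not sep:
            # Model ignored the instruction; treat the whole output as summary.
            return summary.strip(), []
        # Remaining non-empty lines should be the Yes/No answers, in question order.
        answers = [ln.strip() for ln in rest.splitlines() if ln.strip()]
        return summary.strip(), answers

    summary, answers = parse_combo("A short summary.\nEND\nNo\nYes")
    # answers == ["No", "Yes"]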
Just FYI, in case anyone reading your comment tries your suggestion and hits the same issue: with firmer instructions the problem can be avoided. Though I haven't felt the need to experiment enough to understand exactly where the line is between stopping it from starting a task too early and being wastefully verbose in the prompt.