The original "numbers every programmer should know" is a profound piece of pedagogy, aimed at helping programmers get better at their craft.
This one is just an excerpt from a pitch deck.
A100 specs:
- 312e12 BF16 FLOPS
- 1555 GB/s (1555e9 B/s) HBM bandwidth
H100:
- 1000e12/2000e12 BF16/INT8 FLOPS
(apply a ~0.7 FLOPS efficiency multiplier, because H100s power-throttle extremely quickly)
- 3000 GB/s HBM bandwidth
---
For a 13B model on an A100, this nets:
13e9 * 2 bytes per param = 26 GB HBM required (at bf16)
26e9/1555e9 = 17ms / token small-batch latency (~60 tokens / second)
What about large batches?
Compute latency for some batch size B is 13e9 * 2 FLOPs per param * B / 312e12.
We want B such that we're just about no longer HBM-bound: 26e9/312e12 * B = 17ms
<=> B = 17e-3 / (26e9/312e12),
giving a batch size of ~204.
At that batch size (and all larger batch sizes), the A100 delivers a throughput of B / 17ms ≈ 12,000 tokens/second.
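
If you want to sanity-check the arithmetic, here's a quick back-of-envelope sketch in Python (the spec numbers are just the ones quoted above, nothing more authoritative):

    # 13B model on one A100, bf16 weights, specs as quoted above
    params = 13e9
    bytes_per_param = 2               # bf16
    hbm_bw = 1555e9                   # bytes/s
    flops = 312e12                    # BF16 peak

    weights = params * bytes_per_param           # 26 GB
    mem_latency = weights / hbm_bw               # ~0.017 s/token, memory-bound

    # Compute time for batch size B: 2 FLOPs per param per token
    compute_latency = lambda B: 2 * params * B / flops

    # Crossover: smallest B where compute time matches the memory-bound time
    B = mem_latency / compute_latency(1)         # ~201 (rounding to 17 ms gives 204)
    throughput = B / mem_latency                 # ~12,000 tokens/s

    print(round(B), round(throughput))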
---
KV caching, multi-GPU and multi-node comms, and matmul efficiencies are left as an exercise for the reader :)
If you had the same initial thought as me, this is an excellent article: https://blog.eleuther.ai/transformer-math/
I’m just hacking away and presenting the LLM with some JSON data from our metrics database and making it answer user questions as a completion.
Is this embedding thing relevant for what I’m doing? Where should I start reading?
How to chunk those documents into semantically coherent pieces for context retrieval, though, that is the real challenge.
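
A simple baseline, if it helps anyone (just a sketch: split on blank lines and greedily pack whole paragraphs up to a rough size budget; real pipelines usually count tokens and add overlap):

    def chunk_paragraphs(text, max_chars=2000):
        # Greedy packing of whole paragraphs into ~max_chars chunks
        # (character count as a crude stand-in for a token budget).
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks, current = [], ""
        for p in paragraphs:
            if current and len(current) + len(p) + 2 > max_chars:
                chunks.append(current)
                current = p
            else:
                current = current + "\n\n" + p if current else p
        if current:
            chunks.append(current)
        return chunks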
Another cost-saving tip: on the API, do combo calls where possible to dual-use the input tokens, e.g.
"""You are an AI assistant that summarizes text given.
After the summarized text, add the word END.
After that, answer the following questions with Yes or No:
Is the text about Donald Trump?
Is the text about Space? """
The downside is that you now need code to parse the output pieces, plus error handling around that.
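
The parsing can be as simple as splitting on the marker (a sketch; it assumes the model actually emits END as instructed and puts one answer per line):

    def parse_combo(text):
        # Split the combined summarize-plus-classify output on the END marker.
        summary, sep, rest = text.partition("END")
        if not sep:
            # Model ignored the instruction; treat the whole output as summary.
            return summary.strip(), []
        # Remaining non-empty lines should be the Yes/No answers, in question order.
        answers = [ln.strip() for ln in rest.splitlines() if ln.strip()]
        return summary.strip(), answers

    summary, answers = parse_combo("A short summary.\nEND\nNo\nYes")
    # answers == ["No", "Yes"]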
Just FYI, in case anyone reading your comment tries your suggestion and hits the same issue: with firmer instructions the problem can be avoided. Though I haven't felt the need to experiment enough to understand exactly where the line is between stopping it from starting a task too early and being wastefully verbose in the prompt.