What's interesting about this demo is the speed at which it runs, which showcases the "Groq LPU™ Inference Engine".
That's explained here: https://groq.com/lpu-inference-engine/
> This is the world’s first Language Processing Unit™ Inference Engine, purpose-built for inference performance and precision. How performant? Today, we are running Llama-2 70B at over 300 tokens per second per user.
I think the LPU is a custom hardware chip, though the page talking about it doesn't make that as clear as it could.
https://groq.com/products/ makes it a bit clearer - there's a custom chip, the "GroqChip™ Processor".
https://groq.com/wp-content/uploads/2023/05/GroqISCAPaper202...
EDIT: I work at Groq, but I'm commenting in a personal capacity.
happy to answer clarifying questions or forward them along to folks who can :)
Are they also used for training or just inference?
I can’t find any information about an API, though I’m guessing that the costs are eye-watering.
If they offered a Mixtral endpoint that did 300-400 tokens per second at a reasonable cost, I can’t imagine ever using another provider.
I actually have a final round interview with a subsidiary of Groq coming up, and I'm very undecided about whether to pursue it, so this thread felt extraordinarily serendipitous. Plenty of food for thought here.
But if it was generating high-quality responses, would that not make it go slower?
[0]: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...
I asked about creating illicit substances — an obvious (and reasonable) target for censorship. And, admirably, it suggested getting help instead. That’s fine.
But I asked for a poem about pumping gas in the style of Charles Bukowski, and it moaned that I shouldn’t ask for such mean-spirited, rude things. It wouldn’t dare create such a travesty.
To test which underlying model I asked it what a good sexy message for my girlfriend for Valentine's Day would be, and it lectured me about objectification.
It makes sense that the chat interface is using the chat model; I just wish people were more consistent about labeling the use of Llama-2-chat vs Llama-2, as the fine-tuning really does lead to significant underlying differences.
Really impressed by their hardware.
I'm still wondering why uptake has been so slow. My understanding from their presentations was that compiling a model is relatively simple. Why isn't it more talked about? And why not demo Mixtral, or showcase multiple models?
edit: I re-ran the same prompt on Perplexity's llama-2-70b and got 59 tokens per second there
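For anyone wanting to reproduce that kind of comparison, here's a minimal Python sketch of measuring throughput from any streaming token iterator. The fake stream is a stand-in; in practice you'd wrap the streaming iterator of whichever API you're benchmarking.

```python
import time

def measure_tokens_per_second(token_stream):
    """Consume an iterable of tokens and time the full run.

    For a real benchmark, pass the streaming iterator returned by
    an LLM API; here we just count whatever the iterable yields."""
    start = time.perf_counter()
    n_tokens = sum(1 for _ in token_stream)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed if elapsed > 0 else float("inf")

# Fake stream as a placeholder; a real run would stream from the API.
fake_stream = (tok for tok in ["Hello", ",", " world"] * 100)
rate = measure_tokens_per_second(fake_stream)
```

Note that measuring a local generator like this gives an absurdly high number; the point is only the shape of the measurement, which is dominated by network and model latency against a real endpoint.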
They probably weren't specific enough about which model it was built on, referring to Llama-2-chat as a Llama-2 model (which is technically correct).
This doesn't mean much without comparing dollars or watts against GPU equivalents.
You're right that it's important to compare cost per token as well, not just raw speed. Unfortunately I don't have those figures to hand, but I think our customer offerings are price-competitive with OpenAI's. The biggest takeaway, though, is that we just don't believe GPU architectures can ever scale to the performance we can get, at any cost.
May as well wait for the whole response and render it. Or render paragraph at a time.
Don’t jiggle the UI while rendering.
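Rendering paragraph-at-a-time is easy to do on top of a token stream; here's a small Python sketch (splitting on blank lines is an assumption about how the model formats paragraphs):

```python
def paragraphs(token_stream):
    """Re-chunk a stream of text fragments into complete paragraphs,
    so the UI re-renders once per paragraph instead of once per token."""
    buf = ""
    for chunk in token_stream:
        buf += chunk
        # A blank line ("\n\n") marks a paragraph boundary.
        while "\n\n" in buf:
            para, buf = buf.split("\n\n", 1)
            yield para
    if buf:
        yield buf  # flush whatever trails after the last boundary

chunks = ["First par", "agraph.\n\nSecond", " paragraph."]
paras = list(paragraphs(chunks))  # two complete paragraphs
```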
But what are these LPUs optimized for: tensor operations (like Google's TPUs) or LLMs/Transformers architecture?
If it is the latter, how would they/their clients adapt if a new (improved) architecture hits the market?
It said December 2022, but the answer to another question was not correct for that time or now. It also went into some kind of repeating loop up to its maximum response length.
Still pretty cool that our standards for chat programs have risen.
What are some relevant speed metrics? Output tokens per second? How about the number of input tokens -- does that matter, and how does it factor in?
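One common split is time-to-first-token (where input length mostly shows up, as prefill cost) versus decode throughput after the first token. A small Python sketch with made-up timestamps (the numbers are illustrative, not measured):

```python
def latency_metrics(t_request, t_first_token, t_done, n_output_tokens):
    """Split a generation into two metrics:
    - ttft_s: time to first token (dominated by prompt/prefill work,
      so it grows with input length)
    - decode_tokens_per_s: steady-state output throughput after that."""
    ttft = t_first_token - t_request
    decode_time = t_done - t_first_token
    tps = n_output_tokens / decode_time if decode_time > 0 else float("inf")
    return {"ttft_s": ttft, "decode_tokens_per_s": tps}

# Hypothetical run: first token after 0.25 s, 600 tokens done at 2.25 s.
m = latency_metrics(t_request=0.0, t_first_token=0.25,
                    t_done=2.25, n_output_tokens=600)
```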
Just FYI: your textbox seems to suppress autocorrect on iOS (at least for me); you might want to fix that.
I'd pay quite a bit of money to have a Mixtral box at home, then we'd all have our own, local assistant/helper/partner/whatever. Basically, the plot of the movie Her.
Or is it a completely custom ASIC?
"I am building an api in spring boot that persists users documents. This would be for an hr system. There are folders, and documents, which might have very sensitive data. I will need somewhere to store metadata about those documents. I was thinking of using postgres for the emtadata, and s3 for the actual documents. Any better ideas? or off the shelf libraries for this?"
Both were at about parity, except Groq suggested using the Spring Cloud Storage library, which GPT-4 did not. It turns out that library might be great for my use case. I think OpenAI's days are numbered; the pressure on them to release the next-gen model is very high.
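The Postgres-for-metadata, S3-for-blobs split described in the prompt is a common pattern regardless of framework. A minimal Python sketch of the flow, with sqlite3 and a dict standing in for Postgres and S3 (the schema and names are illustrative assumptions, not from either model's answer):

```python
import hashlib
import sqlite3
import uuid

# Stand-ins so the sketch is self-contained:
# sqlite3 in place of Postgres, a dict in place of an S3 bucket.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE document (
        id         TEXT PRIMARY KEY,
        folder     TEXT NOT NULL,
        filename   TEXT NOT NULL,
        s3_key     TEXT NOT NULL,
        sha256     TEXT NOT NULL,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
fake_s3 = {}

def store_document(folder, filename, content: bytes):
    """Write the blob to object storage, then record metadata."""
    doc_id = str(uuid.uuid4())
    # Key objects by id, not user-supplied names, to avoid collisions
    # and path-injection issues with sensitive HR data.
    s3_key = f"{folder}/{doc_id}"
    fake_s3[s3_key] = content  # real S3: s3.put_object(Bucket=..., Key=s3_key, Body=content)
    db.execute(
        "INSERT INTO document (id, folder, filename, s3_key, sha256)"
        " VALUES (?, ?, ?, ?, ?)",
        (doc_id, folder, filename, s3_key,
         hashlib.sha256(content).hexdigest()),
    )
    return doc_id

doc_id = store_document("hr/contracts", "offer.pdf", b"%PDF- fake bytes")
```

Storing a content hash alongside the metadata makes it cheap to detect duplicate uploads and to verify blob integrity later.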
Not only that, but GPT-4 is quite slow, often times out, etc. These responses are so much faster, which really does matter.