fulafel OP 2mo ago

So it's not measuring output tokens/s, just how long it takes to start generating tokens. Seems we'll have to wait for independent benchmarks to get useful numbers.

dotancohen 2mo ago

For many workflows involving real-time human interaction, such as voice assistants, this is the most important metric. Very few tasks are as sensitive to quality, once a certain response quality threshold has been achieved, as the software planning and writing tasks that most HN readers are likely familiar with.

raw_anon_1111 2mo ago

The way that voice assistants work, even in the age of LLMs, is:

Voice -> speech to text -> LLM to determine intent -> JSON -> API call -> response -> LLM -> text to speech.

TTFT is irrelevant: you have to process everything through the pipeline before you can generate a response. A fast model is more important than a good model.

Source: I do this kind of stuff for call centers. Yes, I know modern LLMs don’t have to go through voice -> text -> LLM -> text -> voice anymore, but that only works when you don’t have to call external sources.
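To make the latency argument concrete, here is a rough sketch of the pipeline as a sum of sequential stages. All of the numbers and names (`LLMProfile`, the stage timings, the token counts) are made-up assumptions, and it treats both LLM calls as blocking until the last token, which is the situation described above for the JSON step.

```python
from dataclasses import dataclass


@dataclass
class LLMProfile:
    """Hypothetical latency profile for one model; all numbers are made up."""
    name: str
    ttft_s: float        # time to first token, in seconds
    tokens_per_s: float  # steady-state decode speed


def full_response_latency(model: LLMProfile, output_tokens: int) -> float:
    """Time until the LAST token arrives, which is what a JSON parser needs."""
    return model.ttft_s + output_tokens / model.tokens_per_s


def pipeline_latency(model: LLMProfile, json_tokens: int, reply_tokens: int,
                     stt_s: float = 0.3, api_s: float = 0.4,
                     tts_s: float = 0.3) -> float:
    """Sum of sequential stages: STT -> intent LLM -> API -> reply LLM -> TTS."""
    intent = full_response_latency(model, json_tokens)
    reply = full_response_latency(model, reply_tokens)
    return stt_s + intent + api_s + reply + tts_s


quick_start = LLMProfile("low TTFT, slow decode", ttft_s=0.1, tokens_per_s=20.0)
fast_decode = LLMProfile("higher TTFT, fast decode", ttft_s=0.3, tokens_per_s=100.0)

for m in (quick_start, fast_decode):
    total = pipeline_latency(m, json_tokens=60, reply_tokens=40)
    print(f"{m.name}: {total:.2f} s")
```

With these assumed numbers the model with the worse TTFT still finishes the whole pipeline sooner, because the decode time for the JSON and the reply dominates.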

garbageman 2mo ago

I'm curious, what does 'determine intent' mean in this case?

raw_anon_1111 2mo ago

An “intent” is something that a person wants to do - set a timer, get directions, etc.

A “slot” is the variable part of an intent. For instance, “I want directions to 555 MockingBird Lane” would trigger a Directions intent that requires where you are coming from and where you are going. Of course, in that case it would assume your current location as the starting point.

Back in the pre-LLM days, and this is the way Siri still works, someone had to manually list all of the different “utterances” that should trigger the intent (“Take me to {x}”, “I want to go to {x}”) in every supported language. They also had to write follow-up phrases, so that if someone just said something like “I need directions”, the assistant would ask something like “Where are you trying to go?”.
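As a minimal sketch of that older template approach (not how any particular assistant implements it), the hand-written utterances can be compiled into regular expressions with one named group per slot. The intent names, templates, and helper functions below are hypothetical.

```python
import re

# Hypothetical intents and utterance templates; "{name}" marks a slot.
UTTERANCES = {
    "Directions": [
        "take me to {destination}",
        "i want to go to {destination}",
        "i want directions to {destination}",
    ],
    "SetTimer": ["set a timer for {duration}"],
}


def template_to_regex(template: str) -> re.Pattern:
    """Turn 'take me to {destination}' into a regex with a named group."""
    # re.split with a capture group alternates literal text and slot names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = "".join(
        f"(?P<{p}>.+)" if i % 2 else re.escape(p) for i, p in enumerate(parts)
    )
    return re.compile(rf"^{pattern}$", re.IGNORECASE)


COMPILED = [
    (intent, template_to_regex(t))
    for intent, templates in UTTERANCES.items()
    for t in templates
]


def match_intent(text: str):
    """Return (intent, slots) for the first matching template, else None."""
    for intent, rx in COMPILED:
        m = rx.match(text.strip())
        if m:
            return intent, m.groupdict()
    return None


print(match_intent("Take me to 555 MockingBird Lane"))
print(match_intent("I need directions"))  # no template matches this phrasing
```

The second call returns `None`, which is exactly the gap the hand-written follow-up phrases had to cover.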

Now you can do that with an LLM and some prompting. The LLM keeps going back and forth with the user until all of the slots are filled. You tell it to create a JSON response once it has all of the information your API needs, and then you call your API.

This is what a prompt would look like for a book-a-flight tool:

https://chatgpt.com/share/69a7d19f-494c-8010-8e9e-4e450f0bf0...

You also get the benefit that this works in any language, not just English.
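The back-and-forth until all slots are filled can be sketched roughly like this. `fake_llm_extract` is a stand-in I made up so the example runs offline: it only parses `key=value` tokens, where a real system would prompt a model. The slot names and the `BookFlight` intent are assumptions too, not taken from the linked prompt.

```python
import json

REQUIRED_SLOTS = ("origin", "destination", "date")  # hypothetical flight slots


def fake_llm_extract(user_text: str) -> dict:
    """Stand-in for an LLM call that returns any slots found in the text.

    This stub only understands 'key=value' tokens; a real system would
    send the conversation to a model and parse its structured output.
    """
    slots = {}
    for token in user_text.split():
        if "=" in token:
            key, value = token.split("=", 1)
            if key in REQUIRED_SLOTS and value:
                slots[key] = value
    return slots


def run_dialog(user_turns):
    """Consume user turns until every slot is filled, then build the JSON."""
    filled = {}
    questions = []
    for turn in user_turns:
        filled.update(fake_llm_extract(turn))
        missing = [s for s in REQUIRED_SLOTS if s not in filled]
        if not missing:
            payload = json.dumps(
                {"intent": "BookFlight", "slots": filled}, sort_keys=True
            )
            return payload, questions
        # Ask a follow-up for the first slot that is still empty.
        questions.append(f"What is the {missing[0]}?")
    return None, questions  # ran out of turns before all slots were filled


payload, questions = run_dialog(["destination=LIS", "origin=TLV", "date=2025-06-01"])
print(questions)
print(payload)  # this JSON is what you would hand to your API
```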

Art9681 2mo ago

It's going to be faster no matter what. My M3 Max prints tokens faster than I can read with the new MoE models. It's the prompt processing that kills it once the context grows beyond a threshold, which is easy to hit in modern agentic loops.

fulafel OP 2mo ago

If your computer were faster at it, you could run more capable models at the same token rate.

oofbey 2mo ago

Tokens/s is entirely determined by memory bandwidth. TTFT is compute-bound.
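A back-of-envelope version of that claim, as a sketch: decode speed is bounded by how fast the active weights can be read once per token, and prompt processing by roughly 2 FLOPs per parameter per prompt token. The hardware and model numbers are hypothetical, and the estimate ignores attention cost, KV cache traffic, and batching.

```python
def decode_tokens_per_s(active_weight_bytes: float,
                        mem_bandwidth_bytes_s: float) -> float:
    """Upper bound on single-stream decode speed: each generated token
    requires reading the active weights from memory once."""
    return mem_bandwidth_bytes_s / active_weight_bytes


def prefill_seconds(active_params: float, prompt_tokens: int,
                    flops_per_s: float) -> float:
    """Lower bound on prompt processing time, assuming about 2 FLOPs
    per parameter per prompt token."""
    return 2.0 * active_params * prompt_tokens / flops_per_s


# Hypothetical machine and model, chosen only to make the arithmetic easy.
params = 8e9            # 8B active parameters
bytes_per_param = 0.5   # 4-bit quantization
bandwidth = 400e9       # 400 GB/s memory bandwidth
compute = 20e12         # 20 TFLOP/s of usable compute

print(decode_tokens_per_s(params * bytes_per_param, bandwidth))  # tokens/s
print(prefill_seconds(params, 32_000, compute))                  # seconds
```

Under these assumptions decode comes out at about 100 tokens/s while a 32k-token prompt takes about 25.6 s to process, which matches the experience that generation feels fast but long contexts hurt.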

fulafel OP 2mo ago

This is broadly correct for currently favoured software, but in computer science optimization problems you can usually trade off compute for memory and vice versa.

For example just now from the front page: https://news.ycombinator.com/item?id=47242637 "Speculative Speculative Decoding"

Or this: https://openreview.net/forum?id=960Ny6IjEr "Low-Rank Compression of Language Models Via Differentiable Rank Selection"

oofbey 2mo ago

Good point on speculative decoding techniques. I'd forgotten about them, and they're good. Would love to see some of these get into llama.cpp and friends, but it does require somebody to come up with a distilled draft model.
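For anyone who hasn't seen it, here is a toy sketch of the greedy variant of speculative decoding. It is not the method from the linked submission: a cheap draft model proposes a few tokens, the target model checks them, and the first mismatch is replaced with the target's own token. The integer "models" are purely illustrative, and real implementations verify all proposed positions in one batched forward pass and usually gain a bonus token when everything is accepted.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain one-token-at-a-time greedy decoding, used as the reference."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding; output matches greedy_decode(target, ...)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. The cheap draft model proposes up to k tokens.
        proposal = list(seq)
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(proposal))
        # 2. The target checks each proposed position in order.
        for i in range(len(seq), len(proposal)):
            expected = target(proposal[:i])
            if proposal[i] != expected:
                # 3. First mismatch: keep the target's token, drop the rest.
                proposal = proposal[:i] + [expected]
                break
        seq = proposal
    return seq


# Toy models over integer tokens (purely illustrative).
def target(seq: List[int]) -> int:
    return (seq[-1] * 3 + len(seq)) % 11


def draft(seq: List[int]) -> int:
    # Agrees with the target except at every fifth position.
    return target(seq) if len(seq) % 5 else 0


print(speculative_decode(target, draft, [1, 2], 20)
      == greedy_decode(target, [1, 2], 20))
```

The output is identical to plain greedy decoding with the target; the gain in a real system comes from checking several draft tokens per expensive target pass.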

But low-rank compression isn't trading off compute for memory - it's just compressing the model. And critically, that's lossy compression. That's primarily a trade-off of quality for speed/size, with a little bit of added compute. Same goals as quantization. If there were some compute-intensive lossless compression of parameters, lots of people would be happy. But those floating-point values look a lot like Gaussian noise, making them extremely difficult to compress.
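A quick way to see the lossless-compression point, as a rough sketch: pack Gaussian samples as float32 and run them through zlib. Real weights are not exactly Gaussian and zlib is only a generic compressor, so treat the sample count, the 0.02 standard deviation, and the comparison data as assumptions for illustration.

```python
import random
import struct
import zlib


def compression_ratio(values) -> float:
    """Compressed size divided by raw size for numbers stored as float32."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return len(zlib.compress(raw, 9)) / len(raw)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 50_000
noise_like = [rng.gauss(0.0, 0.02) for _ in range(n)]  # stand-in for weights
structured = [float(i % 16) for i in range(n)]         # highly regular data

print(f"gaussian float32:   {compression_ratio(noise_like):.2f}")
print(f"structured float32: {compression_ratio(structured):.2f}")
```

The Gaussian data should barely shrink because the mantissa bits are close to random, while the regular data collapses to a small fraction of its size.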

saagarjha 2mo ago

None of these really change the fundamental shape of the problem.
