Voice —> Speech to Text -> LLM to determine intent -> JSON -> API call -> response -> LLM -> text to speech.
TTFT is irrelevant, you have to process everything through the pipeline before you can generate a response. A fast model is more important than a good model
Source: I do this kind of stuff for call centers. Yes I know modern LLMs don’t go through the voice -> text -> LLM -> text -> voice anymore. But that only works when you don’t have to call external sources
For example just now from the front page: https://news.ycombinator.com/item?id=47242637 "Speculative Speculative Decoding"
Or this: https://openreview.net/forum?id=960Ny6IjEr "Low-Rank Compression of Language Models Via Differentiable Rank Selection"
But low rank compression isn't trading off compute for memory - it's just compressing the model. And critically, that's lossy compression. That's primarily a trade-off of quality for speed/size, with a little bit of added compute. Same goals as quantization. If there was some compute-intensive lossless compression of parameters, lots of people would be happy. But those floating point values look a lot like gaussian noise, making them extremely difficult to compress.