It’s definitely multimodal input. Passing CLIP embeddings to an LLM is nothing new, and that’s really all you need for document understanding. It’s almost certainly the same story for audio: they would have trained a dual encoder that maps both audio and text into a shared embedding space.
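For concreteness, here’s roughly what I mean by “passing CLIP embeddings to an LLM” — a minimal LLaVA-style sketch, not OpenAI’s actual architecture, where a small learned projection maps frozen CLIP patch embeddings into the LLM’s token-embedding space (class names and dimensions below are made up for illustration):

```python
import torch
import torch.nn as nn

# Illustrative sizes only (assumptions, not anyone's real config)
CLIP_DIM = 768    # assumed CLIP embedding size
LLM_DIM = 4096    # assumed LLM hidden size

class VisionToLLMProjector(nn.Module):
    """Projects frozen CLIP embeddings into the LLM's token-embedding space."""
    def __init__(self, clip_dim: int = CLIP_DIM, llm_dim: int = LLM_DIM):
        super().__init__()
        # Often just a linear layer or small MLP is the only part that's trained
        self.proj = nn.Linear(clip_dim, llm_dim)

    def forward(self, clip_embeddings: torch.Tensor) -> torch.Tensor:
        # clip_embeddings: (batch, num_patches, clip_dim) from a frozen CLIP encoder
        return self.proj(clip_embeddings)  # (batch, num_patches, llm_dim)

# The projected "image tokens" get concatenated with ordinary text token
# embeddings and fed to the LLM as one sequence.
projector = VisionToLLMProjector()
clip_out = torch.randn(1, 257, CLIP_DIM)   # stand-in for CLIP patch embeddings
text_emb = torch.randn(1, 32, LLM_DIM)     # stand-in for embedded text tokens
llm_input = torch.cat([projector(clip_out), text_emb], dim=1)
print(llm_input.shape)  # torch.Size([1, 289, 4096])
```

The audio case would look the same on the input side, just with an audio encoder in place of CLIP.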
What’s not at all clear to me is whether they’re doing something special for output. Are you saying OpenAI has moved beyond next-token prediction and just hasn’t bothered to mention it?