Perhaps for audio and video, "native" means integrating the spoken sound directly (audio -> LLM) rather than first transcribing the sound to text and feeding that text to the LLM (audio -> text -> LLM).
But to be honest I'm guessing here; perhaps LLM experts (or an LLM itself, since they claim capabilities comparable to human experts) can verify whether this is truly what's meant by a native multimodal LLM.
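The distinction I have in mind can be sketched roughly like this (every helper here is a stand-in for illustration, not a real API; assuming the cascade loses non-textual information at the transcription step):

```python
# Purely illustrative sketch of the two pipelines; all helpers are stubs.

def speech_to_text(waveform):
    # Stand-in ASR step: a real system would transcribe the waveform.
    # Everything except the words (tone, prosody, speaker) is dropped here.
    return "transcribed words only"

def audio_encoder(waveform):
    # Stand-in audio tokenizer: a real system would map the waveform to
    # tokens or embeddings that can carry non-textual cues.
    return [f"<audio_tok_{i}>" for i in range(3)]

def llm_generate(inputs):
    # Stand-in LLM call; just echoes what representation it received.
    return f"LLM saw: {inputs}"

def cascade_pipeline(waveform):
    # audio -> text -> LLM: the model only ever sees the transcript.
    return llm_generate(speech_to_text(waveform))

def native_pipeline(waveform):
    # audio -> LLM: the model consumes the audio representation directly.
    return llm_generate(audio_encoder(waveform))
```

If that reading is right, the practical difference is that the cascade can never answer questions like "does the speaker sound sarcastic?", because that signal is gone before the LLM is involved.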