To clarify: while you can enable transcription to read Ichigo's replies, Ichigo's design feeds audio directly into speech representations without ever transcribing the user's input to text. This keeps interactions fast, but it does mean there is no text record of what the user said.
The flow we use is Speech → Encoder → Speech Representations → LLM → Text → TTS. By skipping the input transcription step, we speed things up and keep the focus on the verbal experience.
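If it helps, here's a toy sketch of that flow. Every function here is a made-up stand-in (not Ichigo's actual API) just to illustrate where text does and doesn't appear in the pipeline:

```python
# Toy sketch of the pipeline: Speech -> Encoder -> Speech Reps -> LLM -> Text -> TTS.
# All names below are hypothetical stand-ins, not the real implementation.

def encode_speech(audio_samples):
    """Stand-in encoder: maps raw audio to discrete speech tokens.
    A real system would use a trained, quantized speech encoder;
    here we just bucket the sample values."""
    return [int(s * 4) % 8 for s in audio_samples]

def llm_generate_text(speech_tokens):
    """Stand-in LLM: consumes speech tokens directly (no ASR text of the
    user's input is ever produced) and emits a text reply."""
    return f"reply to {len(speech_tokens)} speech tokens"

def tts(text):
    """Stand-in TTS: turns the reply text into audio (a dummy waveform here)."""
    return [0.0] * len(text)

def respond(audio_samples):
    tokens = encode_speech(audio_samples)   # no transcription of user input
    text = llm_generate_text(tokens)        # the only text in the pipeline:
    audio_out = tts(text)                   # the model's own reply
    return text, audio_out

text, audio = respond([0.1, 0.5, 0.9])
print(text)
```

Note that the only text in this flow is the model's reply (which is what the transcription option shows you); the user's speech goes straight from encoder tokens into the LLM.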
Hope this clears things up!