To clarify: while you can enable transcription to read Ichigo's replies, Ichigo's design feeds audio directly into speech representations without ever transcribing the user's input to text. This keeps interactions fast, but it does mean there is no text record of what the user said.
The flow we use is Speech → Encoder → Speech Representations → LLM → Text → TTS. By skipping the input transcription step, we speed things up and keep the focus on the verbal experience.
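If it helps, here's a toy sketch of that flow. Every function here is a made-up stand-in (not Ichigo's actual API) just to illustrate where text does and doesn't appear in the pipeline:

```python
# Toy sketch of the pipeline: Speech -> Encoder -> Speech Reps -> LLM -> Text -> TTS.
# All names below are hypothetical stand-ins, not the real implementation.

def encode_speech(audio_samples):
    """Stand-in encoder: maps raw audio to discrete speech tokens.
    A real system would use a trained, quantized speech encoder;
    here we just bucket the sample values."""
    return [int(s * 4) % 8 for s in audio_samples]

def llm_generate_text(speech_tokens):
    """Stand-in LLM: consumes speech tokens directly (no ASR text of the
    user's input is ever produced) and emits a text reply."""
    return f"reply to {len(speech_tokens)} speech tokens"

def tts(text):
    """Stand-in TTS: turns the reply text into audio (a dummy waveform here)."""
    return [0.0] * len(text)

def respond(audio_samples):
    tokens = encode_speech(audio_samples)   # no transcription of user input
    text = llm_generate_text(tokens)        # the only text in the pipeline:
    audio_out = tts(text)                   # the model's own reply
    return text, audio_out

text, audio = respond([0.1, 0.5, 0.9])
print(text)
```

Note that the only text in this flow is the model's reply (which is what the transcription option shows you); the user's speech goes straight from encoder tokens into the LLM.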
Hope this clears things up!