> ML multi-speaker speech-to-text every conversation
I honestly love this idea. Any suggestions on what FOSS tooling to use for said speech-to-text that's reasonably accurate? Or is training the ML the "heavy lift" of this setup?
Google has an API for this. Speech is a thing you really need big data for, and thus IMO not suitable for FOSS. Why bother setting up a whole custom data pipeline when plug-and-play is available for a fraction of the cost?