For an offline (non-streaming) model, 1x realtime (processing time equal to audio duration) is actually kind of bad, because you need to wait for the audio to be available before you can start processing it. So if you wait 10 seconds for someone to finish speaking, you won't have the result until 10 seconds after that — roughly 20 seconds after they started talking.
You could use really small chunk sizes and process them in a streaming fashion, but that would hurt accuracy, since each chunk sees far less surrounding context.