I don’t think you’re going to have a good time running the large model on a Pi of any kind.
The large models are 32x slower than the tiny models, roughly.[0]
I just tested, and whisper.cpp on my Pi 4 can transcribe the 30-second a13.wav sample (“make samples” to fetch it) in 18.5 seconds.
You can do the math: 18.5 s × 32 ≈ 10 minutes to transcribe 30 seconds of audio with the large model. Not a good time for most people.
The Pi 5 could be 2x to 3x faster.
[0]: https://github.com/openai/whisper/blob/main/README.md#availa...
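For intuition, the back-of-envelope estimate looks like this (the timings come from the test above; the 32x factor is a rough tiny-to-large ratio from the Whisper README):

```python
# Rough scaling estimate: how long the large model might take on a Pi 4,
# extrapolated from a measured tiny-model run. All numbers are approximate.
audio_seconds = 30.0   # length of the a13.wav sample
tiny_seconds = 18.5    # measured transcription time with the tiny model
large_factor = 32      # rough tiny -> large slowdown

large_seconds = tiny_seconds * large_factor
print(f"large model: ~{large_seconds / 60:.1f} min for {audio_seconds:.0f} s of audio")
print(f"that's ~{large_seconds / audio_seconds:.0f}x slower than real time")
```

So even a 2x to 3x faster Pi 5 still lands at several minutes per 30-second clip.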
Nitpick, but important: Whisper v2 and v3 are large-only. It's the same Whisper architecture; only the large checkpoint (large-v2, large-v3) has been updated.
All of the other model sizes are from the original release.
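A quick way to see this is to list the checkpoint names per size (a sketch as of the v3 release; ".en" and "turbo" variants omitted for brevity):

```python
# Whisper checkpoint names by model size, illustrating that only "large"
# has versioned releases; the smaller sizes are still the original models.
checkpoints = {
    "tiny":   ["tiny"],
    "base":   ["base"],
    "small":  ["small"],
    "medium": ["medium"],
    "large":  ["large-v1", "large-v2", "large-v3"],
}

versioned = [size for size, names in checkpoints.items() if len(names) > 1]
print(versioned)  # only "large" has updated versions
```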
The Holy Grail would be to train the model while using it, without any friction. I don't think these methods support that, though.
The bigger takeaway is that we're close to being able to train/fine-tune models with much better performance by accessing vastly more data on the edge, in a federated way.
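As a sketch of the federated idea: each edge device trains locally on its own data, and only model updates are shared and averaged by a server, so the raw data never leaves the device. This is plain federated averaging (FedAvg), not any specific method from the thread; the toy model and data are illustrative:

```python
# Minimal federated averaging (FedAvg) sketch with a 1-parameter model.
# Each client does one local gradient step of least-squares fitting,
# standing in for a real on-device fine-tuning step.

def local_update(weight, local_data, lr=0.1):
    # One gradient step minimizing sum((weight * x - y)^2) on local data.
    grad = sum(2 * (weight * x - y) * x for x, y in local_data) / len(local_data)
    return weight - lr * grad

def fedavg_round(global_w, clients):
    # Server sends global weight out, gets locally updated weights back,
    # and averages them into the next global weight.
    updates = [local_update(global_w, data) for data in clients]
    return sum(updates) / len(updates)

# Three "devices", each with private samples of roughly y = 2x.
clients = [
    [(1.0, 2.1), (2.0, 4.2)],
    [(1.0, 1.9), (3.0, 6.1)],
    [(2.0, 3.9), (4.0, 8.0)],
]

w = 0.0
for _ in range(50):
    w = fedavg_round(w, clients)
print(round(w, 2))  # converges near the shared slope of ~2
```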