I don’t think you’re going to have a good time running the large model on a Pi of any kind.
The large models are 32x slower than the tiny models, roughly.[0]
I just tested, and whisper.cpp on my Pi 4 can transcribe the 30-second a13.wav sample (“make samples” to fetch it) in 18.5 seconds.
You can do the math: 18.5 s × 32 ≈ 10 minutes to transcribe 30 seconds of audio with the large model. Not a good time for most people.
The Pi 5 could be 2x to 3x faster.
[0]: https://github.com/openai/whisper/blob/main/README.md#availa...
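For intuition, the back-of-envelope estimate looks like this (the timings come from the test above; the 32x factor is a rough tiny-to-large ratio from the Whisper README):

```python
# Rough scaling estimate: how long the large model might take on a Pi 4,
# extrapolated from a measured tiny-model run. All numbers are approximate.
audio_seconds = 30.0   # length of the a13.wav sample
tiny_seconds = 18.5    # measured transcription time with the tiny model
large_factor = 32      # rough tiny -> large slowdown

large_seconds = tiny_seconds * large_factor
print(f"large model: ~{large_seconds / 60:.1f} min for {audio_seconds:.0f} s of audio")
print(f"that's ~{large_seconds / audio_seconds:.0f}x slower than real time")
```

So even a 2x to 3x faster Pi 5 still lands at several minutes per 30-second clip.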
Nitpick, but important: Whisper v2 and v3 are large-only. It's the same Whisper architecture; only the large checkpoint (large-v2, large-v3) has been updated.
All of the other model sizes are from the original release.
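A quick way to see this is to list the checkpoint names per size (a sketch as of the v3 release; ".en" and "turbo" variants omitted for brevity):

```python
# Whisper checkpoint names by model size, illustrating that only "large"
# has versioned releases; the smaller sizes are still the original models.
checkpoints = {
    "tiny":   ["tiny"],
    "base":   ["base"],
    "small":  ["small"],
    "medium": ["medium"],
    "large":  ["large-v1", "large-v2", "large-v3"],
}

versioned = [size for size, names in checkpoints.items() if len(names) > 1]
print(versioned)  # only "large" has updated versions
```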
The Holy Grail would be to train the model while using it, without any friction. I don't think these methods support that, though.
The bigger takeaway is that we're close to being able to train/fine-tune models with much better performance by accessing vastly more data on the edge, in a federated way.
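As a sketch of the federated idea: each edge device trains locally on its own data, and only model updates are shared and averaged by a server, so the raw data never leaves the device. This is plain federated averaging (FedAvg), not any specific method from the thread; the toy model and data are illustrative:

```python
# Minimal federated averaging (FedAvg) sketch with a 1-parameter model.
# Each client does one local gradient step of least-squares fitting,
# standing in for a real on-device fine-tuning step.

def local_update(weight, local_data, lr=0.1):
    # One gradient step minimizing sum((weight * x - y)^2) on local data.
    grad = sum(2 * (weight * x - y) * x for x, y in local_data) / len(local_data)
    return weight - lr * grad

def fedavg_round(global_w, clients):
    # Server sends global weight out, gets locally updated weights back,
    # and averages them into the next global weight.
    updates = [local_update(global_w, data) for data in clients]
    return sum(updates) / len(updates)

# Three "devices", each with private samples of roughly y = 2x.
clients = [
    [(1.0, 2.1), (2.0, 4.2)],
    [(1.0, 1.9), (3.0, 6.1)],
    [(2.0, 3.9), (4.0, 8.0)],
]

w = 0.0
for _ in range(50):
    w = fedavg_round(w, clients)
print(round(w, 2))  # converges near the shared slope of ~2
```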