At a high level, here is how it works for us:
0. When the voice assistant device (an ESP32) starts, it establishes a WebSocket connection to the server.
1. The ESP32 chip constantly runs wake-word detection (ESP-IDF, Espressif's framework, provides one out of the box).
2. Whenever the wake word is detected (we trained a custom one, but you can use the ones Espressif provides), the chip starts sending audio packets to the backend over the WebSocket.
3. The backend collects audio frames until it detects silence (using voice activity detection in Python). As soon as the instruction is over, it tells the device to stop listening, then:
4. Passes all collected audio segments to speech recognition (Python with a custom wav2vec model). This gives us the text instruction.
5. Given the text instruction, you could run llama.cpp locally (or vLLM, if you have a GPU) or call a remote API; it depends on the system. We have a chain of LLM pipelines and RAG that composes our "business logic" across a bunch of AI skills. What's important is that there is a text response at the end.
6. Passes the text response to a text-to-speech model on the same machine and streams the output back to the edge device.
7. The edge device (ESP32) speaks the words, or plays an MP3 file if you sent it a URL.
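To make step 3 concrete, here is a minimal sketch of how the backend might buffer incoming frames and decide the instruction is over. It uses a naive energy threshold instead of a real VAD library, and all names and thresholds (`FRAME_MS`, `ENERGY_THRESHOLD`, etc.) are illustrative assumptions, not our actual code:

```python
# Hypothetical end-of-utterance detection for step 3: buffer frames
# until energy stays below a threshold for enough consecutive frames.
import struct

FRAME_MS = 30           # assumed frame duration sent by the ESP32
SILENCE_FRAMES = 20     # ~600 ms of sustained silence ends the utterance
ENERGY_THRESHOLD = 500  # tune per microphone / gain

def frame_energy(frame: bytes) -> float:
    """Mean absolute amplitude of a 16-bit little-endian PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return sum(abs(s) for s in samples) / max(len(samples), 1)

def collect_utterance(frames):
    """Consume frames until sustained silence; return the raw utterance."""
    collected, silent_run = [], 0
    for frame in frames:
        collected.append(frame)
        if frame_energy(frame) < ENERGY_THRESHOLD:
            silent_run += 1
            if silent_run >= SILENCE_FRAMES:
                break  # instruction is over -> tell the device to stop
        else:
            silent_run = 0  # speech resumed, reset the silence counter
    return b"".join(collected)
```

In production you would swap `frame_energy` for a proper VAD (e.g. a Python VAD library) and run this inside the WebSocket handler, but the buffering logic stays the same.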
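Steps 4–6 are just a function chain, which is easy to keep swappable (local llama.cpp vs. vLLM vs. a remote API). A hypothetical sketch of that orchestration, with the three stages injected as callables (the names `transcribe`/`respond`/`synthesize` are mine, not from our codebase):

```python
# Hypothetical glue for steps 4-6: speech -> text -> LLM -> streamed TTS.
from dataclasses import dataclass
from typing import Callable, Iterator

@dataclass
class VoicePipeline:
    transcribe: Callable[[bytes], str]            # step 4: e.g. a wav2vec model
    respond: Callable[[str], str]                 # step 5: llama.cpp / vLLM / remote API
    synthesize: Callable[[str], Iterator[bytes]]  # step 6: TTS, chunked for streaming

    def handle_utterance(self, audio: bytes) -> Iterator[bytes]:
        text = self.transcribe(audio)       # text instruction
        reply = self.respond(text)          # "business logic" produces a text response
        yield from self.synthesize(reply)   # stream audio chunks back over the WebSocket
```

The payoff is that the edge device never changes: it only ever sees audio in and audio out, and you can rewire the middle stages freely.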
Does this help?