At a high level, here is how it works for us:
0. When the voice assistant device (an ESP32) starts, it establishes a WebSocket connection to the server.
1. The ESP32 chip constantly runs wake-word detection (ESP-IDF, Espressif's framework, provides one out of the box).
2. Whenever the wake word is detected (we trained a custom one, but you can use the ones Espressif provides), the chip starts sending audio packets to the backend over the WebSocket.
3. The backend collects audio frames until it detects silence (using voice activity detection in Python). As soon as the instruction is over, it tells the device to stop listening, then:
4. Passes all collected audio segments to speech recognition (Python with a custom wav2vec model). This gives us the text instruction.
5. Given the text instruction, you could run llama.cpp locally (or vLLM, if you have a GPU) or call a remote API; it depends on the system. We have a chain of LLM pipelines and RAG that composes our "business logic" across a bunch of AI skills. What's important is that there is a text response at the end.
6. Passes the text response to a text-to-speech model on the same machine and streams the output back to the edge device.
7. The edge device (ESP32) speaks the words, or plays an MP3 file if you sent it a URL.
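To make step 3 concrete, here is a minimal sketch of how the backend might buffer incoming frames and decide the instruction is over. It uses a naive energy threshold instead of a real VAD library, and all names and thresholds (`FRAME_MS`, `ENERGY_THRESHOLD`, etc.) are illustrative assumptions, not our actual code:

```python
# Hypothetical end-of-utterance detection for step 3: buffer frames
# until energy stays below a threshold for enough consecutive frames.
import struct

FRAME_MS = 30           # assumed frame duration sent by the ESP32
SILENCE_FRAMES = 20     # ~600 ms of sustained silence ends the utterance
ENERGY_THRESHOLD = 500  # tune per microphone / gain

def frame_energy(frame: bytes) -> float:
    """Mean absolute amplitude of a 16-bit little-endian PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return sum(abs(s) for s in samples) / max(len(samples), 1)

def collect_utterance(frames):
    """Consume frames until sustained silence; return the raw utterance."""
    collected, silent_run = [], 0
    for frame in frames:
        collected.append(frame)
        if frame_energy(frame) < ENERGY_THRESHOLD:
            silent_run += 1
            if silent_run >= SILENCE_FRAMES:
                break  # instruction is over -> tell the device to stop
        else:
            silent_run = 0  # speech resumed, reset the silence counter
    return b"".join(collected)
```

In production you would swap `frame_energy` for a proper VAD (e.g. a Python VAD library) and run this inside the WebSocket handler, but the buffering logic stays the same.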
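Steps 4–6 are just a function chain, which is easy to keep swappable (local llama.cpp vs. vLLM vs. a remote API). A hypothetical sketch of that orchestration, with the three stages injected as callables (the names `transcribe`/`respond`/`synthesize` are mine, not from our codebase):

```python
# Hypothetical glue for steps 4-6: speech -> text -> LLM -> streamed TTS.
from dataclasses import dataclass
from typing import Callable, Iterator

@dataclass
class VoicePipeline:
    transcribe: Callable[[bytes], str]            # step 4: e.g. a wav2vec model
    respond: Callable[[str], str]                 # step 5: llama.cpp / vLLM / remote API
    synthesize: Callable[[str], Iterator[bytes]]  # step 6: TTS, chunked for streaming

    def handle_utterance(self, audio: bytes) -> Iterator[bytes]:
        text = self.transcribe(audio)       # text instruction
        reply = self.respond(text)          # "business logic" produces a text response
        yield from self.synthesize(reply)   # stream audio chunks back over the WebSocket
```

The payoff is that the edge device never changes: it only ever sees audio in and audio out, and you can rewire the middle stages freely.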
Does this help?