Depends on the machine, number of threads selected and the model checkpoint used (Vit-B or Vit-L or Vit-B). The video demo attached is running on Apple M2 Ultra and using the Vit-B model. The generation of the image embedding takes ~1.9s there and all the subsequent mask segmentations take ~45ms.
However, I am now focusing on improving the inference speed by making better use of ggml and trying out quantization. Once I make some progress in this direction I will compare to other SAM alternatives and benchmark more thoroughly.
We’ve been working on using them (often in conjunction with SAM) for auto-labeling datasets to train smaller faster models that can run in real-time at the edge: https://github.com/autodistill/autodistill
Really enjoyed using this app for iOS: https://github.com/mazzzystar/Queryable HN discussion: https://news.ycombinator.com/item?id=34686947
Or explore the dataset stable diffusion was trained on: https://news.ycombinator.com/item?id=32655497
The Bark python model is very compute intensive and require a powerful GPU to get bearable inference speed. I really hope that bark.cpp with GPU/Metal support and quanticized model can bring useful inference speed on a laptop in the near future.
Next Step: Incorporate this library into image editors like Photopea (via WebAssembly) to boost the speed of common selection tasks. The magic wand is a tool of the past.
I'd pay for such a feature.
Another popular optimisation is to port models to WASM + GPU because it makes them easy to support a variety of platforms (desktop, mobile...) with a single API and it can still offer great performance (see Google's mediapipe as an exemple of that).
Python is really great for fast prototyping. It can be argued most AI products so far are result of fast prototyping. So not sure if there is anything wrong with that.
As practical models emerge, at that point it indeed makes sense to port them to C++. But I would not in my wildest dreams suggest prototyping a data model in C++ unless absolutely necessary.
Maybe the act of compilation is an extra step, but I'd much rather have my development be in a high level language that is very suited to experimentation, probing, and testing, and then compile the final result down to something performant.
EDIT: I don't know much about the IOT world, and Tensorflow is likely a bad example as it's not designed to run on edge. So, I could understand that things like llama.cpp, GGML and GGUF are making strides towards easier runtimes. But I still think for dev-time, Python makes sense!