Viseron is new to me though; that looks really cool.
For instance, it kept thinking the tree in my backyard is a person. I find it hilarious that it often assigns a higher likelihood of being a person to the tree than to me! I've had to put a detection mask over the tree as a last resort.
It looks like either Frigate or Viseron will do what I want. I started setting up Frigate, but realized I should downgrade my Reolink Duo 3 to a Duo 2 before I go too far. The Duo 3 really doesn't offer much better image quality, but it forces you to use H.265 and consumes a lot more bandwidth. Once I stabilize my camera setup I'll get back to setting up both Frigate and Viseron and see which performs better. I like that the pro upgrade of Frigate lets you customize the model, and I may make use of that.
For "edge" or embedded applications, an accelerator such as the Google Coral Edge TPU is a useful reference point: it is capable of up to 4 trillion operations per second (4 TOPS) at up to 2 watts of power consumption (2 TOPS/W). However, the accelerator is limited to INT8 operations, and it has only around 8 MB of memory for model storage.
Meanwhile, a general-purpose or gaming GPU can support a wider range of instructions (single-precision and double-precision floating point, integer, etc.).
GeForce GTX 1060, for example: 4.375 TFLOPS (FP32) @ 120 W (https://www.techpowerup.com/gpu-specs/geforce-gtx-1060-6-gb....)
There are commercial-oriented products that are optimized for particular operations and precision.
Here's a blog post discussing Google's 1st-generation ASIC TPU used in its datacenters: https://cloud.google.com/blog/products/ai-machine-learning/a...
(92 TOPS @ 700 MHz, 40 W)
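Putting the three figures above side by side as efficiency (a back-of-envelope sketch; note the Coral and TPU v1 numbers are INT8 TOPS while the GTX 1060 number is FP32 TFLOPS, so the operations being counted are not the same kind):

```python
# Efficiency comparison from the figures quoted above.
# Caveat: INT8 ops (Coral, TPU v1) vs FP32 FLOPS (GTX 1060) are not
# directly comparable operation types.
accelerators = {
    "Coral Edge TPU":   (4.0,   2),    # 4 TOPS @ 2 W
    "GeForce GTX 1060": (4.375, 120),  # 4.375 TFLOPS (FP32) @ 120 W
    "Google TPU v1":    (92.0,  40),   # 92 TOPS @ 40 W
}

for name, (tops, watts) in accelerators.items():
    print(f"{name}: {tops / watts:.2f} T(FL)OPS/W")
# Coral ~2.0, GTX 1060 ~0.04, TPU v1 ~2.3
```

The datacenter TPU and the 2 W edge accelerator land in the same efficiency ballpark; the gaming GPU trades efficiency for flexibility.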
res = rest(ollama, {
    "model": "llava",
    "prompt": genprompt(box.name),
    "images": [box.export()],
    "stream": False
})
They are calling the Ollama API to run LLaVA. LLaVA is a combination of an LLM base model plus a vision projector (CLIP or ViT), and is usually around 4-8 GB. Since every generated token needs access to all of the model weights, you would have to send 4-8 GB through USB with the Coral. Even at a generous 10 Gbit/s (1.25 GB/s), that is 8 GB / 1.25 GB/s = 6.4 seconds per token. A 150-token (short paragraph) generation would take 16 minutes.

How many parameters is the model you are using with the Hailo? And what's the quantisation, and which model is it actually?
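That back-of-envelope can be written out (a sketch; the 8 GB model size and 10 Gbit/s link speed are the assumed figures from the comment):

```python
# Time-per-token estimate if all model weights had to stream over the
# link for every generated token (the USB/Coral scenario above).
model_size_gb = 8.0              # high end of the assumed 4-8 GB LLaVA size
link_gbit_s = 10.0               # generous USB figure
link_gbyte_s = link_gbit_s / 8   # 1.25 GB/s

seconds_per_token = model_size_gb / link_gbyte_s   # 6.4 s/token
tokens = 150                                       # a short paragraph
total_minutes = seconds_per_token * tokens / 60    # 16 min

print(f"{seconds_per_token:.1f} s/token -> {total_minutes:.0f} min for {tokens} tokens")
```

This is why memory bandwidth to the weights, not raw TOPS, is the binding constraint for LLM-style generation on small accelerators.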
- "person": "get gender and age of this person in 5 words or less",
- "car": "get body type and color of this car in 5 words or less".
So YOLO gives the bounding box and a rough category, while LLaVA describes the object in more detail.
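A minimal sketch of that two-stage flow, assuming Ollama's documented `/api/generate` REST endpoint for the LLaVA step (the detection/crop step is stubbed here, and the per-class prompts follow the pattern quoted above; none of this is the linked project's actual code):

```python
import base64
import json
import urllib.request

# Per-class prompts, following the pattern quoted above.
PROMPTS = {
    "person": "get gender and age of this person in 5 words or less",
    "car": "get body type and color of this car in 5 words or less",
}

def describe_crop(jpeg_bytes: bytes, cls: str) -> str:
    """Send one detected-object crop to LLaVA via Ollama, return the text."""
    payload = json.dumps({
        "model": "llava",
        "prompt": PROMPTS.get(cls, f"describe this {cls} in 5 words or less"),
        "images": [base64.b64encode(jpeg_bytes).decode()],
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",  # default Ollama endpoint
        data=payload, headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

if __name__ == "__main__":
    # In the real pipeline, YOLO supplies the class and bounding box,
    # and the crop is cut out of the frame before being sent here.
    with open("person_crop.jpg", "rb") as f:
        print(describe_crop(f.read(), "person"))
```

Running the heavy VLM only on YOLO-selected crops keeps the per-frame cost down: the VLM sees a handful of small images instead of every full frame.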
Some things that matter when it comes to configuring your IP cameras (beyond security, etc.):

- Support for RTSP

- Configurable encoding settings (e.g. H.264 codec, bitrate, i-frame interval, framerate)

- Support for substreams (i.e. a full-resolution main stream for recording, and at least one lower-resolution substream for preview/detection/etc.)
Make sure the hardware you select is capable of the above.
Configurability will matter because Identification is not the same as Detection (reference: "DORI" - Detection, Observation, Recognition, and Identification, from IEC EN 62676-4). If you want to successfully identify objects or entities using your cameras, it will require more care than basic Observation or Detection.
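The DORI levels are usually quoted as minimum horizontal pixel densities at the target (the commonly cited IEC EN 62676-4 figures are roughly 25 px/m for Detection up to 250 px/m for Identification). A quick sketch of checking a camera against them; the resolution and scene width used below are hypothetical example inputs:

```python
# Commonly cited DORI thresholds (pixels per meter at the target).
DORI = {"detection": 25, "observation": 62.5,
        "recognition": 125, "identification": 250}

def pixel_density(h_resolution_px: float, scene_width_m: float) -> float:
    """Horizontal pixels per meter across the scene at the target distance."""
    return h_resolution_px / scene_width_m

# Example: a 2560-px-wide stream covering a 12 m wide scene at the target.
density = pixel_density(2560, 12)   # ~213 px/m
for level, threshold in DORI.items():
    print(f"{level}: {'ok' if density >= threshold else 'NOT met'}")
# Recognition is met, Identification (250 px/m) is not - you'd need to
# zoom in (narrower scene) or use a higher-resolution sensor.
```

This is why a wide-angle overview camera rarely doubles as an identification camera, regardless of how good the detector is.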
If your budget supports commercial style or commercial grade cameras, looking at Dahua or Hikvision manufactured cameras would be a good starting point to get an idea of specs, features, and cost.
I’ve never implemented this kind of object persistence algo - is this a good approach? Seems naive but maybe that’s just because it’s simple.
This is the first time I've seen a "complete" setup. Any pointers for learning more about applying YOLO and similar models to real-time streams (whatever the format)?
There's a reason why there's a whole family of models from tiny to huge.
I'm on Windows.
Ideally I'd like stale frames to be dropped, so that inference always runs on the last received frame. Is this standard behaviour?
You really need to have a thread consuming the frames and feeding them to a worker that can run on its own clock.
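One common way to get that drop-stale-frames behaviour is a reader thread that overwrites a single shared slot, so the worker always picks up only the most recent frame (a minimal sketch using plain `threading`; the capture loop is stubbed with integers where `cv2.VideoCapture(...).read()` would go):

```python
import threading

class LatestFrame:
    """Single-slot buffer: the writer overwrites, the reader gets the newest."""
    def __init__(self):
        self._cond = threading.Condition()
        self._frame = None

    def put(self, frame):
        with self._cond:
            self._frame = frame   # previous unread frame is dropped, not queued
            self._cond.notify()

    def get(self):
        with self._cond:
            while self._frame is None:
                self._cond.wait()
            frame, self._frame = self._frame, None
            return frame

slot = LatestFrame()

def reader():
    # Stand-in for a capture loop reading frames as fast as they arrive.
    for i in range(100):
        slot.put(i)

t = threading.Thread(target=reader)
t.start()
t.join()

# The worker, running on its own clock, only ever sees the newest frame:
latest = slot.get()
print(latest)  # 99 - the 99 intermediate frames were overwritten, never queued
```

The worker then loops on `slot.get()` at whatever rate inference allows, and the backlog can never grow, unlike with an unbounded queue.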
I thought this topic (YOLO object recognition) would have a much bigger following; instead there are really only a few projects.
https://github.com/search?q=yolo+rtsp&type=repositories&s=fo...