Viseron is new to me though; that looks really cool.
For instance, it kept thinking the tree in my backyard is a person. I find it hilarious that it often assigns a higher likelihood of being a person to the tree than to me! I've had to put a detection mask over the tree as a last resort.
It looks like either Frigate or Viseron will do what I want. I started setting up Frigate, but realized I should downgrade my Reolink Duo 3 to a Duo 2 before I go too far. The Duo 3 really doesn't offer much better image quality, but it forces you to use H.265 and consumes a lot more bandwidth. Once I stabilize my camera setup I'll get back to setting up both Frigate and Viseron and see which performs better. I like that the pro upgrade of Frigate lets you customize the model, and I may make use of that.
For "edge" or embedded applications, an accelerator such as the Google Coral Edge TPU is a useful reference point: it is capable of up to 4 trillion operations per second (4 TOPS) at up to 2 watts of power consumption (2 TOPS/W). However, the accelerator is limited to INT8 operations, and it has only around 8 MB of memory for model storage.
Meanwhile, a general-purpose or gaming GPU can support a wider range of instructions (single-precision and double-precision floating point, integer, etc.).
GeForce GTX 1060, for example: 4.375 TFLOPS (FP32) @ 120 W (https://www.techpowerup.com/gpu-specs/geforce-gtx-1060-6-gb....)
There are commercial-oriented products that are optimized for particular operations and precision.
Here's a blog post discussing Google's 1st-generation ASIC TPU used in its datacenters: https://cloud.google.com/blog/products/ai-machine-learning/a...
(92 TOPS @ 700 MHz, 40 W)
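Putting the three figures above side by side as efficiency (a back-of-envelope sketch; note the Coral and TPU v1 numbers are INT8 TOPS while the GTX 1060 number is FP32 TFLOPS, so the operations being counted are not the same kind):

```python
# Efficiency comparison from the figures quoted above.
# Caveat: INT8 ops (Coral, TPU v1) vs FP32 FLOPS (GTX 1060) are not
# directly comparable operation types.
accelerators = {
    "Coral Edge TPU":   (4.0,   2),    # 4 TOPS @ 2 W
    "GeForce GTX 1060": (4.375, 120),  # 4.375 TFLOPS (FP32) @ 120 W
    "Google TPU v1":    (92.0,  40),   # 92 TOPS @ 40 W
}

for name, (tops, watts) in accelerators.items():
    print(f"{name}: {tops / watts:.2f} T(FL)OPS/W")
# Coral ~2.0, GTX 1060 ~0.04, TPU v1 ~2.3
```

The datacenter TPU and the 2 W edge accelerator land in the same efficiency ballpark; the gaming GPU trades efficiency for flexibility.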
res = rest(ollama, {
    "model": "llava",
    "prompt": genprompt(box.name),
    "images": [box.export()],
    "stream": False
})
They are calling the Ollama API to run LLaVA. LLaVA is a combination of an LLM base model plus a vision projector (CLIP or ViT), and is usually around 4-8 GB. Since every generated token needs access to all of the model weights, you would have to send 4-8 GB through USB with the Coral. Even at a generous 10 Gbit/s (1.25 GB/s), that is 8 GB / 1.25 GB/s = 6.4 seconds per token. A 150-token (short paragraph) generation would take 16 minutes.

How many parameters is the model you are using with the Hailo? And what's the quantisation, and which model is it actually?
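That back-of-envelope can be written out (a sketch; the 8 GB model size and 10 Gbit/s link speed are the assumed figures from the comment):

```python
# Time-per-token estimate if all model weights had to stream over the
# link for every generated token (the USB/Coral scenario above).
model_size_gb = 8.0              # high end of the assumed 4-8 GB LLaVA size
link_gbit_s = 10.0               # generous USB figure
link_gbyte_s = link_gbit_s / 8   # 1.25 GB/s

seconds_per_token = model_size_gb / link_gbyte_s   # 6.4 s/token
tokens = 150                                       # a short paragraph
total_minutes = seconds_per_token * tokens / 60    # 16 min

print(f"{seconds_per_token:.1f} s/token -> {total_minutes:.0f} min for {tokens} tokens")
```

This is why memory bandwidth to the weights, not raw TOPS, is the binding constraint for LLM-style generation on small accelerators.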
- "person": "get gender and age of this person in 5 words or less",
- "car": "get body type and color of this car in 5 words or less".
So YOLO gives the bounding box and a rough category, while LLaVA describes the object in more detail.
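A minimal sketch of that two-stage flow, assuming Ollama's documented `/api/generate` REST endpoint for the LLaVA step (the detection/crop step is stubbed here, and the per-class prompts follow the pattern quoted above; none of this is the linked project's actual code):

```python
import base64
import json
import urllib.request

# Per-class prompts, following the pattern quoted above.
PROMPTS = {
    "person": "get gender and age of this person in 5 words or less",
    "car": "get body type and color of this car in 5 words or less",
}

def describe_crop(jpeg_bytes: bytes, cls: str) -> str:
    """Send one detected-object crop to LLaVA via Ollama, return the text."""
    payload = json.dumps({
        "model": "llava",
        "prompt": PROMPTS.get(cls, f"describe this {cls} in 5 words or less"),
        "images": [base64.b64encode(jpeg_bytes).decode()],
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",  # default Ollama endpoint
        data=payload, headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

if __name__ == "__main__":
    # In the real pipeline, YOLO supplies the class and bounding box,
    # and the crop is cut out of the frame before being sent here.
    with open("person_crop.jpg", "rb") as f:
        print(describe_crop(f.read(), "person"))
```

Running the heavy VLM only on YOLO-selected crops keeps the per-frame cost down: the VLM sees a handful of small images instead of every full frame.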
Some things that matter when it comes to configuring your IP cameras (beyond security, etc.):

- Support for RTSP

- Configurable encoding settings (e.g. H.264 codec, bitrate, i-frame interval, framerate)

- Support for substreams (i.e. a full-resolution main stream for recording, and at least one lower-resolution substream for preview/detection/etc.)
Make sure the hardware you select is capable of the above.
Configurability will matter because Identification is not the same as Detection (reference: "DORI" - Detection, Observation, Recognition, and Identification, from IEC EN 62676-4). If you want to successfully identify objects or entities using your cameras, it will require more care than basic Observation or Detection.
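The DORI levels are usually quoted as minimum horizontal pixel densities at the target (the commonly cited IEC EN 62676-4 figures are roughly 25 px/m for Detection up to 250 px/m for Identification). A quick sketch of checking a camera against them; the resolution and scene width used below are hypothetical example inputs:

```python
# Commonly cited DORI thresholds (pixels per meter at the target).
DORI = {"detection": 25, "observation": 62.5,
        "recognition": 125, "identification": 250}

def pixel_density(h_resolution_px: float, scene_width_m: float) -> float:
    """Horizontal pixels per meter across the scene at the target distance."""
    return h_resolution_px / scene_width_m

# Example: a 2560-px-wide stream covering a 12 m wide scene at the target.
density = pixel_density(2560, 12)   # ~213 px/m
for level, threshold in DORI.items():
    print(f"{level}: {'ok' if density >= threshold else 'NOT met'}")
# Recognition is met, Identification (250 px/m) is not - you'd need to
# zoom in (narrower scene) or use a higher-resolution sensor.
```

This is why a wide-angle overview camera rarely doubles as an identification camera, regardless of how good the detector is.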
If your budget supports commercial style or commercial grade cameras, looking at Dahua or Hikvision manufactured cameras would be a good starting point to get an idea of specs, features, and cost.
I’ve never implemented this kind of object persistence algo - is this a good approach? Seems naive but maybe that’s just because it’s simple.
This is the first time I've seen a "complete" setup. Any pointers for learning more about applying YOLO and similar models to real-time streams (whatever the format)?
There's a reason why there's a whole family of models from tiny to huge.
I'm on Windows.
Ideally I'd like stale frames to be dropped, so that inference always runs on the last received frame. Is this standard behaviour?
You really need to have a thread consuming the frames and feeding them to a worker that can run on its own clock.
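One common way to get that drop-stale-frames behaviour is a reader thread that overwrites a single shared slot, so the worker always picks up only the most recent frame (a minimal sketch using plain `threading`; the capture loop is stubbed with integers where `cv2.VideoCapture(...).read()` would go):

```python
import threading

class LatestFrame:
    """Single-slot buffer: the writer overwrites, the reader gets the newest."""
    def __init__(self):
        self._cond = threading.Condition()
        self._frame = None

    def put(self, frame):
        with self._cond:
            self._frame = frame   # previous unread frame is dropped, not queued
            self._cond.notify()

    def get(self):
        with self._cond:
            while self._frame is None:
                self._cond.wait()
            frame, self._frame = self._frame, None
            return frame

slot = LatestFrame()

def reader():
    # Stand-in for a capture loop reading frames as fast as they arrive.
    for i in range(100):
        slot.put(i)

t = threading.Thread(target=reader)
t.start()
t.join()

# The worker, running on its own clock, only ever sees the newest frame:
latest = slot.get()
print(latest)  # 99 - the 99 intermediate frames were overwritten, never queued
```

The worker then loops on `slot.get()` at whatever rate inference allows, and the backlog can never grow, unlike with an unbounded queue.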
I thought this topic (YOLO object recognition) would have a much bigger following; instead there are really only a few projects.
https://github.com/search?q=yolo+rtsp&type=repositories&s=fo...