They made their own DINOv3 license for this release (whereas DINOv2 used the Apache 2.0 license).
Neat though. Will still check it out.
As a first comment, I had to install the latest transformers==4.56.0.dev0 (e.g. pip install git+https://github.com/huggingface/transformers) for it to work properly. 4.55.2 and earlier were failing with a missing image type in the config.
Seems like the tides are shifting at Meta.
> ViT models pretrained on satellite dataset (SAT-493M)
DINOv2 had pretty poor out-of-the-box performance on satellite/aerial imagery, so it's super exciting that they released a version of it specifically for this use case.
For example (and this might be oversimplifying a bit; computer vision people, please correct me if I'm wrong): if you're interested in knowing whether or not an image contains a cat, then maybe there is some hyperplane P in H for which images on one side of P do not contain a cat, and images on the other side do. So "Does this image contain a cat?" becomes a much easier problem: all you have to do is figure out what P is. Once you do that, you can pass your image into DINO, take the dot product with P's normal vector, and check whether the result is negative or positive. The point is that finding P is much easier than training your own computer vision model from scratch.
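The hyperplane idea above is what people usually call a linear probe. A minimal sketch, with random 384-d vectors standing in for DINO embeddings (384 is the ViT-S embedding width) and labels generated from a synthetic hyperplane, so everything here is illustrative rather than a real cat detector:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for DINO embeddings: random 384-d vectors; the labels
# come from a synthetic hyperplane, mimicking "contains a cat".
rng = np.random.default_rng(0)
true_normal = rng.normal(size=384)       # the unknown hyperplane P
X = rng.normal(size=(500, 384))          # pretend these came from DINO
y = (X @ true_normal > 0).astype(int)    # "contains a cat" labels

# Fitting a linear probe is exactly "figuring out what P is".
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Classifying a new image: embed it, dot with P's normal, check the sign.
new_embedding = rng.normal(size=384)
score = new_embedding @ clf.coef_[0] + clf.intercept_[0]
print("cat" if score > 0 else "no cat")
```

In practice you'd replace X with actual DINO embeddings of a few hundred labeled images; with the backbone frozen, a probe like this is often all the training you need.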
I’m fascinated by this, but am admittedly clueless about how to actually go about building any kind of recognizer or other system atop it.
[0]: https://github.com/tue-mps/eomt
[1]: https://docs.lightly.ai/train/stable/semantic_segmentation.h...
As for doing it in general, it's a fairly standard vision transformer so anything built on DINOv2 (or any other ViT) should be easy to adapt to v3.
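To illustrate why the adaptation is easy: a downstream head built on DINOv2 typically only sees a (batch, patches, dim) token tensor, so it is backbone-agnostic and swapping in v3 mostly means changing which checkpoint you load. A minimal sketch of such a per-patch linear head in numpy (shapes and class count are illustrative; nothing here is DINOv3's actual API):

```python
import numpy as np

def linear_seg_head(patch_tokens, weight, bias):
    """Per-patch linear classifier over ViT patch tokens.

    Works with any ViT backbone that yields (batch, patches, dim)
    tokens, so swapping DINOv2 for DINOv3 only changes the backbone
    (and possibly the embedding dim); this head is untouched.
    """
    b, n, d = patch_tokens.shape
    logits = patch_tokens @ weight + bias        # (b, n, classes)
    side = int(n ** 0.5)                         # assume a square patch grid
    return logits.transpose(0, 2, 1).reshape(b, -1, side, side)

# Fake tokens: batch of 2, a 16x16 patch grid, 384-d (ViT-S width).
tokens = np.random.default_rng(0).normal(size=(2, 256, 384))
weight = np.zeros((384, 21))                     # 21 classes, e.g. PASCAL VOC
bias = np.zeros(21)
print(linear_seg_head(tokens, weight, bias).shape)  # (2, 21, 16, 16)
```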
DINOV3: Self-supervised learning for vision at unprecedented scale | https://news.ycombinator.com/item?id=44904608