They made their own DINOv3 license for this release (whereas DINOv2 used the Apache 2.0 license).
Neat though. Will still check it out.
As a first comment, I had to install the latest transformers==4.56.0.dev0 (e.g. pip install git+https://github.com/huggingface/transformers) for it to work properly. 4.55.2 and earlier were failing with a missing image type in the config.
Seems like the tides are shifting at Meta.
> ViT models pretrained on satellite dataset (SAT-493M)
DINOv2 had pretty poor out-of-the-box performance on satellite/aerial imagery, so it's super exciting that they released a version of it specifically for this use case.
For example (and this might be oversimplifying a bit; computer vision people, please correct me if I'm wrong): if you're interested in knowing whether or not an image contains a cat, then maybe there is some hyperplane P in H for which images on one side of P do not contain a cat, and images on the other side do. So "Does this image contain a cat?" becomes a much easier problem: all you have to do is figure out what P is. Once you do that, you can pass your image into DINO, take the dot product with P's normal vector, and check whether the result is negative or positive. The point is that finding P is much easier than training your own computer vision model from scratch.
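The hyperplane idea above is what people usually call a linear probe. A minimal sketch, with random 384-d vectors standing in for DINO embeddings (384 is the ViT-S embedding width) and labels generated from a synthetic hyperplane, so everything here is illustrative rather than a real cat detector:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for DINO embeddings: random 384-d vectors; the labels
# come from a synthetic hyperplane, mimicking "contains a cat".
rng = np.random.default_rng(0)
true_normal = rng.normal(size=384)       # the unknown hyperplane P
X = rng.normal(size=(500, 384))          # pretend these came from DINO
y = (X @ true_normal > 0).astype(int)    # "contains a cat" labels

# Fitting a linear probe is exactly "figuring out what P is".
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Classifying a new image: embed it, dot with P's normal, check the sign.
new_embedding = rng.normal(size=384)
score = new_embedding @ clf.coef_[0] + clf.intercept_[0]
print("cat" if score > 0 else "no cat")
```

In practice you'd replace X with actual DINO embeddings of a few hundred labeled images; with the backbone frozen, a probe like this is often all the training you need.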
I’m fascinated by this, but am admittedly clueless about how to actually go about building any kind of recognizer or other system atop it.
[0]: https://github.com/tue-mps/eomt
[1]: https://docs.lightly.ai/train/stable/semantic_segmentation.h...
As for doing it in general, it's a fairly standard vision transformer so anything built on DINOv2 (or any other ViT) should be easy to adapt to v3.
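To illustrate why the adaptation is easy: a downstream head built on DINOv2 typically only sees a (batch, patches, dim) token tensor, so it is backbone-agnostic and swapping in v3 mostly means changing which checkpoint you load. A minimal sketch of such a per-patch linear head in numpy (shapes and class count are illustrative; nothing here is DINOv3's actual API):

```python
import numpy as np

def linear_seg_head(patch_tokens, weight, bias):
    """Per-patch linear classifier over ViT patch tokens.

    Works with any ViT backbone that yields (batch, patches, dim)
    tokens, so swapping DINOv2 for DINOv3 only changes the backbone
    (and possibly the embedding dim); this head is untouched.
    """
    b, n, d = patch_tokens.shape
    logits = patch_tokens @ weight + bias        # (b, n, classes)
    side = int(n ** 0.5)                         # assume a square patch grid
    return logits.transpose(0, 2, 1).reshape(b, -1, side, side)

# Fake tokens: batch of 2, a 16x16 patch grid, 384-d (ViT-S width).
tokens = np.random.default_rng(0).normal(size=(2, 256, 384))
weight = np.zeros((384, 21))                     # 21 classes, e.g. PASCAL VOC
bias = np.zeros(21)
print(linear_seg_head(tokens, weight, bias).shape)  # (2, 21, 16, 16)
```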
DINOV3: Self-supervised learning for vision at unprecedented scale | https://news.ycombinator.com/item?id=44904608