Note that you can get the model weights on HuggingFace here: https://huggingface.co/adept/fuyu-8b
Secondly, do you anticipate Fuyu being made available for commercial access or will it remain NC?
Under this license, anything the community builds on top of Fuyu (which Adept says it is "excited to see") would serve Adept and no one else. What incentive does the community have to build on top of Fuyu when it can't benefit from its own work? If Adept wants to benefit from word-of-mouth discussion of their models and from community contributions that make those models work better, as has happened dramatically with Llama 2, then they need to give the community the opportunity to benefit too.
Also weird: if you look at the tags on Hugging Face, you'll see it is listed as "cc". This comes from the README[0] metadata, but "cc" isn't really a license on its own.
[0]: https://huggingface.co/adept/fuyu-8b/blob/main/README.md?cod...
FOSS meets the commercial usage requirement much better. Otherwise the term FOSS would be redundant.
I believe the copyright status of AI model weights in the US is not fully established, but so far it has been held that a list of numbers cannot be copyrighted, so the same likely applies to model weights. Note that you don't have to enter into an agreement with Adept to use the model.
Alternatively, download and use the weights in Japan, which explicitly has no copyright on AI models.
What can you tell us about this:
> Our internal models (based on Fuyu) have extra capabilities related to our product. In particular,
> 1. They can reliably perform OCR on high-resolution images
> 2. They can do fine-grained localization of text and UI elements within those images
> 3. They can answer questions about images of UIs
Is this just a matter of additional fine tuning, or are there architectural differences?
Is there an associated paper? Or more specifically, details on the training dataset? It must have been a mix of text and VLM tasks, otherwise one or the other capability would have rotted during training. But I wonder whether they trained strictly on VLM corpora, or also used plain image-text pair datasets like the ones CLIP was trained on. It would be interesting if it was only the former.
Also makes me wonder whether it could be trained on something like CommonCrawl with all the images retained and interspersed correctly throughout the text. This model could theoretically train just fine on that, and it would effectively unlock a whole new dataset.
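For what it's worth, the interleaving itself is straightforward: since Fuyu treats projected image patches as ordinary sequence elements, a crawled page can become one mixed sequence of text tokens and patch embeddings in original document order. Here's a toy sketch (the `<image>` sentinel, the fake tokenizer, and the string "patches" are all made up for illustration; datasets like OBELICS, used for IDEFICS, do something like this at scale):

```python
# Toy sketch of building an interleaved image-text training sequence,
# assuming a Fuyu-style setup where image patches are spliced directly
# into the token sequence. All names here are illustrative, not Adept's.

IMAGE_SENTINEL = "<image>"  # hypothetical marker preceding patch spans

def build_sequence(document, tokenize):
    """document: list of ('text', str) or ('image', patches) items in
    original page order. Returns a single interleaved sequence."""
    seq = []
    for kind, payload in document:
        if kind == "text":
            seq.extend(tokenize(payload))
        else:
            # Splice the image's patch embeddings in place of the image.
            seq.append(IMAGE_SENTINEL)
            seq.extend(payload)
    return seq

# Stand-in document: two text spans with an image between them.
doc = [
    ("text", "a cat photo:"),
    ("image", ["p0", "p1", "p2", "p3"]),  # pretend patch embeddings
    ("text", "and its caption."),
]
seq = build_sequence(doc, str.split)
print(seq)
```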
And has anyone inspected what the model outputs at positions where the next element would be an image "token"? Is it predicting the projected image patches with any degree of accuracy? Could it therefore also generate images inline with text if an additional de-projection layer were trained?
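The idea in that last question can be sketched in a few lines. Fuyu linearly projects flattened image patches into the decoder's embedding space; the speculative part is the inverse map back to pixels. In this minimal numpy illustration the weights are random stand-ins and the dimensions are made up (Fuyu-8B's real hidden size is much larger), and the de-projection layer is purely hypothetical, not part of Fuyu:

```python
import numpy as np

rng = np.random.default_rng(0)

PATCH = 16       # patch side length (assumed for illustration)
D_MODEL = 64     # embedding dim (made up; far smaller than the real model)
patch_dim = PATCH * PATCH * 3  # flattened RGB patch

# Linear projection of raw patches into the decoder's embedding space,
# analogous to Fuyu's image tokenization (weights are random stand-ins).
W_in = rng.normal(0, 0.02, (patch_dim, D_MODEL))
# Hypothetical "de-projection" back to pixel space -- the extra layer the
# question above speculates about, NOT something Fuyu ships with.
W_out = rng.normal(0, 0.02, (D_MODEL, patch_dim))

def project_patches(image: np.ndarray) -> np.ndarray:
    """Split an HxWx3 image into PATCHxPATCH patches and project each."""
    h, w, _ = image.shape
    patches = (
        image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, 3)
             .transpose(0, 2, 1, 3, 4)
             .reshape(-1, patch_dim)
    )
    return patches @ W_in          # (num_patches, D_MODEL)

def deproject(embeddings: np.ndarray) -> np.ndarray:
    """Map (predicted) embeddings back to flattened pixel patches."""
    return embeddings @ W_out      # (num_patches, patch_dim)

img = rng.random((64, 64, 3))      # a 64x64 image -> 4x4 = 16 patches
emb = project_patches(img)
recon = deproject(emb)
print(emb.shape, recon.shape)      # (16, 64) (16, 768)
```

In a real setup the de-projection would be trained with a pixel-space reconstruction loss on the decoder's predicted embeddings; whether those predictions are accurate enough for that to work is exactly the open question.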
This seemed a bit surreal to me, like trying to train an LLM on the outputs of a smaller, worse-performing LLM.
[0] https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md#...
A few other examples include LLaVA[0], IDEFICS[1][2], and CogVLM[3]. Mini-GPT[4] might be another one to look at. I'm pretty sure all of these have better licenses than Fuyu. Fuyu's architecture does sound really interesting, but the license on the pre-trained model is a complete non-starter for almost anything.
[0]: https://github.com/haotian-liu/LLaVA
[1]: https://huggingface.co/blog/idefics
[2]: https://huggingface.co/HuggingFaceM4/idefics-80b-instruct
https://joanfihu.wordpress.com/2023/10/19/evaluating-adepts-...
For anyone interested in contributing to a fully open source alternative, join us at https://github.com/OpenAdaptAI/OpenAdapt
Lots of interesting work to be done, including integrating with Fuyu-8B!
https://github.com/OpenAdaptAI/OpenAdapt/blob/30581e47fa9aec...
https://github.com/OpenAdaptAI/OpenAdapt/issues/246
And Fuyu is under a non-commercial license, so there's not much to be done with it unless someone trains a new Fuyu-architecture model from scratch.
I will admit my ignorance on this topic, and I didn't want us to rush into selecting a license that is inappropriate.
Which one should we choose?
I am also getting even more excited by the explosion of work on open models. I still haven’t adjusted to how good mistral-7B is, and it runs on my Mac without breaking a sweat.
Mistral-7B is incredible for its size!
[1] aya.for.ai
Full disclosure: I'm a contributor and a big believer in the project.
LLaVA 1.5 is very good, at least at describing images. http://llava.hliu.cc/