Note that you can get the model weights on HuggingFace here: https://huggingface.co/adept/fuyu-8b
Secondly, do you anticipate Fuyu being made available for commercial access or will it remain NC?
Under this license, anything the community builds on top of Fuyu (which Adept says it is "excited to see") would serve Adept and no one else. What incentive does the community have to build on top of Fuyu when it can't benefit from its own work? If Adept wants to benefit from word-of-mouth discussion of their models and from community contributions that make those models work better, as has happened dramatically with Llama 2, then they need to give the community the opportunity to benefit too.
Also weird: if you look at the tags on Hugging Face, you'll see it is listed as "cc". This comes from the README[0] metadata, but "cc" isn't really a license on its own.
[0]: https://huggingface.co/adept/fuyu-8b/blob/main/README.md?cod...
FOSS meets the commercial usage requirement much better. Otherwise the term FOSS would be redundant.
I believe the copyright status of AI model weights in the US is not fully established, but so far it has been held that a list of numbers cannot be copyrighted, so the same likely applies to model weights. Note that you don't have to enter into an agreement with Adept to use the model.
Alternatively, download and use the weights in Japan, which explicitly has no copyright on AI models.
What can you tell us about this:
> Our internal models (based on Fuyu) have extra capabilities related to our product. In particular,
> 1. They can reliably perform OCR on high-resolution images
> 2. They can do fine-grained localization of text and UI elements within those images
> 3. They can answer questions about images of UIs
Is this just a matter of additional fine tuning, or are there architectural differences?
Is there an associated paper? Or more specifically, details on the training dataset? It must have been a mix of text and VLM tasks, otherwise one or the other capability would have rotted during training. But I wonder whether they trained strictly on VLM corpora, or also used plain image-text pair datasets like the ones CLIP was trained on. It would be interesting if it was only the former.
Also makes me wonder whether it could be trained on something like CommonCrawl with all the images retained and interspersed correctly throughout the text. This model could theoretically train just fine on that, and it would effectively unlock a whole new dataset.
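For what it's worth, the interleaving itself is straightforward: since Fuyu treats projected image patches as ordinary sequence elements, a crawled page can become one mixed sequence of text tokens and patch embeddings in original document order. Here's a toy sketch (the `<image>` sentinel, the fake tokenizer, and the string "patches" are all made up for illustration; datasets like OBELICS, used for IDEFICS, do something like this at scale):

```python
# Toy sketch of building an interleaved image-text training sequence,
# assuming a Fuyu-style setup where image patches are spliced directly
# into the token sequence. All names here are illustrative, not Adept's.

IMAGE_SENTINEL = "<image>"  # hypothetical marker preceding patch spans

def build_sequence(document, tokenize):
    """document: list of ('text', str) or ('image', patches) items in
    original page order. Returns a single interleaved sequence."""
    seq = []
    for kind, payload in document:
        if kind == "text":
            seq.extend(tokenize(payload))
        else:
            # Splice the image's patch embeddings in place of the image.
            seq.append(IMAGE_SENTINEL)
            seq.extend(payload)
    return seq

# Stand-in document: two text spans with an image between them.
doc = [
    ("text", "a cat photo:"),
    ("image", ["p0", "p1", "p2", "p3"]),  # pretend patch embeddings
    ("text", "and its caption."),
]
seq = build_sequence(doc, str.split)
print(seq)
```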
And has anyone inspected what the model outputs at positions where the next element would be an image "token"? Is it predicting the projected image patches with any degree of accuracy? Could it therefore also generate images inline with text if an additional de-projection layer were trained?
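The idea in that last question can be sketched in a few lines. Fuyu linearly projects flattened image patches into the decoder's embedding space; the speculative part is the inverse map back to pixels. In this minimal numpy illustration the weights are random stand-ins and the dimensions are made up (Fuyu-8B's real hidden size is much larger), and the de-projection layer is purely hypothetical, not part of Fuyu:

```python
import numpy as np

rng = np.random.default_rng(0)

PATCH = 16       # patch side length (assumed for illustration)
D_MODEL = 64     # embedding dim (made up; far smaller than the real model)
patch_dim = PATCH * PATCH * 3  # flattened RGB patch

# Linear projection of raw patches into the decoder's embedding space,
# analogous to Fuyu's image tokenization (weights are random stand-ins).
W_in = rng.normal(0, 0.02, (patch_dim, D_MODEL))
# Hypothetical "de-projection" back to pixel space -- the extra layer the
# question above speculates about, NOT something Fuyu ships with.
W_out = rng.normal(0, 0.02, (D_MODEL, patch_dim))

def project_patches(image: np.ndarray) -> np.ndarray:
    """Split an HxWx3 image into PATCHxPATCH patches and project each."""
    h, w, _ = image.shape
    patches = (
        image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, 3)
             .transpose(0, 2, 1, 3, 4)
             .reshape(-1, patch_dim)
    )
    return patches @ W_in          # (num_patches, D_MODEL)

def deproject(embeddings: np.ndarray) -> np.ndarray:
    """Map (predicted) embeddings back to flattened pixel patches."""
    return embeddings @ W_out      # (num_patches, patch_dim)

img = rng.random((64, 64, 3))      # a 64x64 image -> 4x4 = 16 patches
emb = project_patches(img)
recon = deproject(emb)
print(emb.shape, recon.shape)      # (16, 64) (16, 768)
```

In a real setup the de-projection would be trained with a pixel-space reconstruction loss on the decoder's predicted embeddings; whether those predictions are accurate enough for that to work is exactly the open question.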
This seemed a bit surreal to me, like trying to train an LLM on the outputs of a smaller, worse-performing LLM.
[0] https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md#...
A few other examples include LLaVA[0], IDEFICS[1][2], and CogVLM[3]. Mini-GPT[4] might be another one to look at. I'm pretty sure all of these have better licenses than Fuyu. Fuyu's architecture does sound really interesting, but the license on the pre-trained model is a complete non-starter for almost anything.
[0]: https://github.com/haotian-liu/LLaVA
[1]: https://huggingface.co/blog/idefics
[2]: https://huggingface.co/HuggingFaceM4/idefics-80b-instruct
https://joanfihu.wordpress.com/2023/10/19/evaluating-adepts-...
For anyone interested in contributing to a fully open source alternative, join us at https://github.com/OpenAdaptAI/OpenAdapt
Lots of interesting work to be done, including integrating with Fuyu-8B!
https://github.com/OpenAdaptAI/OpenAdapt/blob/30581e47fa9aec...
https://github.com/OpenAdaptAI/OpenAdapt/issues/246
And Fuyu is under a non-commercial license, so there's not much to be done with it unless someone trains a new Fuyu-architecture model from scratch.
I will admit my ignorance on this topic, and I didn't want us to rush into selecting a license that is inappropriate.
Which one should we choose?
I am also getting even more excited by the explosion of work on open models. I still haven’t adjusted to how good mistral-7B is, and it runs on my Mac without breaking a sweat.
Mistral-7B is incredible for its size!
[1] aya.for.ai
Full disclosure: I'm a contributor and a big believer in the project.
LLaVA 1.5 is very good, at least at describing images. http://llava.hliu.cc/