I think it aims to leverage the cross-modal relationships and unified learning, which might not be possible with expert models designed for only a single modality.
Even if it performs slightly worse on some tasks, the ability to handle multiple modalities within a single framework is a pretty sweet advantage in scenarios where data from various sources needs to be processed simultaneously and patterns across modalities need to be captured.
A general-purpose model could also be a more cost-effective solution in some cases, since ensembles of experts are difficult to scale and parallelize.
It's an interesting question, as it raises questions of conceptual "boundaries."
The sense-plan-do loop requires a search-and-filter step for task switching, assuming an agent can do more than one thing.
So assume you have a robotic/autonomous agent that is a collection of systems (locomotion, dexterous gripper, visual perception, etc.). If each system could be represented as an "expert module" (say, the dexterous manipulator), then so long as a discriminator can appropriately switch states using the sensor/system inputs, it's conceptually possible that there is a canonical "expert module" that everyone uses. In that case, "general purpose" would apply to the agent as a whole, while "expert model" would apply to the dexterous manipulator.
You can then walk that reasoning up the abstraction layers to conclude that (as usual with these turtle stacks) the distinctions emerge as each subsystem/module specializes more granularly for the environment it operates in.
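To make that concrete, here's a minimal, purely illustrative sketch of the discriminator-routes-to-expert-modules idea. Every class and name here is made up for the example, not taken from any real robotics stack:

    # Toy sketch: a discriminator picks which "expert module" handles the
    # current situation, so the agent looks general purpose from outside
    # while each module stays specialized.

    class ExpertModule:
        """A self-contained subsystem (locomotion, gripper, perception, ...)."""
        def __init__(self, name):
            self.name = name

        def act(self, command):
            return f"{self.name} handling {command}"

    class Discriminator:
        """Picks which expert should be active given the current sensor inputs."""
        def __init__(self, experts):
            self.experts = experts

        def route(self, observation):
            # Toy rule-based switch; a real system might use a learned classifier.
            if "object_in_view" in observation:
                return self.experts["gripper"]
            return self.experts["locomotion"]

    agent = Discriminator({
        "gripper": ExpertModule("dexterous_gripper"),
        "locomotion": ExpertModule("locomotion"),
    })

    print(agent.route({"object_in_view": True}).act("grasp target"))
    print(agent.route({"path_clear": True}).act("move to waypoint"))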
I think it's probably always going to be true that any system designed to explore/exploit a bounded environment with comprehensive observations will outperform a system that is required to adapt its sense-plan-do components to that bounded environment without similar observations.
A system would either have to generate different observations than the native agent, or change the boundaries of the environment in a way that is unavailable to the native agent in order to outperform it.
If you had the same quantity of text data as GPT-4, plus a comparable quantity of data for other domains, it could probably learn transferable skills across those domains.
But it would take a huge amount of processing power that is probably not attainable today.
If a general-purpose model beats the specialized one, you could almost certainly distill the general-purpose one into a better specialized one.
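For what it's worth, here's a rough sketch of what that could look like using standard soft-label distillation in PyTorch. The teacher, student, and data loader are placeholders, not references to any specific models:

    # Hedged sketch: distill a general-purpose "teacher" into a smaller,
    # specialized "student" on the specialist's task data.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft targets from the teacher, softened by temperature T.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard-label loss on the specialized task.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    # Usage (teacher, student, loader, optimizer are hypothetical placeholders):
    # for x, y in loader:
    #     with torch.no_grad():
    #         t_logits = teacher(x)
    #     loss = distillation_loss(student(x), t_logits, y)
    #     loss.backward(); optimizer.step(); optimizer.zero_grad()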
transformer(transformer(transformer( ... x ... ))) = ?
So since it'll be hard to go deeper, going broader by interlacing different model types might be a way to pierce through.
GPT-4 did not scale up substantially in depth, going from 175B to 220B parameters per transformer.
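One way to picture "going broader" rather than deeper is bolting a separate image encoder onto a text transformer and feeding both into the same token stream. A toy sketch follows; the dimensions and module choices are invented purely for illustration:

    # Illustrative sketch: project features from a separate image encoder into
    # a text transformer's sequence instead of stacking more layers.
    import torch
    import torch.nn as nn

    class BroadMultimodalModel(nn.Module):
        def __init__(self, vocab_size=32000, d_model=512):
            super().__init__()
            self.token_emb = nn.Embedding(vocab_size, d_model)
            self.image_proj = nn.Linear(768, d_model)  # map image features into text space
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers=6)

        def forward(self, text_ids, image_feats):
            text = self.token_emb(text_ids)          # (B, T_text, d_model)
            image = self.image_proj(image_feats)     # (B, T_img, d_model)
            fused = torch.cat([image, text], dim=1)  # interlace modalities along the sequence
            return self.transformer(fused)

    model = BroadMultimodalModel()
    out = model(torch.randint(0, 32000, (2, 16)), torch.randn(2, 4, 768))
    print(out.shape)  # torch.Size([2, 20, 512])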
Or is the idea to keep the network the same size and trade off some of its nodes for image, video, etc. data?
If so, has anyone shown that doing so results in better overall performance?
My lay observation is that GPT-4 seems to be on the border of usability for most applications, so if nothing is gained by simply changing the input data type (as opposed to expanding the model), then it feels like it won't be of much use yet.
Also, apologies if I'm not making sense; I'm almost certainly not using the correct technical terms to articulate what I'm thinking.
The problem with that assumption, though, is that transformers are good at identifying and replicating patterns given a set of rules (e.g. how proteins fold and misfold depending on the environment).
Hubble data isn’t so much “we know the rules but not their interactions” as much as “we don’t really know the full set of rules,” so that particular example probably wouldn’t be that fruitful.
In general, biology (where we understand the basic rules but not the complex ways they are combined) is the most fertile ground for transformer-driven research.
1. How does the multimodal model help improve the accuracy of image classification when training combines text, images, and audio?
2. How about speed? I would imagine a model trained on text, audio, and image data would be larger than a text-only model?
Flying humans was science fiction 120 years ago. A single bomb able to destroy an entire city was science fiction 80 years ago. A machine that can complete more mathematical calculations in one minute than all human manual computation in history was science fiction 60 years ago. EUV photolithography capable of creating molecule-sized transistors was science fiction 30 years ago. A computer that can create visual art and talk to you in plain English was science fiction 2 years ago. A computer that can clone your voice and mannerisms was science fiction 1 year ago.
Science fiction has a way of becoming non-fiction, often within the span of a generation or less.
Nobody's worried about the tech that exists.
> The near term risks we face from AI are constrained to the realms of spam and privacy.
Define "near term".
> World ending super-bots are science fiction.
Science fiction has become science fact before. Where's the knockdown argument that it won't happen in this case?
What? Did you not see the Netflix documentary on AI for military use? They literally have AIs that can beat fighter pilots in dogfighting.
Just because it isn't walking around having coffee and chatting you up, doesn't mean it isn't already very advanced and deadly.
But the more different technologies are plugged together to start resembling a brain, like a visual cortex, a speech center, motor controls, etc., the harder that line is to draw.
At some point the distinction between carbon-based life and silicon becomes meaningless. All the arguments or proofs that humans are conscious would equally prove AI is conscious, or that neither truly is. Proving an AI is not conscious would also prove humans aren't.
And of course, Terminators.