I think it aims to leverage the cross-modal relationships and unified learning, which might not be possible with expert models designed for only a single modality.
Even if it performs slightly worse on some tasks, the ability to handle multiple modalities within a single framework is a pretty sweet advantage in scenarios where data from various sources needs to be processed simultaneously and patterns across modalities need to be captured.
A general-purpose model could also be a more cost-effective solution in some cases, since ensembles of experts are difficult to scale and parallelize.
It's an interesting question, as it raises questions of conceptual "boundaries."
The sense-plan-do loop requires a search-and-filter step for task switching, assuming an agent can do more than one thing.
So assume you have a robotic/autonomous agent that is a collection of systems (locomotion, dexterous gripper, visual perception, etc.). If each system could be represented as an "expert module" (say, the dexterous manipulator), then so long as a discriminator can appropriately switch states using the sensor/system inputs, it's conceptually possible that there is a canonical "expert module" that everyone uses. In that case, "general purpose" would apply to the agent as a whole, while "expert model" would apply to the dexterous manipulator.
You can then walk that reasoning up the abstraction layers to conclude that (as usual with these turtle stacks) the distinctions emerge as each subsystem/module specializes more granularly for the environment it operates in.
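To make that concrete, here's a minimal, purely illustrative sketch of the discriminator-routes-to-expert-modules idea. Every class and name here is made up for the example, not taken from any real robotics stack:

    # Toy sketch: a discriminator picks which "expert module" handles the
    # current situation, so the agent looks general purpose from outside
    # while each module stays specialized.

    class ExpertModule:
        """A self-contained subsystem (locomotion, gripper, perception, ...)."""
        def __init__(self, name):
            self.name = name

        def act(self, command):
            return f"{self.name} handling {command}"

    class Discriminator:
        """Picks which expert should be active given the current sensor inputs."""
        def __init__(self, experts):
            self.experts = experts

        def route(self, observation):
            # Toy rule-based switch; a real system might use a learned classifier.
            if "object_in_view" in observation:
                return self.experts["gripper"]
            return self.experts["locomotion"]

    agent = Discriminator({
        "gripper": ExpertModule("dexterous_gripper"),
        "locomotion": ExpertModule("locomotion"),
    })

    print(agent.route({"object_in_view": True}).act("grasp target"))
    print(agent.route({"path_clear": True}).act("move to waypoint"))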
I think it's probably always going to be true that any system designed to explore/exploit a bounded environment with comprehensive observations will outperform a system that is required to adapt its sense-plan-do components to that bounded environment without similar observations.
A system would either have to generate different observations than the native agent, or change the boundaries of the environment in a way that is unavailable to the native agent in order to outperform it.
If you had the same quantity of text data as GPT-4, plus a comparable quantity of data for other domains, it could probably learn transferable skills across those domains.
But it would take a huge amount of processing power that is probably not attainable today.
If a general-purpose model beats the specialized one, you could almost certainly distill the general-purpose one into a better specialized one.
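For what it's worth, here's a rough sketch of what that could look like using standard soft-label distillation in PyTorch. The teacher, student, and data loader are placeholders, not references to any specific models:

    # Hedged sketch: distill a general-purpose "teacher" into a smaller,
    # specialized "student" on the specialist's task data.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft targets from the teacher, softened by temperature T.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard-label loss on the specialized task.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    # Usage (teacher, student, loader, optimizer are hypothetical placeholders):
    # for x, y in loader:
    #     with torch.no_grad():
    #         t_logits = teacher(x)
    #     loss = distillation_loss(student(x), t_logits, y)
    #     loss.backward(); optimizer.step(); optimizer.zero_grad()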
transformer(transformer(transformer( ... x ... ))) = ?
So since it'll be hard to go deeper, going broader by interlacing different model types might be a way to pierce through.
GPT-4 did not scale up substantially in depth, going from 175B to 220B parameters per transformer.
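One way to picture "going broader" rather than deeper is bolting a separate image encoder onto a text transformer and feeding both into the same token stream. A toy sketch follows; the dimensions and module choices are invented purely for illustration:

    # Illustrative sketch: project features from a separate image encoder into
    # a text transformer's sequence instead of stacking more layers.
    import torch
    import torch.nn as nn

    class BroadMultimodalModel(nn.Module):
        def __init__(self, vocab_size=32000, d_model=512):
            super().__init__()
            self.token_emb = nn.Embedding(vocab_size, d_model)
            self.image_proj = nn.Linear(768, d_model)  # map image features into text space
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers=6)

        def forward(self, text_ids, image_feats):
            text = self.token_emb(text_ids)          # (B, T_text, d_model)
            image = self.image_proj(image_feats)     # (B, T_img, d_model)
            fused = torch.cat([image, text], dim=1)  # interlace modalities along the sequence
            return self.transformer(fused)

    model = BroadMultimodalModel()
    out = model(torch.randint(0, 32000, (2, 16)), torch.randn(2, 4, 768))
    print(out.shape)  # torch.Size([2, 20, 512])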
Or is the idea to keep the network the same size and trade off some of its nodes for image, video, etc. data?
If so, has anyone shown that doing so results in better overall performance?
My lay observation is that GPT-4 seems to be on the border of usability for most applications, so if nothing is gained by simply changing the input data type (as opposed to expanding the model), then it feels like it won't be of much use yet.
Also, apologies if I'm not making sense; I'm almost certainly not using the correct technical terms to articulate what I'm thinking.
The problem with that assumption, though, is that transformers are good at identifying and replicating patterns given a set of rules (e.g. how proteins fold and misfold depending on the environment).
Hubble data isn’t so much “we know the rules but not their interactions” as much as “we don’t really know the full set of rules,” so that particular example probably wouldn’t be that fruitful.
In general, biology (where we understand the basic rules but not the complex ways they are combined) is the most fertile ground for transformer-driven research.
1. How does the multimodal model help improve the accuracy of image classification when training combines text, images, and audio?
2. How about speed? I would imagine a model trained on text, audio, and image data would be larger than a text-only model?
Flying humans was science fiction 120 years ago. A single bomb able to destroy an entire city was science fiction 80 years ago. A machine that can complete more mathematical calculations in one minute than all human manual computation in history was science fiction 60 years ago. EUV photolithography capable of creating molecule-sized transistors was science fiction 30 years ago. A computer that can create visual art and talk to you in plain English was science fiction 2 years ago. A computer that can clone your voice and mannerisms was science fiction 1 year ago.
Science fiction has a way of becoming non-fiction, often within the span of a generation or less.
Nobody's worried about the tech that exists.
> The near term risks we face from AI are constrained to the realms of spam and privacy.
Define "near term".
> World ending super-bots are science fiction.
Science fiction has become science fact before. Where's the knockdown argument that it won't happen in this case?
What? Did you not see the Netflix documentary on AI for military use? They literally have AIs that can beat fighter pilots in dogfighting.
Just because it isn't walking around having coffee and chatting you up, doesn't mean it isn't already very advanced and deadly.
But the more different technologies are plugged together to start resembling a brain, like a visual cortex, a speech center, motor controls, etc., the harder that line is to draw.
At some point the distinction between carbon-based life and silicon becomes meaningless. All the arguments or proofs that humans are conscious would equally prove AI is conscious, or that neither truly is. Proving an AI is not conscious would also prove humans aren't.
And of course, Terminators.