It's interesting that OpenAI is highlighting the Elo score instead of showing results for many many benchmarks that all models are stuck at 50-70% success.
[1] https://twitter.com/LiamFedus/status/1790064963966370209
I don't really care whether it's stronger than gpt-4-turbo or not. The direct real-time video and audio capabilities are absolutely magical and stunning. The responses in voice mode are now instantaneous, you can interrupt the model, you can talk to it while showing it a video, and it understands (and uses) intonation and emotion.
Really, just watch the live demo. I linked directly to where it starts.
Importantly, this makes the interaction a lot more "human-like".
I hope the next version delivers on being smarter, as this update instead of making me excited, makes me feel they’ve reached a plateau on the improvement of the core value and are distracting us with fluff instead
I predict there will be a zoo (more precisely tree, as in "family tree") of models and derived models for particular application purposes, and there will be continued development of enhanced "universal"/foundational models as well. Some will focus on minimizing memory, others on minimizing pre-training or fine-tuning energy consumption, some need high accuracy, others hard realtime speed, yet others multimodality like GPT4.o, some multilinguality, and so on.
Previous language models that encoded dictionaries for spellcheckers etc. never got standardized (for instance, compare aspell dictionaries to the ones from LibreOffice to the language model inside CMU PocketSphinx) so that you could use them across applications or operating systems. As these models are becoming more common, it would be interesting to see this aspect improve this time around.
https://www.rev.com/blog/resources/the-5-best-open-source-sp...
I think people who emphasis specialized models are operating under a false assumption that by focusing the model it'll be able to go deeper in that domain. However, the opposite seems to be true.
Granted, specialized models like AlphaFold are superior in their domain but I think that'll be less true as models become more capable at general learning.
That said, given the price tag, when AI becomes genuinely expert then I'm probably not going to have a job and neither will anyone else (modulo how much electrical power those humanoid robots need, as the global electricity supply is currently only 250 W/capita).
In the meantime, making it a properly real-time conversational partner… wow. Also, that's kinda what you need for real-time translation, because: «be this, that different languages the word order totally alter and important words at entirely different places in the sentence put», and real-time "translation" (even when done by a human) therefore requires having a good idea what the speaker was going to say before they get there, and being able to back-track when (as is inevitable) the anticipated topic was actually something completely different and so the "translation" wasn't.
A real time translator would be a killer app indeed, and it seems not so far away, but note how you have to prompt the interaction with ‘Hey ChatGPT’; it does not interject on its own. It is also unclear if it is able to understand if multiple people are speaking and who’s who. I guess we’ll see soon enough :)
GPT-4o got slightly better overall. Ability to reason improved more than the rest.
This model isn't about basemark chasing or being a better code generator; it's entirely explicitly focused on pushing prior results into the frame of multi-modal interaction.
It's still a WIP, most of the videos show awkwardness where its capacity to understand the "flow" of human speech is still vestigial. It doesn't understand how humans pause and give one another space for such pauses yet.
But it has some indeed magic ability to share a deictic frame of reference.
I have been waiting for this specific advance, because it is going to significantly quiet the "stochastic parrot" line of wilfully-myopic criticism.
It is very hard to make blustery claims about "glorified Markov token generation" when using language in a way that requires both a shared world model and an understanding of interlocutor intent, focus, etc.
This is edging closer to the moment when it becomes very hard to argue that system does not have some form of self-model and a world model within which self, other, and other objects and environments exist with inferred and explicit relationships.
This is just the beginning. It will be very interesting to see how strong its current abilities are in this domain; it's one thing to have object classification—another thing entirely to infer "scripts plans goals..." and things like intent, and, deixis. E.g. how well does it now understand "us" and "them" and "this" vs "that"?
Exciting times. Scary times. Yee hawwwww.
> using language in a way that requires both a shared world model
Where? What example of GPT-4o requires a shared world model? The customer support example?
The reason GPT-4 does not have any meaningful world model (in the sense that rats have meaningful world models) is that it freely believes contradictory facts without being confused, freely confabulates without having brain damage, and it has no real understanding of quantity or causality. Nothing in GPT-4o fixes that, and gpt2-chatbot certainly had the same problems with hallucinations and failing the same pigeon-level math problems that all other GPTs fail.
>that it freely believes contradictory facts without being confused,
Humans do this. You do this. I guess you don't have a meaningful world model.
>freely confabulates without having brain damage
Humans do this
>and it has no real understanding of quantity or causality.
Well this one is just wrong.
They really Put That There!
https://www.youtube.com/watch?v=RyBEUyEtxQo
Oh, shit.
So local modelling (completely offline but per speaker aware and responsive), with a really flexible application API. Sort of the GTK or QT equivalent for voice interactions. Also custom naming, so instead of "Hey Siri" or "Hey Google" I could say, "Hey idiot" :-)
Definitely some interesting tech here.
with a GPT you can modify the system prompt
We'll have to see when end users actually get access to the voice features "in the coming weeks".
Probably some kinks there they are working out
Or just a good idea for a live demo on a congested network/environment with a lot of media present, at least one live video stream (the one we're watching the recording of), etc.
At least that's how I understood it, not that they had a problem with it (consistently or under regular conditions, or specific to their app).
Thanks for this.
Skinner: "Yes."
Chalmers: "May I see it?"
Skinner: "No."
Perhaps it's worth taking a beat and looking at the incredible progress in that year, and acknowledge that whatever's next is probably "still cooking".
Even Meta is still baking their 400B parameter model.
Skinner (looking up): No, mother, it's just the Nvidia GPUs.
Skinner (looking up): "No, mother, it's just the Nvidia GPUs."
"No, mother, that's just the H100s."
But I am not convinced it will be another GPT-4 moment. Seems like big focus on tacking together multi-modal clever tricks vs straight better intelligence AI.
Hope they prove me wrong!
Multimodal is another way to stave off the inevitable, because these AI companies already are training multiple models on different piles of information. If you have to train a text model and an image model, why split your training data in half when you could train a combined model on a combined dataset?
[0] For starters, most YouTube videos aren't manually captioned, so you're feeding GPT the output of Google's autocaptioning model, so it's going to start learning artifacts of what that model can't process.
I'd bet a lot of YouTubers are using LLMs to write and/or edit content. So we pass that through a human presentation. Then introduce some errors in the form of transcription. Turn feed the output in as part of a training corpus ... we plateaued real quick.
It seems like it's hard to get past a level of human intelligence at which there's a large enough corpus of training data or trainers?
Anyone know of any papers on breaking this limit to push machine learning models to super-human intelligence levels?
Whisper models are better than anything google has. In fact the higher quality whisper models are better than humans when it comes to transcribing text with punctuation.
I would expect they’re using their own t2s which is still a model but way better quality and potentially customizable to better suit their needs
Improving the instruction tuning, the RLHF step, increase the training size, work on multilingual capabilities, etc. make sense as a way to improve quality, but I think increasing model size doesn't. Being able to advertize a big breakthrough may make sense in terms of marketing, but I don't believe it's going to happen for two reasons:
- you don't release intermediate steps when you want to be able to advertise big gains, because it raises the baseline and reduce the effectiveness of your ”big gains” in terms of marketing.
- I don't think they would benefit in an arm race with Meta, trying to keeping a significant edge. Meta is likely to be able to catch-up eventually on performance, but they are not so much of a threat in terms of business. Focusing on keeping a performance edge instead of making their business viable would be a strategic blunder.
They have the brand recognition (for ChatGPT) and that's a good start, but that's not enough. Providing a best in class user experience (which seems to be their focus now, with multimodality), a way to lock down their customers in some kind of walled garden, building some kind of network effect (what they tried with their marketplace for community-built “GPTs” last fall but I'm not sure it's working), something else?
At the end of the day they have no technological moat, so they'll need to build a business one, or perish.
For most tasks, pretty much every models from their competitors is more than good enough already, and it's only going to get worse as everyone improves. Being marginally better on 2% of tasks isn't going to be enough.
Seems to me that performance is converging and we might not see a significant jump until we have another breakthrough.
It doesn't seem that way to me. But even if it did, video generation also seemed kind of stagnant before Sora.
In general, I think The Bitter Lesson is the biggest factor at play here, and compute power is not stagnating.
I take the opposite view. I don't think video generation was stagnating at all, and was in fact probably the area of generative AI that was seeing the biggest active strides. I'm highly optimistic about the future trajectory of image and video models.
By contrast, text generation has not improved significantly, in my opinion, for more than a year now, and even the improvement we saw back then was relatively marginal compared to GPT-3.5 (that is, for most day-to-day use cases we didn't really go from "this model can't do this task" to "this model can now do this task". It was more just "this model does these pre-existing tasks, in somewhat more detail".)
If OpenAI really is secretly cooking up some huge reasoning improvements for their text models, I'll eat my hat. But for now I'm skeptical.
I'll reserve judgment until we see GPT5, but if it becomes just a matter of who best can monetize existing capabilities, OAI isn't the best positioned.
Trying to talk it into writing anything other than toy code is an exercise in banging my head against the wall.
This model has been being tested under a code name of ‘gpt2-chatbot’ but it is very much a new GPT4+-level model, with new multimodal capabilities - but apparently some impressive work around inference speed.
Highlighting so people don’t get the impression this is just OpenAI slapping a new label on something a generation out of date.
(text input in web version)
maybe it's programmed to completely ignore swearing but how could I not swear after it gave me repeatedly info about you.com when I try to address it in second person
The improvements they seem to be hyping are in multimodality and speed (also price – half that of GPT-4 Turbo – though that’s their choice and could be promotional, but I expect it’s at least in part, like speed, a consequence of greater efficiency), not so much producing better output for the same pure-text inputs.
and the prompt wasn't a monstrosity, and it wasn't even that good, it was just one line "I need help to categorize these expenses" and off it went. hope it won't get enshittified like turbo, because this finally feels as great as 3.5 was for goal seeking.
The "gpt2-chatbot" was the worst of the three.