The ablation studies are well done: comprehensive and expensive to run. People will be using these conclusions for years, which is far more impactful than whether an upcoming Siri product outperforms the GPT model of that same moment.
A few really interesting points:
Synthetic datasets substantially (1%+) increase performance for Image Encoder Pre-training
Architecture of the Visual<->Language model connector doesn't seem to matter.
Interleaving text and image data improves few shot performance, but image captioning data improves zero-shot numbers.
The ideal mix of data types is 5:5:1 for Interleaved:Captions:Plain Text (!)
Synthetic captioning data helps substantially at this point too (up to 4% gain)
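To make the 5:5:1 ratio concrete, here is a minimal sketch of weighted source sampling for a training data mix. The source names and the sampling mechanism are illustrative only, based on the ratio above, not on the paper's actual data loader:

```python
import random

# Hypothetical weights matching the 5:5:1 interleaved:captions:plain-text
# mix mentioned above; not taken from the MM1 codebase.
MIX = {"interleaved": 5, "captions": 5, "plain_text": 1}

def sample_source(rng: random.Random) -> str:
    """Pick a data source with probability proportional to its mix weight."""
    sources, weights = zip(*MIX.items())
    return rng.choices(sources, weights=weights, k=1)[0]

# Draw 11,000 samples; we expect roughly 5000 / 5000 / 1000.
rng = random.Random(0)
counts = {s: 0 for s in MIX}
for _ in range(11_000):
    counts[sample_source(rng)] += 1
```

In practice a real loader would interleave batches from each source rather than drawing one example at a time, but the proportions work out the same.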
The appendices are amazing: lots of detail on the learning rates and batch sizes they tried.
The "explain these figures" examples are really, really good. See page 37.
The actual set of models produced (up to 30B parameters) seems secondary to the intent of the paper, and is more validation of the best design choices in each area.
If all it does is improve Siri a bit without massively expanding the range of applications and APIs it will be a big disappointment.
I think what Apple presents in June will decide whether on-device AI will be seen as a viable alternative to cloud APIs.
I don't usually say this, but TFA frankly feels like it was written by AI:
> The release of MM1 by Apple contributes significantly to the artificial intelligence domain, offering a detailed roadmap for the development of future MLLMs. By sharing the insights and design principles gleaned from MM1, Apple not only challenges the current capabilities of models like ChatGPT but also invites the broader AI community to build upon their findings, potentially leading to more sophisticated and capable AI systems.
Absolutely no benchmarks against GPT4 present in the paper.
Notably, they used instruction-response pairs generated from GPT4 for supervised fine-tuning, which has always felt like an experimental hack to me, but that's how many folks are bootstrapping smaller models these days, and the effectiveness is hard to argue with.
Apple’s axlearn framework was used which leverages JAX and XLA [2].
Table 4 on page 14 shows comparisons to GPT4V
My dream is to one day be listed on a seminal paper as "secondary forum reply author".
I recall that my undergrad institution once invented a new deanship out of whole cloth for a coach who'd maxed out on the "professor" pay scale.
Even worse, the bastard didn't even win games!
PS Can I be your hairdresser?
You can aspire higher and just use one of these LLMs to be a "first author" in a published peer reviewed paper.
Maybe I find conversational UIs awkward, or maybe I just got jaded REALLY quickly from Siri’s lacking capabilities early on, but I have hardly used it in the decade or whatever that it’s been around.
I outsource so much of my memory to the phone via Siri ALL THE TIME. It's so useful, even for things 20 minutes out. I'll easily forget if I don't do this, and it's reliable, so it gives me confidence. It also keeps the notification present until I actually do the thing, so I have a kind of string around my finger until the task is accomplished. I can also snooze that notification as needed to bring it back up at the right time.
Every time I do this around non-tech people they go “wow I didn’t know you could do that.” I swear it’s literally life changing, particularly for anyone over 30.
Google can also do this. Alexa has lots of problems, but it can raise a blind in a pinch. We also spent a ton on Lutron shades because we discovered that we were just managing them too much manually (Siri then is great for controlling that).
You can also ask Siri the weather in the morning, useful in figuring out how to dress the kid.
1. Find my phone via Siri on homepod
2. Set a simple timer
3. Add to a list
4. Send a text message to one of a few contacts
It can and sometimes does do all of those things, but horribly unreliably.
I like that I can model Siri as a decision tree with voice-activated input. Being able to configure it to do more things (for example, to put reminders in Things rather than Reminders) would be useful. More discoverability would also be great (but this is Apple we're talking about, so good luck there). But for me personally, the most important feature is that Siri is predictable: once I figure out how to do something with it, asking again in mostly the same way will get the same result. If I want to talk to an LLM, I have ChatGPT on my phone.
It’s not perfect, for sure:
Me: Hey Siri, turn off the kitchen lights.
Siri: I can’t process multiple requests.
Me: Hey Siri, turn off the kitchen lights.
Siri: OK.
But it works reliably enough that I use it all the time for the reminder and timer actions. Is it vastly worse for other people, and in what ways?
I’m still baffled at Siri and Google assistant. Virtually zero innovation in a decade. I just want to be able to turn on BBC radio while my hands are wet, is that really so hard?!
You're in luck! Siri will do that right now. Just tried it. Works.
Knowing Apple, I was expecting one base timer, with every other timer being a $200 upgrade.
“Hey Siri set an egg timer for 4 minutes”
The interface for switching between multiple timers sucks on the watch; the whole app does now. I don't know how it's handled on HomePods, though you can see them somewhere in the Home app (yeah, that's discoverable).
But it works fine. And the interface is good on the phone.