The ablation studies are well done: comprehensive and expensive to run. People will be using these conclusions for years, which is far more impactful than whether an upcoming Siri product outperforms the GPT model of that same moment.
A few really interesting points:
Synthetic datasets substantially (1%+) increase performance for Image Encoder Pre-training
Architecture of the Visual<->Language model connector doesn't seem to matter.
Interleaving text and image data improves few shot performance, but image captioning data improves zero-shot numbers.
The ideal mix of data types is 5:5:1 for Interleaved:Captions:Plain Text (!)
Synthetic captioning data helps substantially at this point too (up to 4% gain)
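To make the 5:5:1 ratio concrete, here is a minimal sketch of weighted source sampling for a training data mix. The source names and the sampling mechanism are illustrative only, based on the ratio above, not on the paper's actual data loader:

```python
import random

# Hypothetical weights matching the 5:5:1 interleaved:captions:plain-text
# mix mentioned above; not taken from the MM1 codebase.
MIX = {"interleaved": 5, "captions": 5, "plain_text": 1}

def sample_source(rng: random.Random) -> str:
    """Pick a data source with probability proportional to its mix weight."""
    sources, weights = zip(*MIX.items())
    return rng.choices(sources, weights=weights, k=1)[0]

# Draw 11,000 samples; we expect roughly 5000 / 5000 / 1000.
rng = random.Random(0)
counts = {s: 0 for s in MIX}
for _ in range(11_000):
    counts[sample_source(rng)] += 1
```

In practice a real loader would interleave batches from each source rather than drawing one example at a time, but the proportions work out the same.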
The appendices are amazing: lots of detail on the learning rates and batch sizes they tried.
The "explain these figures" examples are really, really good. See page 37.
The actual set of models produced (up to 30B parameters) seems secondary to the intent of the paper, and is more validation of the best design choices in each area.
If all it does is improve Siri a bit without massively expanding the range of applications and APIs it will be a big disappointment.
I think what Apple presents in June will decide whether on-device AI will be seen as a viable alternative to cloud APIs.
I don't usually say this, but TFA frankly feels like it was written by AI:
> The release of MM1 by Apple contributes significantly to the artificial intelligence domain, offering a detailed roadmap for the development of future MLLMs. By sharing the insights and design principles gleaned from MM1, Apple not only challenges the current capabilities of models like ChatGPT but also invites the broader AI community to build upon their findings, potentially leading to more sophisticated and capable AI systems.
Absolutely no benchmarks against GPT4 present in the paper.
Notably, they used instruction-response pairs generated from GPT4 for supervised fine-tuning, which has always felt like an experimental hack to me, but that's how many folks are bootstrapping smaller models these days, and the effectiveness is hard to argue with.
Apple’s axlearn framework was used which leverages JAX and XLA [2].
Table 4 on page 14 shows comparisons to GPT4V
My dream is to one day be listed on a seminal paper as "secondary forum reply author".
I recall that my undergrad institution once invented a new deanship out of whole cloth for a coach who'd maxed out on the "professor" pay scale.
Even worse, the bastard didn't even win games!
PS Can I be your hairdresser?
You can aspire higher and just use one of these LLMs to be a "first author" in a published peer reviewed paper.
Maybe I find conversational UIs awkward, or maybe I just got jaded REALLY quickly from Siri’s lacking capabilities early on, but I have hardly used it in the decade or whatever that it’s been around.
I outsource so much of my memory to the phone via Siri ALL THE TIME. It's so useful, even for things 20 minutes out. I'll easily forget if I don't do this, and it's reliable, so it gives me confidence. It also keeps the notification present until I actually do the thing, so I have a kind of string around my finger until the task is accomplished. I can also snooze that notification as needed to bring it back up at the right time.
Every time I do this around non-tech people they go “wow I didn’t know you could do that.” I swear it’s literally life changing, particularly for anyone over 30.
Google can also do this. Alexa has lots of problems, but it can raise a blind in a pinch. We also spent a ton on Lutron shades because we discovered that we were just managing them too much manually (Siri then is great for controlling that).
You can also ask Siri the weather in the morning, useful in figuring out how to dress the kid.
1. Find my phone via Siri on homepod
2. Set a simple timer
3. Add to a list
4. Send a text message to one of a few contacts
It can and sometimes does do all of those things, but horribly unreliably.
I like that I can model Siri as a decision tree with voice-activated input. Being able to configure it to do more things (for example, to put reminders in Things rather than Reminders) would be useful. More discoverability would also be great (but this is Apple we're talking about, so good luck there). But for me personally, the most important feature is that Siri is predictable: once I figure out how to do something with it, asking again in mostly the same way will get the same result. If I want to talk to an LLM, I have ChatGPT on my phone.
It’s not perfect, for sure:
Me: Hey Siri, turn off the kitchen lights.
Siri: I can’t process multiple requests.
Me: Hey Siri, turn off the kitchen lights.
Siri: OK.
But it works reliably enough that I use it all the time for the reminder and timer actions. Is it vastly worse for other people, and in what ways?
I’m still baffled at Siri and Google assistant. Virtually zero innovation in a decade. I just want to be able to turn on BBC radio while my hands are wet, is that really so hard?!
You're in luck! Siri will do that right now. Just tried it. Works.
Knowing Apple, I was expecting one base timer, with every other timer being a $200 upgrade.
“Hey Siri set an egg timer for 4 minutes”
The interface for switching between multiple timers sucks on the watch; the whole app does now. I don't know how it's handled on HomePods, though you can see them somewhere in the Home app (yeah, that's discoverable).
But it works fine. And the interface is good on the phone.