Technical paper: https://goo.gle/GeminiPaper
Some details:
- 32k context length
- efficient attention mechanisms, e.g. multi-query attention (Shazeer, 2019); see the sketch after this list
- audio input via Universal Speech Model (USM) (Zhang et al., 2023) features
- no audio output? (Figure 2)
- visual encoding of the Gemini models is inspired by Google's own foundational work on Flamingo (Alayrac et al., 2022), CoCa (Yu et al., 2022a), and PaLI (Chen et al., 2022)
- output images using discrete image tokens (Ramesh et al., 2021; Yu et al., 2022b)
- supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF)
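For anyone who hasn't seen multi-query attention before: the idea from Shazeer (2019) is that all query heads share a single key/value head, which shrinks the KV cache and speeds up decoding. A minimal NumPy sketch of just the attention step (shapes and names are mine, not from the paper):

```python
import numpy as np

def multi_query_attention(q, k, v):
    """Multi-query attention: many query heads, one shared K/V head.

    q: (num_heads, seq_len, head_dim)  -- separate queries per head
    k: (seq_len, head_dim)             -- single shared key head
    v: (seq_len, head_dim)             -- single shared value head
    Returns: (num_heads, seq_len, head_dim)
    """
    d = q.shape[-1]
    # Every query head attends over the same shared keys/values.
    scores = q @ k.T / np.sqrt(d)               # (heads, seq, seq)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)   # softmax over keys
    return weights @ v                          # (heads, seq, head_dim)

# Toy usage: 8 query heads, 4 tokens, head_dim 16.
q = np.random.randn(8, 4, 16)
k = np.random.randn(4, 16)
v = np.random.randn(4, 16)
print(multi_query_attention(q, k, v).shape)  # (8, 4, 16)
```

The win is that only one K/V head has to be cached per token during decoding, instead of one per head as in standard multi-head attention.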
I think this is already more detail than we got from OpenAI about GPT-4, but on the other hand, it's still very little.
For MMLU, the report highlights the CoT@32 result, where Ultra beats GPT-4, but Ultra loses to GPT-4 with 5-shot, for example.
For GSM8K it uses Maj1@32 for Ultra and 5-shot CoT for GPT-4, etc.
On top of that, for some reason, it reports different metrics for Ultra and Pro, making them hard to compare.
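To make the comparison concrete: 5-shot means a single greedy answer conditioned on five exemplars, while Maj1@32 samples 32 chain-of-thought completions and takes a majority vote; the report's CoT@32 additionally falls back to the greedy answer when the consensus is weak. A rough sketch of the sampling side, where `generate_cot_answer` is a hypothetical model call and the threshold value is illustrative:

```python
from collections import Counter

def generate_cot_answer(question: str, temperature: float = 0.7) -> str:
    """Hypothetical model call: returns the final answer extracted from
    one sampled chain-of-thought completion."""
    raise NotImplementedError  # stand-in for a real LLM API call

def majority_at_k(question: str, k: int = 32):
    """Maj1@k: sample k chain-of-thought answers, return the most common one."""
    answers = [generate_cot_answer(question) for _ in range(k)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / k  # answer plus the fraction of samples agreeing

def uncertainty_routed_cot(question: str, k: int = 32, threshold: float = 0.6):
    """Roughly the CoT@k idea: keep the majority answer only when consensus
    is strong, otherwise fall back to a single greedy answer."""
    answer, agreement = majority_at_k(question, k)
    if agreement >= threshold:
        return answer
    return generate_cot_answer(question, temperature=0.0)  # greedy fallback
```

The point being argued here is that which of these procedures you pick can move MMLU by several points, so quoting each model under a different one makes the headline numbers hard to interpret.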
What a mess of a "paper".
They simply compare the prompting strategies that work best with each model. Otherwise it would just be a comparison of how each model responds to one specific piece of prompt engineering.
Incorrect.
# Gemini marketing website, MMLU
- Gemini Ultra 90.0% with CoT@32*
- GPT-4 86.4% with 5-shot* (reported)
# gemini_1_report.pdf, MMLU
- Gemini Ultra 90.0% with CoT@32*
- Gemini Ultra 83.7% with 5-shot
- GPT-4 87.29% with CoT@32 (via API*)
- GPT-4 86.4% with 5-shot (reported)
The Gemini marketing website compared the best Gemini Ultra prompting strategy against a worse-performing (5-shot) GPT-4 prompting strategy.
(nitter: https://nitter.net/a_a_cabrera/status/1732454328307511807#m)
Ultra is out sometime next year, with GPT-4 level capability.
Pro is out now (?) with ??? level capability.
Sadly it's GPT-3.5 quality :(
Apple does this, and it's obvious they do it to exploit the "decoy effect" when customers shop: why buy a measly regular iPhone when you can spend a little more and get the Pro version?
But when it comes to AI, this tierification only leads to disappointment. Everyone expects the best models from FAANGO (including OpenAI); no one expects Google or OpenAI to offer shitty models that underperform their flagships when you can literally run Llama 2 and Mistral models that you actually own.
They're tiers of computing power and memory. More performance costs more money to produce. The "nano" can fit on a phone, while the others can't.
Are you really objecting to the existence of different price/performance tiers...? Do you object to McDonald's selling 3 sizes of soft drink? There's nothing "decoy" about any of this.
Unless you expect Apple to just sell the high end devices at a loss? Or do you want the high end chips to be sold in the mass market devices and for Apple to just eat the R&D costs?
Large AI models have tight resource requirements. You physically can't run X billion parameters without roughly X billion bytes of memory (at 8-bit precision; about twice that at fp16).
It makes complete sense to have these 3 "tiers". You have a max capability option, a price-performance scaling option, and an edge compute option.
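As a back-of-the-envelope check on the memory point: parameter count times bytes per parameter gives the floor for weight memory, before activations or KV cache. A small helper; the model sizes below are illustrative, not official Gemini parameter counts:

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Minimum memory just to hold the weights: params * bytes per parameter.
    fp16 = 2 bytes/param, int8 = 1, int4 = 0.5. Ignores activations and KV cache."""
    return params_billion * bytes_per_param  # billions of params * bytes = GB

# Illustrative tier sizes only -- not official Gemini figures.
for name, size_b in [("phone-sized (~3B)", 3), ("mid-tier (~50B)", 50), ("flagship (~500B)", 500)]:
    print(f"{name}: ~{weight_memory_gb(size_b, 2):.0f} GB at fp16, "
          f"~{weight_memory_gb(size_b, 0.5):.0f} GB at int4")
```

Even aggressively quantized, a flagship-scale model simply doesn't fit in a phone's RAM, which is why an edge-compute tier has to be a different, smaller model rather than the same one repackaged.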
IMO, tiers can be useful when they make sense and aren't just artificial market segmentation.