aroo 2y ago

Horrible comparison given one score was achieved using 32-shot CoT (Gemini) and the other was 5-shot (GPT-4).


CoT@32 isn't "32-shot CoT"; it's CoT with 32 samples (or rollouts) from the model, and the answer is taken by consensus vote from those rollouts. It doesn't use any extra data, only extra compute. It's explained in the tech report here:

> We find Gemini Ultra achieves highest accuracy when used in combination with a chain-of-thought prompting approach (Wei et al., 2022) that accounts for model uncertainty. The model produces a chain of thought with k samples, for example 8 or 32. If there is a consensus above a preset threshold (selected based on the validation split), it selects this answer, otherwise it reverts to a greedy sample based on maximum likelihood choice without chain of thought.

(They could certainly have been clearer about it -- I don't see anywhere they explicitly explain the CoT@k notation, but I'm pretty sure this is what they're referring to given that they report CoT@8 and CoT@32 in various places, and use 8 and 32 as the example numbers in the quoted paragraph. I'm not entirely clear on whether CoT@32 uses the 5-shot examples or not, though; it might be 0-shot?)
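To make the quoted procedure concrete, here is a minimal Python sketch of one reading of it. This is not code from the tech report: `cot_at_k`, `sample_cot_answer`, `greedy_answer`, and the default `threshold` are all hypothetical names and values chosen for illustration (the report only says the threshold is selected on the validation split).

```python
from collections import Counter
from typing import Callable


def cot_at_k(
    sample_cot_answer: Callable[[], str],
    greedy_answer: Callable[[], str],
    k: int = 32,
    threshold: float = 0.5,  # assumed value; the report tunes this on validation data
) -> str:
    """Return the consensus answer over k CoT samples, else fall back to greedy."""
    # Draw k independent chain-of-thought rollouts and keep only their final answers.
    answers = [sample_cot_answer() for _ in range(k)]
    # Find the most common answer and the fraction of rollouts that agree with it.
    top_answer, count = Counter(answers).most_common(1)[0]
    if count / k > threshold:
        return top_answer
    # No sufficient consensus: revert to a single greedy, non-CoT answer.
    return greedy_answer()


# Usage with stand-in callables instead of a real model:
rollouts = iter(["B", "B", "A", "B"])
print(cot_at_k(lambda: next(rollouts), lambda: "C", k=4, threshold=0.6))  # prints B
```

Note that the k rollouts all see the same prompt, so the only extra cost is k forward samples per question rather than any extra examples.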

The 87% for GPT-4 is also with CoT@32, so it's more or less "fair" to compare it with Gemini's 90% at CoT@32. (Although getting to choose the metric you report for both models is probably a little "unfair".)

It's also fair to point out that with the more "standard" 5-shot eval Gemini does do significantly worse than GPT-4 at 83.7% (Gemini) vs 86.4% (GPT-4).

dragonwriter 2y ago

> I'm not entirely clear on whether CoT@32 uses the 5-shot examples or not, though; it might be 0-shot?

Chain of Thought prompting, as defined in the paper referenced, is a modification of few-shot prompting where the example q/a pairs used have chain-of-thought style reasoning included as well as the question and answer, so I don't think that, if they were using a 0-shot method (even if designed to elicit CoT-style output) they would call it Chain of Thought and reference that paper.
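As a rough illustration of that definition, a few-shot CoT prompt could be assembled like this. It is only a sketch: `build_cot_prompt`, the `Q:`/`A:` layout, and the "The answer is" phrasing are assumptions for the example, not the exact format used in the referenced paper or in the Gemini evals, and the exemplar is made up.

```python
from typing import Sequence, Tuple


def build_cot_prompt(
    exemplars: Sequence[Tuple[str, str, str]],
    question: str,
) -> str:
    """Build a few-shot prompt whose exemplars include reasoning, not just answers."""
    parts = []
    for q, reasoning, answer in exemplars:
        # Plain few-shot would show only q and answer; CoT adds the worked reasoning.
        parts.append(f"Q: {q}\nA: {reasoning} The answer is {answer}.")
    # The target question is left open so the model writes its own reasoning.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# One made-up exemplar; a 5-shot eval would pass five of these.
print(build_cot_prompt([("What is 2 + 3?", "2 plus 3 is 5.", "5")], "What is 4 + 4?"))
```

With zero exemplars this degenerates to a bare question, which is why a 0-shot setup would not match the paper's definition.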

throwaway287391 2y ago

A-ha, thanks! Hadn't looked at or heard of the referenced paper, but yeah, sounds like it's almost certainly also 5-shot then.

It would've been more consistent to call it e.g. "5-shot w/ CoT@32" in that case, but I guess there's only so much you can squeeze into a table.

bitshiftfaced 2y ago

The vibe I was getting from the paper was that they think something's funny about GPT-4's 5-shot MMLU (e.g. possibly leakage into the training set).
