> We find Gemini Ultra achieves highest accuracy when used in combination with a chain-of-thought prompting approach (Wei et al., 2022) that accounts for model uncertainty. The model produces a chain of thought with k samples, for example 8 or 32. If there is a consensus above a preset threshold (selected based on the validation split), it selects this answer, otherwise it reverts to a greedy sample based on maximum likelihood choice without chain of thought.
(They could certainly have been clearer about it -- I don't see anywhere they explicitly explain the CoT@k notation, but I'm pretty sure this is what they're referring to given that they report CoT@8 and CoT@32 in various places, and use 8 and 32 as the example numbers in the quoted paragraph. I'm not entirely clear on whether CoT@32 uses the 5-shot examples or not, though; it might be 0-shot?)
The 87% for GPT-4 is also with CoT@32, so it's more or less "fair" to compare that Gemini's 90% with CoT@32. (Although, getting to choose the metric you report for both models is probably a little "unfair".)
It's also fair to point out that with the more "standard" 5-shot eval Gemini does do significantly worse than GPT-4 at 83.7% (Gemini) vs 86.4% (GPT-4).
Chain of Thought prompting, as defined in the paper referenced, is a modification of few-shot prompting where the example q/a pairs used have chain-of-thought style reasoning included as well as the question and answer, so I don't think that, if they were using a 0-shot method (even if designed to elicit CoT-style output) they would call it Chain of Thought and reference that paper.
It would've been more consistent to call it e.g. "5-shot w/ CoT@32" in that case, but I guess there's only so much you can squeeze into a table.