benreesman 2y ago

Keep in mind that modern quantitative approaches to LLM evaluation have been effectively co-designed with the rise of OpenAI, and folks like Ravenwolf routinely disagree with the leaderboards.

There's also very little, if any, credible literature on what constitutes a statistically significant difference on MMLU or whatever. So many parties have a massive vested interest (the YC ecosystem is invested in Sam, MSFT is invested in OpenAI, the US is invested in not-France, a bunch of academics are invested in GPT-is-borderline-AGI, Yud is either a Time Magazine cover author or a Harry Potter fanfic guy, etc.) in seeing GPT-4.5 at the top of those rankings, and in treating whichever bolded number shows a < 10% lift as state of the art, that I think everyone should just try a bunch of models and optimize per use case.

I have my own biases as well and freely admit that I love to see OpenAI stumble (no I didn't apply to work there, yes I know knuckleheads who go on about the fact they do).

And once you factor in "mixtral is aligned to the demands of the user and GPT balks at using profanity while happily taking sides on things Ilya has double-spoken on", even e.g. MMLU is nowhere near the whole picture.

It's easy and cheap to just try both these days, don't take my word for which one is better.

icelancer 2y ago

> It's easy and cheap to just try both these days, don't take my word for which one is better.

I literally use 8x-7b on my on-prem GPU cluster and have several fine tunes of 7b (which I said in the previous post). I've used mistral-medium.

GPT-4-turbo is better than them all on all benchmarks, human preference, and anything that isn't biased vibes. My opinion, such as it is, is that GPT-4-turbo is by far the best.

I have no vested interest in it being the best. I'd actually prefer if it wasn't. But all objective data points to it being the best and most lived experiences that are unbiased agree (assuming broad model use and not hyperfocused fine-tunes; I have Mistral-7b fine-tunes beating 4-turbo in very limited domains, but that hardly counts).

The rest of your post I really have no idea what's going on, so good luck with all that I guess.

MacsHeadroom 2y ago

Mistral Medium beats 4.5 on the censorship benchmark. It doesn't refuse to help with anything that could be vaguely non-PC or could potentially be used to hurt anyone in the wrong hands, including dangerously hot salsa recipes.

wokwokwok 2y ago

That's not a metric.

That's a use case.

Certainly, no one here is disputing that there are things OpenAI refuses to allow, and given that the effectiveness of using GPT-4 on them is literally zero, a sweet potato connected to a spring and a keyboard will "beat" GPT-4, if that's your scoring metric.

If you want a meaningful comparison you need tasks that both tools are capable of doing, and then see how effective they are.

Claiming that Mistral Medium beats it is like me claiming that RenderMan beats DALL-E 2 at rendering 3D models; yes, technically they both generate images, but since it's not possible to use DALL-E 2 to render a 3D model, it's not really a meaningful comparison, is it?

benreesman (OP) 2y ago

I'm a big proponent of freedom in this space (and remain one), but Dolphin is fucking scary.

I don't have any use cases for crime in my life at the moment beyond wanting to pirate, like, Adobe Illustrator before signing up for an uncancelable subscription, but it will do arbitrary things within its abilities, and it's Google with a grudge in terms of how to do anything you ask. I stopped wanting to know when it convinced me it could explain how to stage a coup d'etat. I'm back on mixtral-8x7b.

dbuxton 2y ago

Agree with this. I would say that the rate of progress from Mistral is very encouraging though in terms of having multiple plausible contenders for the crown.

epups 2y ago

> Keep in mind that modern quantitative approaches to LLM evaluation have been effectively co-designed with the rise of OpenAI, and folks like Ravenwolf routinely disagree with the leaderboards.

Sorry, but you're talking complete nonsense here. The benchmark by LMSYS (Chatbot Arena) cannot be gamed, and Ravenwolf is a random-ass poster with no scientific rigor to his benchmarks.

ParetoOptimal 2y ago

Cannot be gamed? C'mon now... You could pay a bunch of people to vote for your model in the arena.

epups 2y ago

No you can't, because you actually don't know which model is which when you vote.
