Claude is not cheap, so why is it far and away the most popular model if it's not even top 10 in these performance rankings?
Qwen3 235B ranks highest on these benchmarks among open models, but I have never met anyone who prefers its output over DeepSeek R1's. It's extremely wordy and often gets caught in thought loops.
My interpretation is that the models at the top of ArtificialAnalysis are the ones focusing most heavily on public benchmarks in their training. Note that I am not saying xAI is necessarily doing this nefariously; it could just be that they decided relying on public benchmarks is better bang for the buck than building their own evaluation systems.
But Grok is not very good compared to the Anthropic, OpenAI, or Google models, despite ranking so highly on benchmarks.