One could team up with HackerRank/LeetCode, let the model code in the interface (maybe there's an API for that already, no idea), execute its code verbatim, and see how many test cases it gets right the first time around. Then, like for humans, give it a clue about one of the tests not passing (or code not working, too slow, etc.). Give points based on the difficulty of the question and the number of clues needed.
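The scoring loop could look something like this — a minimal sketch where `model_solve`, `run_tests`, and the scoring formula are all hypothetical, just to make the idea concrete:

```python
def score(difficulty: int, clues_used: int, max_clues: int = 5) -> float:
    """Hypothetical scoring: more points for harder questions, fewer per clue."""
    if clues_used > max_clues:
        return 0.0
    return difficulty * (1 - clues_used / (max_clues + 1))

def evaluate(model_solve, run_tests, difficulty: int, max_clues: int = 5) -> float:
    """model_solve(clues) -> candidate solution; run_tests(sol) -> list of failure descriptions."""
    clues = []
    for attempt in range(max_clues + 1):
        failures = run_tests(model_solve(clues))
        if not failures:
            return score(difficulty, clues_used=attempt)
        clues.append(failures[0])  # reveal one failing test, like hinting a human
    return 0.0  # ran out of clues
```

A model that solves the problem on the first try gets full difficulty points; each clue revealed shaves the score down.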
I guess the obvious caveat is that these models are probably overfitted on these types of questions. But a specific benchmark could be made containing questions kept secret from models. Time to build "Botrank" I guess.