What is up with that eval @32? Am I reading it correctly that they are generating 32 responses and taking majority? Who will use the API like that? That feels like such a fake way to improve metrics
Page 7 of their technical report [0] has a better apples to apples comparison. Why they choose to show apples to oranges on their landing page is odd to me.