That said, I can’t seem to do better than just “vibes” — basically, “oh, this model gave me a good response to this question, so it must be better.”
Now, I have tried keeping track of a couple of benchmarks like the ones I mentioned above, but I generally can’t translate their scores into utility outside the narrow scope the benchmarks test for. There are also so many benchmarks to keep track of, and each takes some learning to understand.
So perhaps my scope isn’t well defined enough. But as a programmer, everything >GPT-4o feels pretty damn similar.
Would love to hear how others evaluate LLMs beyond “just vibes” — generally for programming use, but also when trying to build new AI projects.