
0 points · consumer451 · 11mo ago

Would love to see this repeated with this latest version from Google.

Man, what's really missing from all of this is a 3rd party AI Consumer Reports type site for all of these LLM tools. Whoever does this thing that does not scale will have a highly referenced site on their hands.


jeffbee 11mo ago

Throughout the entire 20th century the main determinant of a Consumer Reports rating for a car was whether you could put a wheelchair in the trunk. Hopefully the AI agent industry does not sprout a similarly worthless metric.

consumer451 OP 11mo ago

I almost didn't use that as the comparison for their lack of rigor, but it gets the idea across.

shigawire 11mo ago

Isn't that what llmarena does?

consumer451 OP 11mo ago

It tries to, in a way that scales easily but is also easily gamed.

I want a staff of human testers, each with domain expertise. If the goal is to replace humans, should there not be a real human metric?

I want a physicist asking their battery of physics questions, 4 different kinds of devs asking their battery of dev problems, a couple chefs asking for cooking techniques, etc.

Now on to "Deep Research": 6 different kinds of OSINT/secondary analysts who pose new problems each time and compare the output to their own days of human work.

We really need this as a species, otherwise the brain dead C-Suites of the world are going to keep buying the hype, which is often very premature. This could have real consequences, and it apparently already has.

It's insane to me that we are investing, what, almost $1T into LLMs, and have not spent the ~$1.5M/year to do what I described above.

consumer451 OP 11mo ago

^ I really should have used "myopic" instead of "brain dead" to describe the C-Suites of the world. My apologies.

SequoiaHope 11mo ago

I suppose Consumer Reports could do it!
