0 points skizm 1y ago 0 comments

This might sound dumb, and I'm not sure how to phrase this, but is there a way to measure the raw model output quality without all the more "traditional" engineering work (mountain of `if` statements I assume) done on top of the output? And if so, would that be a better measure of when scaling up the input data will start showing diminishing returns?

(I know very little about the guts of LLMs or how they're tested, so the distinction between "raw" output and the more deterministic engineering work might be incorrect)

whimsicalism 1y ago

what do you mean by the mountain of if-statements on top of the output? like checking if the output matches the expected result in evaluations?

skizm (OP) 1y ago

Like when you type something into the ChatGPT app, I am guessing it will start by preprocessing your input, doing some sanity checks, making sure it doesn’t say “how do I build a bomb?” or whatever. It may or may not alter/clean up your input before sending it to the model for processing. Once processed, there are probably dozens of services it goes through to detect if the output is racist, somehow actually contains a bomb recipe, or maybe copyrighted material: normal pattern-matching stuff, maybe some advanced stuff like sentiment analysis to see if the output is bad-mouthing Trump or something. It might then either alter the output or simply try again.

I’m wondering, when you strip out all that “extra” non-model pre- and post-processing, if there’s some way to measure the performance of what’s left.
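
To make the question concrete, here is a toy sketch of the distinction, not how any real product works. Everything in it is a made-up assumption for illustration: `toy_model` is a lookup table standing in for the raw model, `wrapped` is the hypothetical layer of checks around it, and the blocklists and dataset are invented. The idea is just to score the same prompts against the bare model and against the wrapped pipeline and compare.

```python
from typing import Callable

# Hypothetical blocklists; a real system would be far more elaborate.
BLOCKED_INPUT = ("build a bomb",)
BLOCKED_OUTPUT = ("bomb recipe",)
REFUSAL = "Sorry, I can't help with that."


def toy_model(prompt: str) -> str:
    """Stand-in for the raw model: a lookup table, not a real LLM."""
    answers = {"2+2?": "4", "capital of France?": "Paris"}
    return answers.get(prompt, "I don't know")


def wrapped(model: Callable[[str], str], prompt: str) -> str:
    """The assumed 'mountain of if statements' before and after the model call."""
    if any(term in prompt.lower() for term in BLOCKED_INPUT):
        return REFUSAL
    output = model(prompt.strip())  # pre-processing: clean up the input
    if any(term in output.lower() for term in BLOCKED_OUTPUT):
        return REFUSAL
    return output


def accuracy(system: Callable[[str], str], dataset: list[tuple[str, str]]) -> float:
    """Fraction of prompts whose answer exactly matches the expected one."""
    hits = sum(system(prompt) == expected for prompt, expected in dataset)
    return hits / len(dataset)


# Invented evaluation set: (prompt, expected answer).
dataset = [
    ("2+2?", "4"),
    (" capital of France? ", "Paris"),  # stray whitespace trips the bare model
    ("largest planet?", "Jupiter"),
]

raw_score = accuracy(toy_model, dataset)
wrapped_score = accuracy(lambda p: wrapped(toy_model, p), dataset)
print(f"raw model: {raw_score:.2f}, wrapped pipeline: {wrapped_score:.2f}")
```

In this toy the two scores differ only because the wrapper cleans up the input; the question is whether the same kind of side-by-side comparison is possible for the real thing.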

whimsicalism 1y ago

oh, no - but most queries aren’t being filtered by supervisor models nowadays anyways.. most of the refusal is baked in
