This result, by itself, does not generalize to open-ended problems, whether in business or in research generally. Discovering the specification to build is often the majority of the battle. LLMs aren't bad at this, per se, but they're nowhere near as reliably groundbreaking here as they are on verifiable problems.
This feels a bit like what I tried to express a few weeks ago https://news.ycombinator.com/item?id=46791642 namely that we are just pouring computational resources at verifiable problems and then claiming that, astonishingly, it sometimes works. Sure, LLMs do have a slight bias: they rely on statistics, so it's not purely brute force. But the approach is still pretty much the same: throw stuff at the wall, see what sticks, and once something finally does, report it as grandiose and claim to be "intelligent".
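The "throw stuff at the wall" loop described above is essentially generate-and-verify: a sampler proposes candidates and a cheap mechanical check accepts or rejects them. Here is a minimal sketch of that pattern; all names and the toy verifier are hypothetical, and an LLM would simply be a sampler biased toward likelier candidates rather than a uniform one.

```python
import random

def propose(rng):
    """Hypothetical generator: sample a candidate uniformly at random.
    A statistically biased sampler (like an LLM) would weight plausible
    candidates more heavily, but the loop around it is the same."""
    return rng.randint(0, 10_000)

def verify(candidate):
    """Hypothetical verifier: a cheap, mechanical success check.
    Toy criterion: the candidate is a perfect square ending in 6."""
    root = int(candidate ** 0.5)
    return root * root == candidate and candidate % 10 == 6

def brute_force_search(max_tries=100_000, seed=0):
    """Sample until the verifier accepts, then report the 'discovery'
    along with how many attempts it took."""
    rng = random.Random(seed)
    for attempt in range(1, max_tries + 1):
        candidate = propose(rng)
        if verify(candidate):
            return candidate, attempt
    return None, max_tries

solution, tries = brute_force_search()
print(solution, tries)
```

The point of the sketch is that all the "intelligence" lives in the verifier and the sampling bias; the outer loop is pure trial volume.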
What do we think humans are doing? I think it's not unfair to say our minds are constantly trying to assemble the pieces available to them in various ways, whether we're actively thinking about a problem or doing it in the background as we go about our day.
Every once in a while the pieces fit together in an interesting way and it feels like inspiration.
The techniques we’ve learned likely influence the strategies we attempt, but beyond all this what else could there be but brute force when it comes to “novel” insights?
If it’s just a matter of following a predefined formula, it’s not intelligence.
If it’s a matter of assembling these formulas and strategies in an interesting way, again what else do we have but brute force?
Instead of brute-forcing over infinite options, reduce the problem space by starting with some hunch about the mechanism. Then comes the hard part, which can take decades: synthesize compounds with the traits needed to alter that mechanism in a favourable way, while minimizing unintended side-effects.
Then try on a live or lab grown specimen and note effectiveness. Repeat the cycle, and with every success, push to more realistic forms of testing until it reaches human trials.
Many drugs that reach the last stage - human trials - end up being used for something completely different from what they were designed for! One example is minoxidil: designed to regulate blood pressure, used for regrowing hair!
Our actual software implementation is usually pretty simple; writing up the design spec often takes significantly longer than building the software, because the software isn't the hard part - the requirements are. I suspect the folks who are terrible at describing their problems are going to need help from experts who sit somewhere between SWE, product manager, and interaction designer.
RLHF is an attempt to push LLMs pre-trained with a dopey reconstruction loss toward something we actually care about. Imagine if we could find a pre-training criterion that cared about truth and/or plausibility in the first place!