
0 points bredren 1mo ago 0 comments

We saw yesterday that expert orchestration around small, publicly available models can produce results on the level of the unreleased model.

I take a contra view and instead see this as fuel on the fire for tinkering to squeeze advanced functionality out of more available things.

It has always been like this: amateurs improvising tooling and equipment to outdo companies with comparatively infinite resources.


enraged_camel 1mo ago

>> We saw yesterday that expert orchestration around small, publicly available models can produce results on the level of the unreleased model.

This is false. Yesterday's article did not actually show this, and there are many comments in the discussion from actual security people (like tptacek) pointing that out.

adrian_b 1mo ago

There is no doubt that what was shown in the article was correct, because all the documentation needed to prove it was provided, including the prompts given to the models.

What is debatable is how much it mattered that the prompts given to the older models were more detailed than the prompts given to Mythos are likely to have been, and how difficult it is for such prompts to be generated automatically by an appropriate harness.

In my opinion, it is perfectly possible to generate such prompts automatically and, by running several of the existing open-weights models, to find everything that Mythos finds, though probably in a longer time.

Even if the OpenBSD bug was indeed found by giving a prompt equivalent to "search for integer overflow bugs", it would not be difficult to automatically run the existing open-weights models multiple times, giving them a different prompt each time, one for each known class of bugs and vulnerabilities.
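A minimal sketch of that kind of sweep, to make the idea concrete. Everything here is an assumption for illustration: the bug-class list, the prompt wording, and the `models` mapping of names to callables are hypothetical and are not taken from the article or from any real harness. Each callable stands in for whatever client sends a prompt to an open-weights model and returns its text reply.

```python
from typing import Callable, Iterable

# Hypothetical list of bug classes; a real harness would use a longer,
# curated taxonomy.
BUG_CLASSES = [
    "integer overflow",
    "use after free",
    "out-of-bounds read",
    "format string",
]


def build_prompt(source: str, bug_class: str) -> str:
    """Build one class-specific prompt for a single source file."""
    return f"Search the following code for {bug_class} bugs.\n\n{source}"


def sweep(
    source: str,
    models: dict[str, Callable[[str], str]],
    bug_classes: Iterable[str] = BUG_CLASSES,
) -> list[dict]:
    """Ask every model about every bug class and collect the raw answers."""
    findings = []
    for bug_class in bug_classes:
        prompt = build_prompt(source, bug_class)
        for name, ask in models.items():
            findings.append(
                {"model": name, "bug_class": bug_class, "answer": ask(prompt)}
            )
    return findings
```

The cost is one model call per (model, bug class, file) combination, which is why this approach would plausibly take longer than a single stronger model, as noted above.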

While we know precisely which prompts were used with the open-weights models to find all the bugs, we have much vaguer information about the harness used with Mythos and how helpful it was for finding the bugs.

Not even Mythos produced its results after being given only a generic prompt.

They ran Mythos multiple times on each file, with more and more specific prompts. The final run was done with a prompt describing the bug previously found, in which Mythos was asked to confirm the existence of the bug and to provide patches/exploits.
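A rough sketch of that escalation pattern, based only on the description in this comment. The stage names, the prompt wording, and the `ask` callable are hypothetical; the actual harness used with Mythos is not documented here, and this sketch asks only for a patch in the final stage.

```python
from typing import Callable


def escalate(
    source: str, ask: Callable[[str], str], bug_class: str
) -> list[tuple[str, str]]:
    """Run three passes over one file, each with a more specific prompt.

    Returns a list of (stage, answer) pairs in the order they were run.
    """
    transcript = []

    # Pass 1: generic prompt with no hint about what to look for.
    generic = ask(f"Find bugs in this file.\n\n{source}")
    transcript.append(("generic", generic))

    # Pass 2: prompt narrowed to a single class of bug.
    specific = ask(
        f"Look specifically for {bug_class} bugs in this file.\n\n{source}"
    )
    transcript.append(("specific", specific))

    # Pass 3: describe the earlier finding and ask for confirmation and a patch.
    confirm = ask(
        "A previous run reported the following issue:\n"
        f"{specific}\n\n"
        "Confirm whether it exists in this file and, if so, propose a patch.\n\n"
        f"{source}"
    )
    transcript.append(("confirm", confirm))
    return transcript
```

Note that the last pass is a confirmation task, not a discovery task, which is the distinction the rest of this thread argues about.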

See: https://red.anthropic.com/2026/mythos-preview/

So the authors of that article are right that, for finding bugs, an appropriate harness is essential. Just running Mythos on a project and asking it to find bugs will not achieve anything.

bredren OP 1mo ago

From what I can tell, this was not clearly settled.

Your example author actually corrected themselves, saying LLMs “possibly” could perform successfully: https://news.ycombinator.com/item?id=47732696

enraged_camel 1mo ago

>> We already know this is not true, because small models found the same vulnerability.

>> No, they didn't. They distinguished it, when presented with it. Wildly different problem.

https://news.ycombinator.com/item?id=47733343

