> It's a different kind of tool doing a different kind of work, and that makes a clean apples-to-apples comparison to earlier models difficult.
They claim it’s a different kind of tool and then describe using it the same way you’d use any other model. This felt way worse than the average Cloudflare blog and mostly just rehashed the Mythos announcement, which had already called out the key parts: chaining and crafting examples.
Hah, I was trying to parse this too.
Charitably perhaps they're being vague on exactly what's different because they're still under NDA.
How long has it been since you took your average? Lately all Cloudflare output has been heavily AI'd.
Their owners are invested in AI and need AI to do well. If this goal clashes temporarily with the goals set up for Cloudflare, so be it.
Yep. Cloudflare has lost my respect over the last six months.
The posts about pro-AI initiatives and APIs for AI and then laying off a lot of people was pretty impressive for how to do the wrong thing.
The post takes a while to get around to saying that, and could have included more detail besides the workflow diagram and table (which they flag as only "an example of" such a harness), but it does answer the question. It's a different kind of tool because it's a model rather than a harness+model pair.
This was new. I'm surprised that a model specifically designed for security research and gated to professionals is refusing legitimate requests.
You're right that they're using a harness like everyone else. The general idea of giving the model a harness is not going to change. I mean even humans need harnesses to accomplish some things.
Because of its capabilities, a new kind of harness can be built for it, so the entire system (model + harness) is a different kind of tool than, say, Claude Code.
[1] https://xbow.com/blog/mythos-offensive-security-xbow-evaluat...
Really this is why the LLM needs to be able to write exploits for issues it finds. Of course that leads down a rabbit hole of other issues. But if an exploit works, then that's pretty conclusive evidence.
Interesting that gpt-5.5, while not as good as mythos, also seems like a decent step up
> "Why it matters"
It doesn't. It's a corporate blog, and those were rarely written in one author's voice anyway, but it's interesting to see that even large organisations are outsourcing their blogs to LLMs.
I will upgrade the "why it matters" to "and now AI output is part of the training data". A day is coming when the punched-up AI verbiage will be the norm and hard to distinguish unless you're from the previous generation. Sort of in the way that I miss some aspects of Usenet.
Seems stifling. We'll need some way to reward human creativity and out-of-bounds thinking before our greatest corpus of human intellect is bounded by whenever and whatever was trained on.
It's like staring down the barrel of a gun and taking the time to make quips about the type of paper the gun advertisement was printed on.
So what, we take every function and every vulnerability type and just run the agents millions of times?
I would expect Mythos to be able to find vulnerabilities without having them pointed out to it; otherwise it's no better than other agents. It just has a better harness.
Yes.
We build a skill where a coordinator AI enumerates all possible vulnerability types and all functions, then launches parallel max effort Mythos agents against all vulnerability x function pairs.
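A rough sketch of that fan-out, for anyone who wants to picture it. Everything here is an assumption for illustration: the vulnerability list, the function names, and `run_agent` are hypothetical stand-ins, and a real version would have the coordinator model enumerate the inputs and would call an actual model API instead of returning a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical inputs: in practice the coordinator model would enumerate these.
VULN_TYPES = ["buffer-overflow", "use-after-free", "integer-overflow"]
FUNCTIONS = ["parse_header", "copy_payload"]


def run_agent(vuln_type: str, function: str) -> dict:
    """Stand-in for one agent run; a real version would call a model API."""
    # Placeholder verdict so the sketch runs without any network access.
    return {"vuln_type": vuln_type, "function": function, "finding": None}


def fan_out(vuln_types, functions, max_workers=4):
    """Launch one agent per (vulnerability type, function) pair."""
    pairs = list(product(vuln_types, functions))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as `pairs`.
        return list(pool.map(lambda pair: run_agent(*pair), pairs))


results = fan_out(VULN_TYPES, FUNCTIONS)
print(len(results))  # one result per pair: 3 * 2 = 6
```

The pair count grows as the product of the two lists, which is where the token bill comes from.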
I've been doing something like this with Opus already. General code review. Enumerated dimensions like correctness, security, maintainability, etc. Asked the coordinator AI to explore the code and autodiscover subsystem boundaries. Then it runs an absurd amount of dimension x subsystem review agents.
It burns a lot of tokens and takes me like three days to complete a review session, but the results have been excellent so far. The resulting TODO list will keep me occupied for quite a while.
I can only imagine what these corporations with unlimited money are doing. Poor me can't afford API prices so I had to not only limit scope but also design a filesystem-like journaling mechanism for the agents in order to deal with the rate limit interruptions. I'm sure Cloudflare is not gonna have that problem.
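For the curious, the journaling idea can be sketched roughly like this. It is not my actual setup, just a minimal version under assumed names: the task ids, the `RateLimited` exception, and the flaky `review` function are all hypothetical, and the journal is a plain append-only JSONL file.

```python
import json
import os
import tempfile


class RateLimited(Exception):
    """Hypothetical error standing in for a provider rate-limit response."""


def load_done(journal_path):
    """Return the set of task ids already recorded in the journal."""
    done = set()
    if os.path.exists(journal_path):
        with open(journal_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    done.add(json.loads(line)["task_id"])
    return done


def run_with_journal(tasks, review, journal_path):
    """Run each pending task once, journaling every success as it happens."""
    done = load_done(journal_path)
    ran = []
    for task_id in tasks:
        if task_id in done:
            continue  # finished in an earlier session, skip it
        result = review(task_id)  # may raise; earlier entries stay on disk
        with open(journal_path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"task_id": task_id, "result": result}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        ran.append(task_id)
    return ran


def make_flaky_review(fail_on):
    """Build a fake review that hits a rate limit once on one task."""
    state = {"failed": False}

    def review(task_id):
        if task_id == fail_on and not state["failed"]:
            state["failed"] = True
            raise RateLimited(task_id)
        return f"reviewed {task_id}"

    return review


tasks = ["security:auth", "security:storage", "correctness:auth"]
journal = os.path.join(tempfile.mkdtemp(), "journal.jsonl")
review = make_flaky_review("security:storage")

try:
    run_with_journal(tasks, review, journal)  # interrupted on the second task
except RateLimited:
    pass

resumed = run_with_journal(tasks, review, journal)  # picks up where it stopped
print(resumed)
```

The second call only runs what the first one did not finish, so an interruption costs one task instead of the whole session.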
And note that Hunt tasks can be queued from previous Trace tasks, i.e. you find a vuln in one layer, so you queue a hunt for corresponding vulns in the layers that could exploit your first finding.
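Here is how I picture that queueing, as a toy sketch. The post does not spell out the mechanics, so the layer map, the `trace` and `hunt` stubs, and the rule that a reachable finding keeps propagating to callers are all my assumptions.

```python
from collections import deque

# Hypothetical map: layer -> layers that call into it and could reach a bug there.
CALLERS = {
    "parser": ["http_handler"],
    "http_handler": ["router"],
    "router": [],
}


def trace(layer):
    """Stand-in for a Trace task; pretend only the parser has a finding."""
    return ["oob-read"] if layer == "parser" else []


def hunt(layer, finding):
    """Stand-in for a Hunt task; pretend every caller can reach the finding."""
    return True


def process(start_layers):
    """Run Trace tasks, queueing Hunt tasks in calling layers for each finding."""
    queue = deque(("trace", layer, None) for layer in start_layers)
    seen = set()  # (layer, finding) pairs already hunted
    log = []
    while queue:
        kind, layer, finding = queue.popleft()
        if kind == "trace":
            for found in trace(layer):
                log.append(("trace", layer, found))
                for caller in CALLERS.get(layer, []):
                    queue.append(("hunt", caller, found))
        else:
            if (layer, finding) in seen:
                continue  # do not hunt the same pair twice
            seen.add((layer, finding))
            if hunt(layer, finding):
                log.append(("hunt", layer, finding))
                # Assumed rule: a reachable finding propagates one layer up.
                for caller in CALLERS.get(layer, []):
                    queue.append(("hunt", caller, finding))
    return log


log = process(["parser"])
for entry in log:
    print(entry)
```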
But I did think the adversarial review (while not novel at all and talked about much in HN circles) is interesting and distinct, at least. I need to put this to work in more of my workflows. I think it could be beneficial for non-coding tasks, too.
https://blog.cloudflare.com/cyber-frontier-models/#what-a-ha...
Over time, I wonder if these models will be able to generate more secure code by default by doing this kind of exploitability testing before ever merging their code.
* they, I mean all foundation model providers, as OpenAI seems to be going in the same direction
Lots of people feel that Mythos is a psyops campaign, but I don’t really understand the skepticism. Most of it seems to stem from the general distrust of things that aren’t publicly available.
A few Anthropic employees have described Mythos as a general purpose model improvement, but that claim has yet to be widely backed up so that’s the only place I’m remaining skeptical.
For the domain of security research, I’m willing to buy the narrative.
I get that you want to address them or whatever before releasing info but I keep seeing these claims with barely any data and I’m like…how do you expect people to not be skeptical?
I mean hell if you’re a security professional you’re literally paid to be skeptical.
I think this statement seems to align with some of the other independent tests of Mythos[1]. It did very well on long agentic work which I expect is what they trained it for, and that requires being able to find these tangential links between loosely related topics in the context window.
[1] I'm mainly referring to https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos...
Re-write your Rust into C++ to drown the attacker in false positives? ;)
Claude Code's harness is remarkable for many use cases, particularly with 1M context sizes. But it's also limited when the scale of code or data to read becomes close to that, or exceeds it. The idea that a cluster of actors can work on a shared, structured set of context snippets, and have guidance around what is relevant to them, is an incredibly useful model outside of cybersecurity as well.
LLMs are trained on Ed Sheeran lyrics
So nothing new then.
Kringe sloppy AI writing.
I have been encouraging people to think about agentic coding in the same way.
Let agents do the reading and writing and inspections. Human does the thinking.
Ask an agent that is looking at a firearm specification schematic "what is wrong with this?" and the response is "this thing contains an explosion and can kill". The human says "that's the function", when the human should have been asking "based upon the materials used, are the fault tolerances sufficient to maintain structural integrity?"
I expressed some concerns along the same lines in the thread about the Mythos evaluation curl did a few days ago, which sounded a lot like the "passing in the repo and telling it go!" type of workflow that this post describes as dramatically less effective.
Disappointed that the post is very slim on details beyond this however. No hard numbers. Not comparatively, not in isolation. Would have arguably been kinda the point.
I don't think guardrails are useful long term. Assuming we don't see the end of open near-frontier models, it is folly to try to keep models from doing exploit generation. The solution needs to be all software projects writing code under the assumption that hackers will be running LLMs against their code in search of exploits and write secure code accordingly.
but I agree that guardrails will only help for like, 3-6 months. we should be screening as much as we can with Mythos; unfortunately, Anthropic is only giving access to the big players.
I think the curl folks finding it underwhelming is more of a testament to their code being subjected to a lot of tests/attacks/auditing over the past years compared to many other codebases. It's not going to find magically insurmountable exploits on its own and "pwn teh w0rld".
At the same time, there is so much shitty non-memory-safe code out there (C/C++ mainly) or logically weak code (much of it vibe-coded or otherwise written by inexperienced devs) that will be easy pickings for anyone pointing Mythos at those codebases/services, and that will eventually lead to chaos, since the cost of a customized exploit has gone from days to months of expensive researcher time to some token spending.
Now if they noticed that they could find exploit chains easily in a lot of popular software, some embargo and hardening to give popular OSS packages time to not be exploitable by default does help people (and the NSA that probably has a preview).
"We saw consistently more false positives from projects written in memory-unsafe languages."
So while there may be a greater probability to find bugs in C/C++ projects, there is also a greater probability that there will be more work that must be done by humans to verify that real bugs have been found.
Static scanners are OK at finding a few particular types of issues, and really bad at more abstract issues. Also, having rules where you must pass static analysis has to be followed up with actually making sure your code monkeys aren't writing bullshit that confuses the scanner and lets it pass while doing nothing for security (or adding nice logic traps).
Most external security firms looking at code are more useless than a zero with the circle rubbed out. Had a fun example from a while back where the team that wrote the code inserted an intentional security flaw to be sure they were catching anything. Problem is they were giving access to the entire git history so these stood out. The moment they just gave flat code the security team's ability to find flaws disappeared.
LLMs seem to have a pretty good grasp on finding flaws in code like this once you can get the issue to fit within context and execution time. When I hear things like Mythos getting a much longer time to work on the problem, then at least to me the number of issues it's picking up makes a lot more sense.
That even their model aimed at security research tries to be a pedantic better-than-thou annoys me much.
I build an agentic loop framework at work, and I need the model to test some boundaries and error-mechanisms, but Opus keeps whining that it's not ready to do these "bad" things and tells me to do it myself instead. Makes me roll my eyes...
I’m a security researcher
“Oh in that case”
The author of this blog post does not acknowledge the existence of subagents and thinks that it's not possible for a model to come up with multiple ideas and have multiple streams of thought at the same time.
This is something I've been anticipating. Imagine this happening on a 500k+ line project scattered across 10+ repos.
It would be easier and cheaper to pay me to rewrite the whole thing from scratch than to fix all the vulnerabilities.