It reveals how good LLM use, like any other engineering tool, requires good engineering thinking – methodical, and oriented around thoughtful specifications that balance design constraints – for best results.
> In fact my entire system prompt is speculative so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering
Just like Eisenhower's famous "plans are useless, planning is indispensable" quote. The muscle you build is creating new plans, not memorizing them.
It all seems like vibes-based incantations. "You are an expert at finding vulnerabilities." "Please report only real vulnerabilities, not any false positives." Organizing things with made-up HTML tags because the models seem to like that for some reason. Where does engineering come into it?
> In fact my entire system prompt is speculative in that I haven’t ran a sufficient number of evaluations to determine if it helps or hinders, so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering. Once I have ran those evaluations I’ll let you know.
1. Having workflows that let you provide meaningful context quickly. Very helpful.
2. Arbitrary incantations.
I think No. 2 may provide some random amount of value with one model and not another, but as a practitioner you shouldn't need to worry about it long-term. The patterns models pay attention to will change over time, especially as they become more capable. No. 1 is where the value is.
As an example: as a systems grad student, I find it a lot more useful to maintain a project wiki with LLMs in the picture. It makes coordinating with human collaborators easier too, and I just copy-paste the entire wiki before beginning a conversation. Any time I have a back-and-forth with an LLM about design discussions that I want archived, I ask it to emit markdown, which I then paste into the wiki. It's not perfectly organized, but it keeps the key bits in one place and makes generating papers etc. that much easier.
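That workflow is easy to script. Here's a minimal sketch, assuming the wiki lives as plain markdown files in one directory (the layout and function name are made up for illustration):

```python
from pathlib import Path

def build_context(wiki_dir: str, preamble: str = "Project wiki for context:") -> str:
    """Concatenate every markdown page in the wiki into one prompt prefix,
    with a heading per page so the model can tell the sections apart."""
    parts = [preamble]
    for page in sorted(Path(wiki_dir).glob("*.md")):
        parts.append(f"## {page.stem}\n{page.read_text()}")
    return "\n\n".join(parts)
```

Paste the returned string at the top of a fresh conversation; new design notes the LLM emits as markdown just become another `.md` page.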
The author deserves more credit here than just "vibing".
I use these prompts everywhere. I get significantly better results mostly because it encourages backtracking and if I were to guess, enforces a higher confidence threshold before acting.
The expert engineering ones usually end up creating mountains of slop, refactoring things, and touching a bunch of code it has no business messing with.
I also have used lazy prompts: "You are positively allergic to rewriting anything that already exists. You have multiple mcps at your disposal to look for existing solutions and thoroughly read their documentation, bug reports, and git history. You really strongly prefer finding appropriate libraries instead of maintaining your own code"
You just described one critical aspect of engineering: discovering a property of a system and feeding that knowledge back into a systematic, iterative process of refinement.
But yeah prompt engineering is a field for a reason, as it takes time and experience to get it right.
A problem with LLMs as well is that they're inherently probabilistic, so sometimes one will just choose an answer with a super low probability. We'll probably get better at this in the next few years.
Quantitative benchmarks are not necessary anyway. A method either gets results or it doesn't.
Those prompts should be renamed as hints. Because that's all they are. Every LLM today ignores prompts if they conflict with its sole overarching goal: to give you an answer no matter whether it's true or not.
I like to think of them as beginnings of an arbitrary document which I hope will be autocompleted in a direction I find useful... By an algorithm with the overarching "goal" of Make Document Bigger.
(As an engineer it’s part of your job to know if the problem is being solved correctly.)
You invoke "engineering principles", but software engineers constantly trade in likelihoods, confidence intervals, and risk envelopes. Using LLMs is no different in that respect. It's not rocket science. It's manageable.
What's the alternative?
Are you insinuating that dealing with unstable and unpredictable systems isn't an area where engineering principles are frequently applied to solve complex problems?
It’s surprisingly effective to ask LLMs to help you write prompts as well; e.g., all my prompt snippets were designed with the help of an LLM.
I personally keep them all in an org-mode file and copy/paste them on demand in a ChatGPT chat as I prefer more “discussion”-style interactions, but the approach is the same.
Works incredibly well, and I created it with its own help.
The more you can frame the problem with your expertise, the better the results you will get.
Every time a new frontier LLM is released (excluding LLMs that use input as training data) I run the interview questions through it. I’ve been surprised that my rate of working responses remains consistently around 1:10 for the first pass, and often takes upwards of 10 rounds of poking to get it to find its own mistakes.
So this signal-to-noise ratio makes sense for even more obscure topics.
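Measuring that first-pass rate is easy to automate. A minimal sketch of such a harness, where `questions` pairs each interview prompt with its own pass/fail checker and `ask_model` is a hypothetical stand-in for whatever model API you're evaluating:

```python
def pass_rate(questions, ask_model):
    """Fraction of interview questions whose first-pass answer
    satisfies that question's checker function."""
    passed = sum(1 for prompt, check in questions if check(ask_model(prompt)))
    return passed / len(questions)
```

Run it against each new frontier model with the same question set and the ~1:10 figure above becomes a number you can track over time.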
Interviewees don't get to pick the language?
If you're hiring based on proficiency in a particular tech stack, I'm curious why. Are there that many candidates that you can be this selective? Is the language so dissimilar that the uninitiated would need a long time to get up to speed? Does the job involve working on the language itself and so a specifically deep understanding is required?
We’ve found a wide range of results, and we have a conference talk coming up soon where we’ll be releasing everything publicly, so stay tuned for that. It’ll be pretty illuminating on the state of the space.
Edit: confusing wording
Wouldn't such an LLM be closer to a synthetic version of a person who has worked on a codebase for years and learnt all its quirks?
There's only so much you can fit in a long context; some codebases are already 200k tokens just for the code as-is, so I don't know.
You can spend all day reading slop, or you can get good at this yourself and be much more efficient at the task. Especially if you're the developer and already know where to look and how things work, catching up on security issues relevant to your situation will be much faster than looking for a needle in the haystack of LLM output.
And it will do this no matter how many prompts you try or how forcefully you ask it.
Designing and building meaningfully testable non-trivial software is orders of magnitude more complex than writing the business logic itself. And that’s if you compare writing greenfield code from scratch. Making an old legacy code base testable in a way conducive to finding security vulns is not something you just throw together. You can be lucky with standard tooling like sanitizers and valgrind but it’s far from a panacea.
When the cost of inference gets near zero, I have no idea what the world of cyber security will look like, but it's going to be a very different space from today.
The "don't blame the victim" trope is valid in many contexts. One application here might be "hackers are attacking vital infrastructure, so we need to find vulnerabilities first". And hackers use AI now, likely hacked into and used for free, to discover vulnerabilities. So we must use AI!
Therefore, the hackers are contributing to global warming. We, dear reader, are innocent.
We are facing a global climate change event, yet we continue to burn resources on trivial shit like it’s 1950.
Take a problem with a clear definition and an evaluation function, and let the LLM reduce the size of the solution space. LLMs are very good at pattern reconstruction, and if the solution has a similar pattern to something known before, this can work very well.
In this case the problem is a specific type of security vulnerability and the evaluator is the expert. This is similar in spirit to other recent endeavors where LLMs are used in genetic optimization, albeit on a different scale.
Here’s an interesting read on “Mathematical discoveries from program search with large language models”, which I believe was also featured on HN in the past:
https://www.nature.com/articles/s41586-023-06924-6
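The loop from that paper can be caricatured in a few lines. In this sketch `mutate` stands in for an LLM proposing a variant of an existing candidate and `score` for the evaluator (here the security expert or an automated check); both are hypothetical stand-ins, and all the names are made up:

```python
import random

def program_search(seed, mutate, score, population=8, generations=20, rng=None):
    """Keep a small pool of candidate solutions. Each generation, the
    (LLM-backed) `mutate` proposes a variant of a random pool member,
    and the evaluator `score` decides which candidates survive."""
    rng = rng or random.Random(0)
    pool = [seed]
    for _ in range(generations):
        parent = rng.choice(pool)
        pool.append(mutate(parent, rng))
        pool.sort(key=score, reverse=True)
        pool = pool[:population]          # selection: keep the best
    return max(pool, key=score)
```

The LLM reduces the solution space by only proposing plausible-looking candidates; the evaluation function does the rest.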
One small note: concluding that the LLM is “reasoning” about code just _based on this experiment_ is a bit of a stretch, IMHO.
[1] https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-f...
[0] https://security.googleblog.com/2024/11/leveling-up-fuzzing-...
[1] https://googleprojectzero.blogspot.com/2024/10/from-naptime-...
What the post says is "Understanding the vulnerability requires reasoning about concurrent connections to the server, and how they may share various objects in specific circumstances. o3 was able to comprehend this and spot a location where a particular object that is not reference counted is freed while still being accessible by another thread. As far as I'm aware, this is the first public discussion of a vulnerability of that nature being found by a LLM."
The point I was trying to make is that, as far as I'm aware, this is the first public documentation of an LLM figuring out that sort of bug (non-trivial amount of code, bug results from concurrent access to shared resources). To me at least, this is an interesting marker of LLM progress.
Certainly for government agencies and others this will not be a factor; it is only a factor for everyone else. This will push people toward other models and agents without these restrictions.
It is safe to assume that a large number of vulnerabilities exist in important software all over the place. Now they can be found. This is going to set off arms race game theory applied to computer security and hacking. Probably sooner than expected...
But what if there's a missing piece of the puzzle that the author and devs missed or assumed o3 covered, but that was in fact outside o3's context and would invalidate this vulnerability?
I'm not saying there is, nor am I going to take the time to do the author's work for them, rather I am saying this report is not fully validated which feels like a dangerous precedent to set with what will likely be an influential blog post in the LLM VR space moving forward.
IMO the idea of PoC || GTFO should be applied more strictly than ever before to any vulnerability report generated by a model.
The underlying perspective that o3 is much better than previous or other current models still remains, and the methodology is still interesting. I understand the desire and need to get people to focus on something by wording it a specific way, it's the clickbait problem. But dammit, do better. Build a PoC and validate your claims, don't be lazy. If you're going to write a blog post that might influence how vulnerability researchers conduct their research, you should promote validation and not theoretical assumption. The alternative is the proliferation of ignorance through false-but-seemingly-true reporting, versus deepening the community's understanding of a system through vetted and provable reports.
1) If it is actually a UAF, or if there is some other mechanism missing from the context that prevents the UAF. 2) The category and severity of the vulnerability: is it a DoS or RCE, or is the only impact causing a thread to segfault?
This is all part of the standard vulnerability research process. I'm honestly surprised it got merged in without a PoC, although with high profile projects even the suggestion of a vulnerability in code that can clearly be improved will probably end up getting merged.
> I tried to strongly guide it to not report false positives, and to favour not reporting any bugs over reporting false positives. I have no idea if this helps, but I’d like it to help, so here we are. In fact my entire system prompt is speculative in that I haven’t ran a sufficient number of evaluations to determine if it helps or hinders, so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering. Once I have ran those evaluations I’ll let you know.
Q1: Who is using ksmbd in production?
Q2: Why?
2. Samba performance sucks (by comparison) which is why people still regularly deploy Windows for file sharing in 2025.
Anybody know if this supports native Windows-style ACLs for file permissions? That is the last remaining reason to still run Solaris but I think it relies on ZFS to do so.
Samba's reliance on Unix UID/GID syncing as part of its security model is still stuck in the 1970s, unfortunately.
The caveat is the in-kernel SMB server has been the source of at least one holy-shit-this-is-bad zero-day remote root hole in Windows (not sure about Solaris) so there are tradeoffs.
Sigh. This is why we can't have nice things.
Like, yeah, having SMB in the kernel is faster, but honestly it's not fundamentally faster. It just seems the will to make Samba better isn't there.
You could in theory automate the entire process, treat the LLM as a very advanced fuzzer. Run it against your target in one or more VMs. If the VM crashes or otherwise exhibits anomalous behavior, you've found something. (Most exploits like this will crash the machine initially, before you refine them.)
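The skeleton of that automation is simple. In this sketch, `propose_input` is a hypothetical stand-in for an LLM call and `run_target` for executing the candidate against the target in a disposable VM and reporting what happened; both stubs and all names are made up:

```python
def fuzz_loop(propose_input, run_target, rounds=100):
    """Ask the model for candidate inputs; keep any that crash the
    target VM or otherwise exhibit anomalous behavior."""
    findings = []
    for i in range(rounds):
        candidate = propose_input(i)
        outcome = run_target(candidate)   # e.g. "ok", "crash", "hang"
        if outcome != "ok":
            findings.append((candidate, outcome))
    return findings
```

The hard engineering is entirely inside `run_target`: snapshotting the VM, restoring it after each run, and reliably classifying a crash versus a hang.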
On one hand: great application for LLMs.
On the other hand: conversely implies that demonstrating this doesn't mean that much.
(Also yeah feels like the "FIRST!!1!eleven" thing metastasized from comment sections into C-level executives…)
This is likely because the author didn't give Claude a scratchpad or space to think, essentially forcing it to mix its thoughts with its report. I'd be interested to see if using the official thinking mechanism gives it enough space to get differing results.
[0] https://arxiv.org/pdf/2201.11903
[1] https://docs.anthropic.com/en/docs/build-with-claude/extende...
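Even without the official thinking feature, you can approximate a scratchpad by asking the model to put its working inside one tag and its findings inside another, then discarding the working before reading the report. A minimal sketch; the tag names are made up and nothing here is a documented API:

```python
import re

def extract_report(response: str) -> str:
    """Strip a <scratchpad>...</scratchpad> section and return only the
    <report>...</report> body; fall back to the raw text if the model
    didn't follow the format."""
    match = re.search(r"<report>(.*?)</report>", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()
```

That separation is the point of the comment above: the model gets room to backtrack without polluting the artifact you actually read.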
You've got all the elements for a successful optimization algorithm: 1) A fast and good enough sampling function + 2) a fairly good energy function.
For 1) this post shows that LLMs (even unoptimized) are quite good at sampling candidate vulnerabilities in large code bases. A 1% accuracy rate isn't bad at all, and they can be made quite fast (at least very parallelizable).
For 2) theoretically you can test any exploit easily and programmatically determine if it works. The main challenge is getting the energy function to provide a gradient: some signal when you're close to finding a vulnerability/exploit.
I expect we'll see such a system within the next 12 months (or maybe not, since it's the kind of system that many lettered agencies would be very interested in).
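Put together, the two elements above make a very plain search loop. In this toy sketch, `sample` stands in for the LLM proposing candidate vulnerabilities (element 1) and `energy` for the scoring function (element 2), with zero meaning "exploit confirmed"; both are hypothetical stand-ins:

```python
def search(sample, energy, iterations=1000, threshold=0.0):
    """Greedy search: keep the lowest-energy candidate seen so far,
    stopping early if one crosses the 'exploit works' threshold."""
    best, best_e = None, float("inf")
    for i in range(iterations):
        cand = sample(i)
        e = energy(cand)
        if e < best_e:
            best, best_e = cand, e
        if best_e <= threshold:
            break
    return best, best_e
```

The gradient problem from element 2 shows up here directly: if `energy` is flat everywhere except at a working exploit, this degenerates into blind sampling, which is why partial-progress signals matter so much.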
I recently found a pretty serious security vulnerability in an open source very niche server I sometimes use. This took virtually no effort using LLMs. I'm worried that there is a huge long tail of software out there which wasn't worth finding vulnerabilities in for nefarious means manually but if it was automated could lead to really serious problems.
I wouldn't (personally) call it an alignment issue, as such.
I agree after time you end up with a steady state but in the short medium term the attackers have a huge advantage.
https://lwn.net/Articles/871866/

This also has nothing to do with Samba, which is a well-trodden path.
So why not attack a codebase that is rather more heavily used and older? Why not go for vi?
4 years after the article, does any relevant distro have that implementation enabled?
But this poster actually understands the AI output and is able to find real issues (in this case, use-after-free). From the article:
> Before I get into the technical details, the main takeaway from this post is this: with o3 LLMs have made a leap forward in their ability to reason about code, and if you work in vulnerability research you should start paying close attention. If you’re an expert-level vulnerability researcher or exploit developer the machines aren’t about to replace you. In fact, it is quite the opposite: they are now at a stage where they can make you significantly more efficient and effective.
This is I suppose an area where the engineer can apply their expertise to build a validation rig that the LLM may be able to utilize.
The rest of the link is tracking, to my (limited) understanding.
I think the NSA already has this, without the need for a LLM.
...if your linux kernel has ksmbd built into it; that's a much smaller interest group
> o3 finds the kerberos authentication vulnerability in 8 of the 100 runs
And I'd guess this only became a blog post because the author already knew about the vuln and was just curious to see if the intern could spot it too, given a curated subset of the codebase