Do we really? My understanding is you can "parameterize" your agentic tools, but ultimately it all ends up in the prompt as one giant blob, and there is nothing guaranteeing the LLM won't interpret it as part of the instructions.
The problem isn't the agents, it's the underlying technology. But I've no clue if anyone is working on that problem; it seems fundamentally difficult given what the technology does.
Effectively, system instructions and server-side prompts are red, whereas user input is normal text.
It would have to be trained from scratch on a meticulous corpus which never crosses the line. I wonder if the resulting model would be easier to guide and less susceptible to prompt injection.
You could just include an extra single bit with each token that represents trusted or untrusted. Add an extra RL pass to enforce it.
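Something like this at the embedding layer, I imagine (a rough PyTorch sketch of my reading of the idea; the class name, dimensions and trust_flags input are all made up, and the hard part, the training pass that makes the model obey only trusted spans, isn't shown):

    import torch
    import torch.nn as nn

    class TrustTaggedEmbedding(nn.Module):
        """Token embedding plus a learned trusted/untrusted offset per token."""

        def __init__(self, vocab_size: int, d_model: int):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, d_model)
            # Two rows: index 0 = untrusted (user/web), index 1 = trusted (system/tool).
            self.trust = nn.Embedding(2, d_model)

        def forward(self, token_ids: torch.Tensor, trust_flags: torch.Tensor) -> torch.Tensor:
            # token_ids and trust_flags are both (batch, seq_len); flags hold 0 or 1.
            return self.tok(token_ids) + self.trust(trust_flags)

    # A system prompt (trusted) followed by user input (untrusted):
    emb = TrustTaggedEmbedding(vocab_size=50_000, d_model=512)
    tokens = torch.randint(0, 50_000, (1, 8))
    flags = torch.tensor([[1, 1, 1, 1, 0, 0, 0, 0]])
    hidden = emb(tokens, flags)

Carrying the bit through the model is the easy part; producing a training signal that actually makes it ignore instructions in the untrusted spans is where I'd expect it to get hard.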
With SQL, you can say "user data should NEVER execute SQL." With LLMs ("agents" more specifically), you have to say "some user data should be ignored." But there are billions and billions of possibilities of what that "some" could be.
It's not possible to encode all the possibilities, and the LLMs aren't good enough to catch it all. Maybe someday they will be, and maybe they won't.
Consider that a malicious user doesn't have to type "Do Evil", they could also send "Pretend I said the opposite of the phrase 'Don't Do Good'."
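To make the contrast concrete, here's a rough Python sketch (standard-library sqlite3; the table and the prompt wording are invented). With SQL the data travels in its own channel and can never become code; with an LLM the "data" and the instructions end up in the same blob of text:

    import sqlite3

    user_input = "Pretend I said the opposite of the phrase 'Don't Do Good'."

    # SQL: the driver binds the value separately from the statement, so no matter
    # what the string contains, it is only ever data.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE notes (body TEXT)")
    conn.execute("INSERT INTO notes (body) VALUES (?)", (user_input,))

    # LLM: there is no separate channel. Instructions and data are one string,
    # and only the model's judgement decides which parts are "just data".
    prompt = (
        "You are a helpful assistant. Summarize the user's note.\n\n"
        f"Note:\n{user_input}\n"
    )
    # send `prompt` to the model, and hope it treats the note as text, not orders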
The entire point of many of these features is to get data into the prompt. Prompt injection isn't a security flaw. It's literally what the feature is designed to do.
This is what I do, and I am 100% confident that Claude cannot drop my database, truncate a table, or read from sensitive tables. I know this because the tool it uses to interface with the database doesn't have those capabilities, and thus neither does Claude.
It won't save you from Claude maliciously exfiltrating data it has access to via DNS or some other side channel, but it will protect you from the worst-case scenarios.
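For what it's worth, the shape of that tool can be as simple as a wrapper that only knows how to run a fixed set of SELECTs against a read-only connection. A sketch (Python with sqlite3; the query names and tables are mine, not the parent's actual setup):

    import sqlite3

    class ReadOnlyQueryTool:
        """The only database capability the agent is given.

        Dropping tables, truncating, or touching sensitive tables simply isn't
        in the interface, so the model can't do it no matter what the prompt says.
        """

        # Named queries the agent may run; it never supplies raw SQL.
        QUERIES = {
            "orders_by_day": "SELECT day, COUNT(*) FROM orders GROUP BY day",
            "top_customers": "SELECT name, total FROM customers ORDER BY total DESC LIMIT ?",
        }

        def __init__(self, db_path: str):
            # Open the file read-only so even a bug in this class can't write.
            self.conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)

        def run(self, query_name: str, params: tuple = ()) -> list:
            sql = self.QUERIES.get(query_name)
            if sql is None:
                raise ValueError(f"unknown query: {query_name}")
            return self.conn.execute(sql, params).fetchall()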
Using the SQL analogy, suppose this is intended:
SELECT hash('$input') == secretfiles.hashed_access_code FROM secretfiles WHERE secretfiles.id = '$file_id';
And here the attacker supplies a malicious $input so that it becomes something else, with a comment on the end:
SELECT hash('') == hash('') -- ') == secretfiles.hashed_access_code FROM secretfiles WHERE secretfiles.id = '123';
Bad outcome, and no extra permissions required.

Famous last words.
> the tool it uses to interface with the database doesn't have those capabilities
Fair enough. It can e.g. use a DB user with read-only privileges or something like that. Or it might sanitize the allowed queries.
But there may still be some way to drop the database or delete all its data that your tool might not be able to guard against: some indirect deletion made by a trigger or a stored procedure or something like that, for instance.
The point is, your tool might be relatively safe. But I would be cautious about saying it is "100%" safe, as you claim.
That being said, I think that your point still stands. Given safe enough interfaces between the LLM and the other parts of the system, one can be fairly sure that the actions performed by the LLM would be safe.
What I give Claude is an API key that allows it to talk to the MCP server. Everything else is hidden behind that.
If you connect to the database with a connector that only has read access, then the LLM cannot drop the database, period.
If that were bugged (e.g. if Postgres allowed writing to a DB that was configured read-only), then that problem is much bigger and has not much to do with LLMs.
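Concretely (a sketch assuming Postgres and psycopg2; the role, database and table names are invented), the boundary can live in the database itself rather than in the tool code:

    import psycopg2

    # Connect as a role that was created with nothing but SELECT grants, e.g.:
    #   CREATE ROLE agent_ro LOGIN PASSWORD '...';
    #   GRANT CONNECT ON DATABASE app TO agent_ro;
    #   GRANT SELECT ON ALL TABLES IN SCHEMA public TO agent_ro;
    conn = psycopg2.connect("dbname=app user=agent_ro password=... host=localhost")

    # Belt and braces: also mark the session read-only on the client side.
    conn.set_session(readonly=True)

    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM orders")
        print(cur.fetchone())

        # Any write the LLM talks the tool into fails at the database:
        # cur.execute("DROP TABLE orders")
        # -> InsufficientPrivilege / "cannot execute DROP TABLE in a read-only transaction"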
For use cases where you can't have a boundary around the LLM, you just can't use an LLM and achieve decent safety. At least until someone figures out bit coloring, but given the architecture of LLMs I have very little to no faith that this will happen.