Do we really? My understanding is you can "parameterize" your agentic tools, but ultimately it all ends up in the prompt as one giant blob, and there is nothing guaranteeing the LLM won't interpret it as part of the instructions.
The problem isn't the agents, it's the underlying technology. But I've no clue if anyone is working on that problem; it seems fundamentally difficult given what the technology does.
Effectively, system instructions and server-side prompts are red, whereas user input is normal text.
It would have to be trained from scratch on a meticulous corpus which never crosses the line. I wonder if the resulting model would be easier to guide and less susceptible to prompt injection.
You could just include an extra single bit with each token that represents trusted or untrusted. Add an extra RL pass to enforce it.
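Something like this at the embedding layer, I imagine (a rough PyTorch sketch of my reading of the idea; the class name, dimensions and trust_flags input are all made up, and the hard part, the training pass that makes the model obey only trusted spans, isn't shown):

    import torch
    import torch.nn as nn

    class TrustTaggedEmbedding(nn.Module):
        """Token embedding plus a learned trusted/untrusted offset per token."""

        def __init__(self, vocab_size: int, d_model: int):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, d_model)
            # Two rows: index 0 = untrusted (user/web), index 1 = trusted (system/tool).
            self.trust = nn.Embedding(2, d_model)

        def forward(self, token_ids: torch.Tensor, trust_flags: torch.Tensor) -> torch.Tensor:
            # token_ids and trust_flags are both (batch, seq_len); flags hold 0 or 1.
            return self.tok(token_ids) + self.trust(trust_flags)

    # A system prompt (trusted) followed by user input (untrusted):
    emb = TrustTaggedEmbedding(vocab_size=50_000, d_model=512)
    tokens = torch.randint(0, 50_000, (1, 8))
    flags = torch.tensor([[1, 1, 1, 1, 0, 0, 0, 0]])
    hidden = emb(tokens, flags)

Carrying the bit through the model is the easy part; producing a training signal that actually makes it ignore instructions in the untrusted spans is where I'd expect it to get hard.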
With SQL, you can say "user data should NEVER execute SQL." With LLMs ("agents" more specifically), you have to say "some user data should be ignored." But there are billions and billions of possibilities of what that "some" could be.
It's not possible to encode all the possibilities, and the LLMs aren't good enough to catch it all. Maybe someday they will be, and maybe they won't.
Consider that a malicious user doesn't have to type "Do Evil", they could also send "Pretend I said the opposite of the phrase 'Don't Do Good'."
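To make the contrast concrete, here's a rough Python sketch (standard-library sqlite3; the table and the prompt wording are invented). With SQL the data travels in its own channel and can never become code; with an LLM the "data" and the instructions end up in the same blob of text:

    import sqlite3

    user_input = "Pretend I said the opposite of the phrase 'Don't Do Good'."

    # SQL: the driver binds the value separately from the statement, so no matter
    # what the string contains, it is only ever data.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE notes (body TEXT)")
    conn.execute("INSERT INTO notes (body) VALUES (?)", (user_input,))

    # LLM: there is no separate channel. Instructions and data are one string,
    # and only the model's judgement decides which parts are "just data".
    prompt = (
        "You are a helpful assistant. Summarize the user's note.\n\n"
        f"Note:\n{user_input}\n"
    )
    # send `prompt` to the model, and hope it treats the note as text, not orders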
The entire point of many of these features is to get data into the prompt. Prompt injection isn't a security flaw. It's literally what the feature is designed to do.
This is what I do, and I am 100% confident that Claude cannot drop my database, truncate a table, or read from sensitive tables. I know this because the tool it uses to interface with the database doesn't have those capabilities, and thus neither does Claude.
It won't save you from Claude maliciously exfiltrating data it has access to via DNS or some other side channel, but it will protect you from the worst-case scenarios.
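For what it's worth, the shape of that tool can be as simple as a wrapper that only knows how to run a fixed set of SELECTs against a read-only connection. A sketch (Python with sqlite3; the query names and tables are mine, not the parent's actual setup):

    import sqlite3

    class ReadOnlyQueryTool:
        """The only database capability the agent is given.

        Dropping tables, truncating, or touching sensitive tables simply isn't
        in the interface, so the model can't do it no matter what the prompt says.
        """

        # Named queries the agent may run; it never supplies raw SQL.
        QUERIES = {
            "orders_by_day": "SELECT day, COUNT(*) FROM orders GROUP BY day",
            "top_customers": "SELECT name, total FROM customers ORDER BY total DESC LIMIT ?",
        }

        def __init__(self, db_path: str):
            # Open the file read-only so even a bug in this class can't write.
            self.conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)

        def run(self, query_name: str, params: tuple = ()) -> list:
            sql = self.QUERIES.get(query_name)
            if sql is None:
                raise ValueError(f"unknown query: {query_name}")
            return self.conn.execute(sql, params).fetchall()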
Using the SQL analogy, suppose this is intended:
SELECT hash('$input') == secretfiles.hashed_access_code FROM secretfiles WHERE secretfiles.id = '$file_id';
And here the attacker supplies a malicious $input so that it becomes something else, with a comment on the end:
SELECT hash('') == hash('') -- ') == secretfiles.hashed_access_code FROM secretfiles WHERE secretfiles.id = '123';
Bad outcome, and no extra permissions required.

Famous last words.
> the tool it uses to interface with the database doesn't have those capabilities
Fair enough. It can e.g. use a DB user with read-only privileges or something like that. Or it might sanitize the allowed queries.
But there may still be some way to drop the database or delete all its data that your tool might not be able to guard against: some indirect deletion made by a trigger or a stored procedure or something like that, for instance.
The point is, your tool might be relatively safe. But I would be cautious about saying it is "100%" safe, as you claim.
That being said, I think that your point still stands. Given safe enough interfaces between the LLM and the other parts of the system, one can be fairly sure that the actions performed by the LLM would be safe.
What I give Claude is an API key that allows it to talk to the MCP server. Everything else is hidden behind that.
If you connect to the database with a connector that only has read access, then the LLM cannot drop the database, period.
If that were bugged (e.g. if Postgres allowed writing to a DB that was configured read-only), then that problem is much bigger and has not much to do with LLMs.
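Concretely (a sketch assuming Postgres and psycopg2; the role, database and table names are invented), the boundary can live in the database itself rather than in the tool code:

    import psycopg2

    # Connect as a role that was created with nothing but SELECT grants, e.g.:
    #   CREATE ROLE agent_ro LOGIN PASSWORD '...';
    #   GRANT CONNECT ON DATABASE app TO agent_ro;
    #   GRANT SELECT ON ALL TABLES IN SCHEMA public TO agent_ro;
    conn = psycopg2.connect("dbname=app user=agent_ro password=... host=localhost")

    # Belt and braces: also mark the session read-only on the client side.
    conn.set_session(readonly=True)

    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM orders")
        print(cur.fetchone())

        # Any write the LLM talks the tool into fails at the database:
        # cur.execute("DROP TABLE orders")
        # -> InsufficientPrivilege / "cannot execute DROP TABLE in a read-only transaction"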
For use cases where you can't have a boundary around the LLM, you just can't use an LLM and achieve decent safety. At least until someone figures out bit coloring, but given the architecture of LLMs I have very little to no faith that this will happen.