Archestra's Dual LLM Pattern: Using "Guess Who?" Logic to Stop Prompt Injections (opens in new tab)

(archestra.ai)

6 pointsildari7mo ago11 comments

11 comments

ildariOP7mo ago

Hi HN, I'm Ildar from Archestra, we build an open-source LLM gateway. We've been exploring ways to protect AI agents from prompt injections during tool calls and added the approach, inspired by the game "Guess Who", where the agent can learn what it needs without ever seeing the actual result. See the details in the blog post we wrote

magicalhippo7mo ago

I might be having a daft moment, but I don't fully understand how your system avoids the malicious prompt. I get that the quarantined LLM which is the only one processing the raw input cannot act on it.

However, in your example, I don't see how the agent decides what to do and how to do it. So it is unclear for me how the main agent is protected. That is, what is preventing the quarantined LLM to act on the malicious instructions instead, ignoring the documentation update, causing the agent to act on those?

That is, what is preventing the quarantined LLM to make the agent think it should generate a bug report with all the API keys in it?

Anyway, I do think having a secondary quarantined LLM seems like a good idea for agentic systems. In general, having a second LLM review the primary LLM in seems to identify a lot of problematic issues and leads to significantly better results.

ildariOP7mo ago

The idea is that quarantined LLM has access to untrusted data, but doesn't have access to any tools or sensitive data.

The main LLM does have access to the tools or sensitive data, but doesn't have direct access to untrusted data (quarantine LLM is restricted at the controller level to respond only with integer digits, and only to legitimate questions from the main llm)

magicalhippo7mo ago

Then I don't think I understand your full setup.

In the example case, without having access to the issue text (the evil data), how does the main LLM actually figure out what to do if the quarantined LLM can just answer with digits?

Sure it can discover that it's a request to update the documentation, but how does it get the information it needs to actually change the erroneous part of the documentation?

1 more reply

magicalhippo7mo ago

I've tried some of these prompt injection techniques, and simply asked a few local models (like Gemma 2) if they thought it was very likely a prompt injection attempt. They all managed to correctly flag my attempts.

I know LLama folks have a special Guard model for example, which I imagine is for such tasks.

So my ignorant questions are this:

Do these MCP endpoints not run such guard models, and if so why not?

If they do, how come they don't stop such blatant attacks that seemingly even an old local model like Gemma 2 can sniff out?

joeyorlando7mo ago

hey there

Joey here from Archestra. Good question. I recently was evaluating what you mention, against the latest/"smartest" models from the big LLM providers, and I was able to trick all of them.

Take a look at https://www.archestra.ai/blog/what-is-a-prompt-injection which has all the details on how I did this.

magicalhippo7mo ago

Thanks. Interesting and scary such blatant attempts succeed. After all, all external data is evil, we all know that right?

ildariOP7mo ago

external data is unavoidable for the properly functioning agent, so we have to learn to cook it

1 more reply

ildariOP7mo ago

Most mcp endpoints don’t run any models, the main model decides which tools the ai agent should execute, and if the agent passes results back into context, that opens the door to prompt injections.

It’s really a cat-and-mouse game, where for each new model version, new jailbreaks and injections are found

j / k navigate · click thread line to collapse

11 comments

ildariOP7mo ago

magicalhippo7mo ago

That is, what is preventing the quarantined LLM to make the agent think it should generate a bug report with all the API keys in it?

ildariOP7mo ago

The idea is that quarantined LLM has access to untrusted data, but doesn't have access to any tools or sensitive data.

magicalhippo7mo ago

Then I don't think I understand your full setup.

In the example case, without having access to the issue text (the evil data), how does the main LLM actually figure out what to do if the quarantined LLM can just answer with digits?

Sure it can discover that it's a request to update the documentation, but how does it get the information it needs to actually change the erroneous part of the documentation?

1 more reply

magicalhippo7mo ago

I know LLama folks have a special Guard model for example, which I imagine is for such tasks.

So my ignorant questions are this:

Do these MCP endpoints not run such guard models, and if so why not?

If they do, how come they don't stop such blatant attacks that seemingly even an old local model like Gemma 2 can sniff out?

joeyorlando7mo ago

hey there

Joey here from Archestra. Good question. I recently was evaluating what you mention, against the latest/"smartest" models from the big LLM providers, and I was able to trick all of them.

Take a look at https://www.archestra.ai/blog/what-is-a-prompt-injection which has all the details on how I did this.

magicalhippo7mo ago

Thanks. Interesting and scary such blatant attempts succeed. After all, all external data is evil, we all know that right?

ildariOP7mo ago

external data is unavoidable for the properly functioning agent, so we have to learn to cook it

1 more reply

ildariOP7mo ago

Most mcp endpoints don’t run any models, the main model decides which tools the ai agent should execute, and if the agent passes results back into context, that opens the door to prompt injections.

It’s really a cat-and-mouse game, where for each new model version, new jailbreaks and injections are found

j / k navigate · click thread line to collapse