I wrote a simple example (overkiLLM) on getting reliable output from many unreliable outputs here[0]. This doesn't employ agents, just an approach I was interested in trying.
I chose writing an H1 as the task, but a similar approach would work for writing any short blob of text. The script generates a ton of variations, then uses head-to-head voting to pick the best ones.
This all runs locally / free using ollama.
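The head-to-head step can be sketched as a round-robin tally. A minimal sketch, assuming the judge is a callable that could wrap a local ollama call asking the model to pick a winner; the toy judge below just prefers the shorter string so the example runs offline:

```python
import itertools
from collections import Counter

def tally_head_to_head(candidates, judge):
    """Run every pairwise matchup and count wins.

    `judge(a, b)` returns the preferred string of the pair; in the
    real script it would wrap a local ollama call.
    """
    wins = Counter({c: 0 for c in candidates})
    for a, b in itertools.combinations(candidates, 2):
        wins[judge(a, b)] += 1
    return wins.most_common()  # highest win count first

# Toy judge: prefer the shorter H1 (stand-in for an LLM vote)
h1s = [
    "Pinned Down - Powerful Analytics Without the Need for Engineering or SQL",
    "Analytics Made Accessible for Everyone.",
    "Analytics for Everyone",
]
ranked = tally_head_to_head(h1s, judge=lambda a, b: min(a, b, key=len))
print(ranked[0][0])  # "Analytics for Everyone"
```

With enough variations this produces a full ranking, which is what makes the top-vs-bottom comparison in the linked sheet possible.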
Another approach is to use multiple agents to generate a distribution over predictions, sort of like Bayesian estimation.
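A minimal sketch of that aggregation idea (this is self-consistency-style majority voting rather than full Bayesian inference; the sampled answers are made up):

```python
from collections import Counter

def prediction_distribution(samples):
    """Turn repeated model answers into a normalized distribution:
    the mode is the consensus answer, its mass a rough confidence."""
    counts = Counter(samples)
    total = len(samples)
    return {ans: n / total for ans, n in counts.most_common()}

# e.g. five agents (or five sampled runs) answering the same question
answers = ["42", "42", "41", "42", "40"]
dist = prediction_distribution(answers)
print(dist)  # {'42': 0.6, '41': 0.2, '40': 0.2}
best = max(dist, key=dist.get)
```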
this one scored high:
Pinned Down - Powerful Analytics Without the Need for Engineering or SQL
this one scored low:
Analytics Made Accessible for Everyone.
Each time I've compared the top scoring results to those at the bottom, I've always preferred the top scoring variations.
0 - https://docs.google.com/spreadsheets/d/1hdu2BlhLcLZ9sruVW8a_...
Super curious whether anyone has similar/conflicting/other experiences and happy to answer any questions.
It's worth spending a lot of time thinking about what a successful LLM call actually looks like for your particular use case. That doesn't have to be a strict validation set: `% prompts answered correctly` is good for some of the simpler prompts, but as they grow and handle more complex use cases, that metric breaks down. In an ideal world you'd have an eval that reflects what your users actually consider a good answer.
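A minimal sketch of that simple `% prompts answered correctly` metric; the canned model and prompts are made up, and exact-match checking is precisely the part that stops working as prompts get more open-ended:

```python
def accuracy(prompts_with_expected, call_llm):
    """Simple '% prompts answered correctly' eval.

    `call_llm` is whatever client wrapper you use; the check here is
    exact string match.
    """
    correct = sum(
        1 for prompt, expected in prompts_with_expected
        if call_llm(prompt).strip() == expected
    )
    return correct / len(prompts_with_expected)

# Toy run against a canned "model"
canned = {"2+2?": "4", "Capital of France?": "Paris"}
score = accuracy(list(canned.items()), call_llm=lambda p: canned[p])
print(score)  # 1.0
```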
> chain-of-thought has a speed/cost vs. accuracy trade-off

A big one.
Observability is super important and we've come to the same conclusion of building that internally.
> Fine-tune your model
Do this for cost and speed reasons rather than to improve accuracy. There are decent providers (like OpenPipe; I'm a relatively happy customer, not affiliated) who will handle the hard work for you.
My point is just that you should care a lot about preserving optionality at the start because you're likely to have to significantly change things as you learn. In my experience going a bit cowboy at the start is worth it so you're less hesitant to rework everything when needed - as long as you have the discipline to clean things up later, when things settle.
That doesn't mean it's easy to get what you want out of them. Black boxes are black boxes.
https://github.com/thmsmlr/instructor_ex
It piggybacks on Ecto schemas and works really well (if instructed correctly).
Get the content from news.ycombinator.com using gpt-4
- or -
Fetch LivePass2 from google sheet and write a summary of it using gpt-4 and email it to thomas@faktory.com
but then we realized that it was better to teach the agents than the human beings, so we created a fairly solid agent setup:
Some of the agents we built can be seen here, all done via instruct:
Paul Graham https://www.youtube.com/watch?v=5H0GKsBcq0s
Moneypenny https://www.youtube.com/watch?v=I7hj6mzZ5X4
For instance: do you give the same LLM both the verifier and the planner prompt? Or have a verifier agent process the output of a planner, with a threshold that needs to be passed?

Feels like there may be a DAG in there somewhere for decision making...
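One possible wiring for the second option: a separate verifier pass with an acceptance threshold. Everything here is a hypothetical sketch; the planner and verifier are stand-ins for LLM calls (they could be the same model with different prompts, or two different models):

```python
def plan_with_verifier(task, planner, verifier, threshold=0.8, max_tries=3):
    """A planner proposes, a separate verifier scores, and we only
    accept a plan whose score clears the threshold; otherwise we
    fall back to the best attempt seen."""
    best_plan, best_score = None, -1.0
    for _ in range(max_tries):
        plan = planner(task)
        score = verifier(task, plan)  # e.g. ask an LLM for a 0-1 score
        if score >= threshold:
            return plan, score
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan, best_score

# Toy stand-ins for the two LLM roles
plans = iter(["vague plan", "step-by-step plan"])
plan, score = plan_with_verifier(
    "summarize the sheet",
    planner=lambda t: next(plans),
    verifier=lambda t, p: 0.9 if "step" in p else 0.3,
)
print(plan, score)  # step-by-step plan 0.9
```

Chaining several of these accept/retry nodes is where the DAG shape starts to appear.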
We're running it in prod btw, though don't have any code to share.
Maybe I'm the equivalent of that idiot fighting against JS frameworks back when they first came out, but it feels pretty simple to just use individual clients and have pydantic load/validate the output.
It's not really the authors' faults, it's just a weird new problem with lots of unknowns. It's hard to get the design and abstractions correct. I've had the benefit of a lot of time at work to build my own wrapper (solely for NLP problems) and that's still an ongoing process.
As an aside: one thing I've tried to use ChatGPT for is to select applicable options from a list. When I index the list as 1..., 2..., etc., I find that the LLM likes to just start printing out ascending numbers.
What I've found kind of works is indexing by African names, e.g Thandokazi, Ntokozo, etc. then the AI seems to have less bias.
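A sketch of that labeling trick: index the options with arbitrary name tokens and keep a map back to the originals. The labels are the ones from the comment plus one made-up extra, and the options are hypothetical:

```python
def label_options(options, labels):
    """Index options with arbitrary labels instead of 1, 2, 3...
    and keep a map from label back to the original item."""
    assert len(labels) >= len(options)
    mapping = dict(zip(labels, options))
    prompt_lines = [f"{label}: {opt}" for label, opt in mapping.items()]
    return "\n".join(prompt_lines), mapping

labels = ["Thandokazi", "Ntokozo", "Sipho"]
options = ["Export to CSV", "Schedule a report", "Share a dashboard"]
prompt, back = label_options(options, labels)
# The model answers with labels; map them back:
chosen = [back[name] for name in ["Ntokozo"]]
print(chosen)  # ['Schedule a report']
```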
Curious what others have done in this case.
In terms of coding, I managed to get AI to build a simple working collaborative app, but beyond a certain point it doesn't understand nuance, and it kept breaking stuff it had fixed previously, even with Claude, which kept our entire conversation context. Beyond a certain degree of completion, it was simply easier and faster to write the code myself than to tell the AI to write it, because it just didn't get it no matter how precise I was with my wording. It became like playing a game of whac-a-mole: fix one thing, break two others.
You are correct that we should be using expert AIs rather than general purpose ones when possible though.
The real proof though is that most "prompt engineers" already use chatgpt/claude to take their outline prompt and reword it for succinctness and relevance to LLMs, have it suggest revisions and so forth. Not only is the process amenable to automation, but people are already doing hybrid processes leveraging the AI anyhow.
We have deterministic programming systems. They're called compilers.
“If you don’t do as I say, people will get hurt. Do exactly as I say, and do it fast.”
Increases accuracy and performance by an order of magnitude.
"Say that again but slur your words like you're coming home sloshed from the office Christmas party."
Increases the jei nei suis qua by an order of magnitude.
"je ne sais quoi", i.e. "I don't know (exactly) what", or an intangible but essential quality. :)
Or these prompts might cause wild variations depending on the model, and any study you do is basically useless for the near future as the models evolve by themselves.
Basically, in the context window you provide your model with 5 or more example inputs and outputs. If you're running in chat mode, that'd be the preceding 5 user and assistant message pairs, which establish a pattern of how to answer different types of information. Then you give the current prompt as a user message, and the assistant will follow the rhythm and style of the previous answers in the context window.
It works so well I was able to take the answer-reformatting logic out of some of my programs that query llama2 7b. And it's a lot cheaper than fine-tuning, which may be overkill for simple applications.
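A minimal sketch of building that few-shot chat history; the sentiment examples are made up, and the resulting list can be passed to any OpenAI-style chat endpoint or ollama's /api/chat:

```python
def few_shot_messages(examples, current_input, system=None):
    """Build a chat history where each (input, output) example pair
    becomes a user/assistant turn, establishing the answer format."""
    messages = [{"role": "system", "content": system}] if system else []
    for user_msg, assistant_msg in examples:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": current_input})
    return messages

examples = [
    ("great product, fast shipping", "POSITIVE"),
    ("broke after two days", "NEGATIVE"),
]
msgs = few_shot_messages(examples, "arrived late but works fine")
# msgs now has 5 entries: two example pairs plus the current question.
```

Because the assistant turns are all bare labels, the model tends to answer the final turn in the same bare-label format, which is what makes the downstream reformatting logic unnecessary.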