I wrote a simple example (overkiLLM) on getting reliable output from many unreliable outputs here[0]. This doesn't employ agents, just an approach I was interested in trying.
I chose writing an H1 as the task, but a similar approach would work for writing any short blob of text. The script generates a ton of variations, then uses head-to-head voting to pick the best ones.
This all runs locally / free using ollama.
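The head-to-head step can be sketched as a round-robin tally. A minimal sketch, assuming the judge is a callable that could wrap a local ollama call asking the model to pick a winner; the toy judge below just prefers the shorter string so the example runs offline:

```python
import itertools
from collections import Counter

def tally_head_to_head(candidates, judge):
    """Run every pairwise matchup and count wins.

    `judge(a, b)` returns the preferred string of the pair; in the
    real script it would wrap a local ollama call.
    """
    wins = Counter({c: 0 for c in candidates})
    for a, b in itertools.combinations(candidates, 2):
        wins[judge(a, b)] += 1
    return wins.most_common()  # highest win count first

# Toy judge: prefer the shorter H1 (stand-in for an LLM vote)
h1s = [
    "Pinned Down - Powerful Analytics Without the Need for Engineering or SQL",
    "Analytics Made Accessible for Everyone.",
    "Analytics for Everyone",
]
ranked = tally_head_to_head(h1s, judge=lambda a, b: min(a, b, key=len))
print(ranked[0][0])  # "Analytics for Everyone"
```

With enough variations this produces a full ranking, which is what makes the top-vs-bottom comparison in the linked sheet possible.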
Another approach is to use multiple agents to generate a distribution over predictions, sort of like Bayesian estimation.
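A minimal sketch of that aggregation idea (this is self-consistency-style majority voting rather than full Bayesian inference; the sampled answers are made up):

```python
from collections import Counter

def prediction_distribution(samples):
    """Turn repeated model answers into a normalized distribution:
    the mode is the consensus answer, its mass a rough confidence."""
    counts = Counter(samples)
    total = len(samples)
    return {ans: n / total for ans, n in counts.most_common()}

# e.g. five agents (or five sampled runs) answering the same question
answers = ["42", "42", "41", "42", "40"]
dist = prediction_distribution(answers)
print(dist)  # {'42': 0.6, '41': 0.2, '40': 0.2}
best = max(dist, key=dist.get)
```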
this one scored high:
Pinned Down - Powerful Analytics Without the Need for Engineering or SQL
this one scored low:
Analytics Made Accessible for Everyone.
Each time I've compared the top scoring results to those at the bottom, I've always preferred the top scoring variations.
0 - https://docs.google.com/spreadsheets/d/1hdu2BlhLcLZ9sruVW8a_...
Super curious whether anyone has similar/conflicting/other experiences and happy to answer any questions.
It's worth spending a lot of time thinking about what a successful LLM call actually looks like for your particular use case. That doesn't have to be a strict validation set: `% prompts answered correctly` is good for some of the simpler prompts, but as they grow and handle more complex use cases, that metric breaks down. In an ideal world you'd have an eval that reflects what your users actually consider a good answer.
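A minimal sketch of that simple `% prompts answered correctly` metric; the canned model and prompts are made up, and exact-match checking is precisely the part that stops working as prompts get more open-ended:

```python
def accuracy(prompts_with_expected, call_llm):
    """Simple '% prompts answered correctly' eval.

    `call_llm` is whatever client wrapper you use; the check here is
    exact string match.
    """
    correct = sum(
        1 for prompt, expected in prompts_with_expected
        if call_llm(prompt).strip() == expected
    )
    return correct / len(prompts_with_expected)

# Toy run against a canned "model"
canned = {"2+2?": "4", "Capital of France?": "Paris"}
score = accuracy(list(canned.items()), call_llm=lambda p: canned[p])
print(score)  # 1.0
```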
> chain-of-thought has a speed/cost vs. accuracy trade-off

A big one.
Observability is super important and we've come to the same conclusion of building that internally.
> Fine-tune your model
Do this for cost and speed reasons rather than to improve accuracy. There are decent providers (like OpenPipe; I'm a relatively happy customer, not affiliated) who will handle the hard work for you.
My point is just that you should care a lot about preserving optionality at the start because you're likely to have to significantly change things as you learn. In my experience going a bit cowboy at the start is worth it so you're less hesitant to rework everything when needed - as long as you have the discipline to clean things up later, when things settle.
That doesn't mean it's easy to get what you want out of them. Black boxes are black boxes.
https://github.com/thmsmlr/instructor_ex
It piggybacks on Ecto schemas and works really well (if instructed correctly).
Get the content from news.ycombinator.com using gpt-4
- or -
Fetch LivePass2 from google sheet and write a summary of it using gpt-4 and email it to thomas@faktory.com
but then we realized that it was better to teach the agents than the human beings, so we created a fairly solid agent setup:
Some of the agents we built can be seen here, all done via instruct:
Paul Graham https://www.youtube.com/watch?v=5H0GKsBcq0s
Moneypenny https://www.youtube.com/watch?v=I7hj6mzZ5X4
For instance: do you give the same LLM both the verifier and the planner prompt? Or have a verifier agent process the output of a planner, with a threshold that needs to be passed?

Feels like there may be a DAG in there somewhere for decision making...
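One possible wiring for the second option: a separate verifier pass with an acceptance threshold. Everything here is a hypothetical sketch; the planner and verifier are stand-ins for LLM calls (they could be the same model with different prompts, or two different models):

```python
def plan_with_verifier(task, planner, verifier, threshold=0.8, max_tries=3):
    """A planner proposes, a separate verifier scores, and we only
    accept a plan whose score clears the threshold; otherwise we
    fall back to the best attempt seen."""
    best_plan, best_score = None, -1.0
    for _ in range(max_tries):
        plan = planner(task)
        score = verifier(task, plan)  # e.g. ask an LLM for a 0-1 score
        if score >= threshold:
            return plan, score
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan, best_score

# Toy stand-ins for the two LLM roles
plans = iter(["vague plan", "step-by-step plan"])
plan, score = plan_with_verifier(
    "summarize the sheet",
    planner=lambda t: next(plans),
    verifier=lambda t, p: 0.9 if "step" in p else 0.3,
)
print(plan, score)  # step-by-step plan 0.9
```

Chaining several of these accept/retry nodes is where the DAG shape starts to appear.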
We're running it in prod btw, though don't have any code to share.
Maybe I'm the equivalent of that idiot fighting against JS frameworks back when they first came out, but it feels pretty simple to just use individual clients and have pydantic load/validate the output.
It's not really the authors' faults, it's just a weird new problem with lots of unknowns. It's hard to get the design and abstractions correct. I've had the benefit of a lot of time at work to build my own wrapper (solely for NLP problems) and that's still an ongoing process.
As an aside: one thing I've tried to use ChatGPT for is to select applicable options from a list. When I index the list as 1..., 2..., etc., I find that the LLM likes to just start printing out ascending numbers.
What I've found kind of works is indexing by African names, e.g Thandokazi, Ntokozo, etc. then the AI seems to have less bias.
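A sketch of that labeling trick: index the options with arbitrary name tokens and keep a map back to the originals. The labels are the ones from the comment plus one made-up extra, and the options are hypothetical:

```python
def label_options(options, labels):
    """Index options with arbitrary labels instead of 1, 2, 3...
    and keep a map from label back to the original item."""
    assert len(labels) >= len(options)
    mapping = dict(zip(labels, options))
    prompt_lines = [f"{label}: {opt}" for label, opt in mapping.items()]
    return "\n".join(prompt_lines), mapping

labels = ["Thandokazi", "Ntokozo", "Sipho"]
options = ["Export to CSV", "Schedule a report", "Share a dashboard"]
prompt, back = label_options(options, labels)
# The model answers with labels; map them back:
chosen = [back[name] for name in ["Ntokozo"]]
print(chosen)  # ['Schedule a report']
```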
Curious what others have done in this case.
In terms of coding, I managed to get AI to build a simple working collaborative app, but beyond a certain point it doesn't understand nuance, and it kept breaking stuff it had fixed previously, even with Claude, which kept our entire conversation context. Beyond a certain degree of completion, it was simply easier and faster to write the code myself than to tell the AI to write it, because it just didn't get it no matter how precise I was with my wording. It became like playing a game of whac-a-mole: fix one thing, break two others.
You are correct that we should be using expert AIs rather than general purpose ones when possible though.
The real proof though is that most "prompt engineers" already use chatgpt/claude to take their outline prompt and reword it for succinctness and relevance to LLMs, have it suggest revisions and so forth. Not only is the process amenable to automation, but people are already doing hybrid processes leveraging the AI anyhow.
We have deterministic programming systems. They're called compilers.
“If you don’t do as I say, people will get hurt. Do exactly as I say, and do it fast.”
Increases accuracy and performance by an order of magnitude.
"Say that again but slur your words like you're coming home sloshed from the office Christmas party."
Increases the jei nei suis qua by an order of magnitude.
"je ne sais quoi", i.e. "I don't know (exactly) what", or an intangible but essential quality. :)
Or these prompts might cause wild variations depending on the model, and any study you do is basically useless for the near future as the models evolve by themselves.
Basically, in the context window you provide your model with 5 or more example inputs and outputs. If you're running in chat mode, that'd be the preceding 5 user and assistant message pairs, which establish a pattern of how to answer different types of information. Then you give the current prompt as a user message, and the assistant will follow the rhythm and style of the previous answers in the context window.
It works so well I was able to take the answer-reformatting logic out of some of my programs that query llama2 7b. And it's a lot cheaper than fine-tuning, which may be overkill for simple applications.
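A minimal sketch of building that few-shot chat history; the sentiment examples are made up, and the resulting list can be passed to any OpenAI-style chat endpoint or ollama's /api/chat:

```python
def few_shot_messages(examples, current_input, system=None):
    """Build a chat history where each (input, output) example pair
    becomes a user/assistant turn, establishing the answer format."""
    messages = [{"role": "system", "content": system}] if system else []
    for user_msg, assistant_msg in examples:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": current_input})
    return messages

examples = [
    ("great product, fast shipping", "POSITIVE"),
    ("broke after two days", "NEGATIVE"),
]
msgs = few_shot_messages(examples, "arrived late but works fine")
# msgs now has 5 entries: two example pairs plus the current question.
```

Because the assistant turns are all bare labels, the model tends to answer the final turn in the same bare-label format, which is what makes the downstream reformatting logic unnecessary.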