1. Common themes in our development-time loop:
* We don't do synthetic data. We do real data or anonymized data. When we lack data, we go and get some. That may mean paying people, doing it ourselves, setting up simulation environments, etc.
* We start with synthetic judges, especially for simple at-scale tasks where we're considering smaller models like gpt-4o-mini (the topic here). Before we worry about expert agreement, we worry about gpt-4o agreement, and we build evals that cover concerns like sample size and class imbalance...
* ... When the task is high value, e.g., tied closely to a paying customer deliverable or a core product workflow, we invest more in expert evals, making calls on how many experts and of what caliber. Informally, we've learned that several of our teammates, despite being good at what they do, can be lousy experts, while others are known for precision even if they're not data people (ex: our field staff can be great!). Likewise, we hire subject matter experts as full-timers (ex: former Europol/FBI equivalents!), source them as contractors, and partner with our customers here.
* After a year+ of prompt engineering across different tasks, models, data, and prompt styles, there are a lot of rote tricks & standard practices we know. Most are 'static' -- you can audit a prompt for gotchas & fill in top examples -- and a smaller number are dynamic, as in the OP's suggestion of prompts that pull in elements like RAG results at run time.
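On the judge-agreement point above: one sanity check worth doing before trusting a cheap judge is chance-corrected agreement rather than raw accuracy, since class imbalance inflates the latter. A minimal sketch with stdlib only -- the labels and function are made up for illustration, not our actual harness:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two label lists.

    Raw agreement is misleading under class imbalance: if 90% of items
    are 'pass', two judges agree ~82% of the time by chance alone.
    Kappa subtracts out that chance agreement.
    """
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    if expected == 1.0:  # both judges emit a single label
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical labels: the expensive judge (gpt-4o) vs. the cheap one (gpt-4o-mini)
strong = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
cheap  = ["pass", "pass", "fail", "pass", "pass", "pass", "pass", "fail"]

raw = sum(x == y for x, y in zip(strong, cheap)) / len(strong)
kappa = cohens_kappa(strong, cheap)
print(f"raw agreement={raw:.2f}  kappa={kappa:.2f}")
```

Raw agreement here looks fine while kappa is noticeably lower, which is exactly the gap that bites you on imbalanced eval sets; sample size matters too, since kappa on a handful of items is noise.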
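And on the static-vs-dynamic split: the dynamic case is basically "template plus whatever retrieval hands you at run time." A toy sketch, assuming a hypothetical `retrieve` callable standing in for whatever vector/keyword search you actually run (the corpus and retriever below are made up):

```python
def build_prompt(task, question, retrieve, k=3):
    """Assemble a prompt from a static template plus dynamically
    retrieved context (the RAG-style element mentioned above).
    `retrieve` is a stand-in for your real search backend."""
    snippets = retrieve(question, k)
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        f"Task: {task}\n"
        f"Relevant context (retrieved at run time):\n{context}\n"
        f"Question: {question}\n"
        "Answer using only the context above."
    )

# Toy retriever: naive keyword overlap over an in-memory corpus.
corpus = [
    "Postgres stores row visibility info in the xmin/xmax system columns.",
    "Kafka consumer groups rebalance when membership changes.",
    "Splunk indexes events by time and sourcetype.",
]

def toy_retrieve(query, k):
    words = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: -len(words & set(d.lower().split())))
    return scored[:k]

print(build_prompt("log QA", "How does Splunk index events", toy_retrieve, k=1))
```

The static-audit tricks still apply to the template half; the dynamic half is where retrieval quality, not prompt wording, dominates.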
On that last dynamic-prompting point, it seems incredibly automatable, so I keep trying tools. I've found automatic prompt optimizers like dspy disappointing in that they can't match what our prompt engineers do here: they did no better than prompts we wrote as experts with bare-bones iteration, and leaning harder into the tools failed to get noticeable lift. I don't think this is inherent; they're probably just evaluating against people we would consider trainees. Ex: I see what Stanford medical fellows + PhDs are doing for their genAI publications, and they would probably benefit from dspy if it were easier, but again, we would classify them as 'interns' wrt the quality of prompt engineering I see them doing behind the scenes. I'm optimistic that by 2026 tools here will be useful for skilled AI engineers too; they're just not there yet.
2. It's a lot more murky when we get into online+active learning loops for LLMs & agentic pipelines.
E.g., louie.ai works with live operational databases, where there is a lot you can learn from wrt both people and systems, but also issues like databases changing, differences in role & expertise, data privacy, adversarial data, and workflows that themselves change. Another area we deal with is data streams where the physical realities behind them change (questions + answers about logs, news, social, etc).
IMO these are a lot harder, and they're one of the areas a lot of our 2025 energy is going into. Conversely, 'automatic prompt engineering' seems like something PhDs can make big strides on in a vacuum...