You could set up a collection in memory or in SQL with Status and UpdatedUtc columns, then poll it each loop for incomplete items until everything reaches the desired state.
Your state machine could be as simple as: New, Processing, Failed, Succeeded. The outer loop queries the collection every second or so for items that are New or Failed and retries them. Items stuck in Processing for more than X seconds are forced to Failed on each pass (they'll get retried on the next one). Each state transition is written to a log with a timestamp for downstream reporting. Failures are set exclusively by the HTTP processing machinery, with timeouts detected as noted above.
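A minimal in-memory sketch of that loop, assuming a synchronous `process` stand-in for the real HTTP call (names, the 30-second stuck threshold, and the `Item` shape are all illustrative, not from the post):

```python
import time
from dataclasses import dataclass, field

NEW, PROCESSING, FAILED, SUCCEEDED = "New", "Processing", "Failed", "Succeeded"
STUCK_AFTER_SECONDS = 30  # the "X seconds" above; pick a value for your workload

@dataclass
class Item:
    payload: str
    status: str = NEW
    updated_utc: float = field(default_factory=time.time)

transitions = []  # timestamped transition log for downstream reporting

def set_status(item: Item, status: str) -> None:
    item.status = status
    item.updated_utc = time.time()
    transitions.append((item.updated_utc, item.payload, status))

def process(item: Item) -> None:
    """Stand-in for the real HTTP work; only this may mark items Failed."""
    set_status(item, PROCESSING)
    try:
        # response = http_post(item.payload)  # hypothetical real call
        set_status(item, SUCCEEDED)
    except Exception:
        set_status(item, FAILED)

def run(items: list[Item], poll_interval: float = 1.0) -> None:
    while True:
        now = time.time()
        # Force items stuck in Processing back to Failed so they get retried.
        for it in items:
            if it.status == PROCESSING and now - it.updated_utc > STUCK_AFTER_SECONDS:
                set_status(it, FAILED)
        pending = [it for it in items if it.status in (NEW, FAILED)]
        if not pending and all(it.status == SUCCEEDED for it in items):
            break
        for it in pending:
            process(it)
        time.sleep(poll_interval)
```

In a real system `process` would dispatch requests concurrently and return immediately, leaving items in Processing until the response (or timeout) arrives.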
Using SQL would make iterating on your batch-processing policies substantially easier. Determining each batch with a SELECT statement lets you add constraints over aggregates: for example, you could cap the number of simultaneous in-flight requests, or abandon all hope and throw if the statistics look bad (e.g., an OpenAI outage).
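A sketch of those two policies using the stdlib `sqlite3` module; the table name, column names, cap, and failure threshold are all assumptions for illustration:

```python
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE items (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL DEFAULT 'New',
        updated_utc REAL NOT NULL
    )
""")
now = time.time()
db.executemany(
    "INSERT INTO items (status, updated_utc) VALUES (?, ?)",
    [("New", now), ("Processing", now), ("Failed", now), ("Succeeded", now)],
)

MAX_IN_FLIGHT = 8       # illustrative cap on simultaneous requests
MAX_FAILURE_RATE = 0.5  # illustrative "abandon all hope" threshold

def next_batch() -> list[int]:
    """Pick retryable items, keeping in-flight + batch <= MAX_IN_FLIGHT."""
    (in_flight,) = db.execute(
        "SELECT COUNT(*) FROM items WHERE status = 'Processing'"
    ).fetchone()
    budget = MAX_IN_FLIGHT - in_flight
    if budget <= 0:
        return []
    rows = db.execute(
        "SELECT id FROM items WHERE status IN ('New', 'Failed') LIMIT ?",
        (budget,),
    ).fetchall()
    return [r[0] for r in rows]

def failure_rate() -> float:
    """Aggregate statistic for the bail-out policy."""
    (failed,) = db.execute(
        "SELECT COUNT(*) FROM items WHERE status = 'Failed'"
    ).fetchone()
    (total,) = db.execute("SELECT COUNT(*) FROM items").fetchone()
    return failed / total if total else 0.0

def check_health() -> None:
    if failure_rate() > MAX_FAILURE_RATE:
        raise RuntimeError("failure rate too high; likely provider outage")
```

Because each policy is just another clause or query, you can tweak batch size, retry ordering, or the bail-out condition without touching the processing loop itself.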