Seems like this is now best done via functions, if you're using OpenAI's models? They call out "extracting structured data from text" as a key use case in their announcement.
https://openai.com/blog/function-calling-and-other-api-updat...
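For the structured-extraction use case, the idea is to describe a "function" whose parameters are the metadata fields you want; the model then "calls" it with JSON arguments, which is your structured output. A minimal sketch (the `extract_metadata` schema and the field names are made up for illustration; the response is mocked here rather than fetched from the API):

```python
import json

# Hypothetical function schema: the parameters are the metadata fields we want.
EXTRACT_METADATA_FN = {
    "name": "extract_metadata",
    "description": "Extract metadata fields from an article.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "author": {"type": "string"},
            "published": {"type": "string", "description": "ISO 8601 date"},
        },
        "required": ["title"],
    },
}

def parse_function_call(message: dict) -> dict:
    """Pull the structured arguments out of an assistant message."""
    call = message.get("function_call")
    if call is None or call.get("name") != "extract_metadata":
        raise ValueError("model did not call extract_metadata")
    return json.loads(call["arguments"])

# Shape of the message the chat completions API returns when the model
# decides to call the function (mocked, no API call made here):
mock_message = {
    "role": "assistant",
    "content": None,
    "function_call": {
        "name": "extract_metadata",
        "arguments": '{"title": "Launch post", "author": "Jane Doe"}',
    },
}
print(parse_function_call(mock_message))
```

Since the arguments come back as a JSON string generated by the model, you'd still want to validate them against the schema before trusting them.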
Does anybody here have experience with metadata extraction using LLMs? I've been thinking about it recently, and wonder if just making a big prompt and putting that into the GPT API or even ChatGPT is really the way to go, or if there is a "cleverer" way. Maybe you could train specifically for certain fields, or use the LLM in a different way (like using the embeddings directly to do similarity search)?
Another idea was, if you have a lot of similar HTML documents, to not ask the LLM for the metadata, but to ask it for CSS selectors that contain the metadata fields - assuming it can deal with HTML and the data is verbatim in there. Then you should be able to get much more consistent results.
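The nice property of that approach is the LLM runs once per template, not once per page: you show it one sample document, it proposes selectors, and plain parsing code applies them to the other documents. A sketch of the "apply" half (the selector paths here are invented, and written in ElementTree's limited XPath syntax for the sake of a stdlib-only example; with a library like BeautifulSoup or lxml you could use the CSS selectors the model actually returns):

```python
import xml.etree.ElementTree as ET

# Suppose the LLM, shown one sample page, proposed these paths for the
# metadata fields (hypothetical output, hand-written for this example).
SELECTORS = {
    "title": ".//h1",
    "author": ".//span[@class='author']",
}

def extract(html: str) -> dict:
    """Apply the LLM-proposed paths to a page; assumes well-formed markup."""
    root = ET.fromstring(html)
    return {field: root.findtext(path) for field, path in SELECTORS.items()}

page = """<html><body>
  <h1>Launch post</h1>
  <span class='author'>Jane Doe</span>
</body></html>"""
print(extract(page))
```

Real-world HTML is rarely well-formed, so a lenient parser would be needed in practice, but the division of labor stays the same: the model guesses the selectors once, deterministic code does the extraction everywhere else.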
Try it out: https://kadoa.com