In testing with NASA's Apollo 17 flight documents, it successfully converted complex, multi-oriented pages into well-structured Markdown.
The project is open-source and available on GitHub. Feedback is welcome.
As others have mentioned, consistency is key when parsing documents, and consistency is not something LLMs offer.
The output might look plausible, but without proper validation this is just a nice local playground that can’t make it to production.
Turns out the model needs a temperature of zero (it then seems to behave well, at least in simple tests), but that wasn't set in the model settings.
https://github.com/ollama/ollama/issues/6875#issuecomment-23...
I purposely set the temperature to 0.1, thinking the LLM might need a little wiggle room when whipping up those markdown tables. You know, just enough leeway to get creative if needed.
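For anyone hitting the same thing, here's a sketch of pinning the sampling options per-request in Ollama rather than relying on Modelfile defaults (the model name is a placeholder):

```python
import json

# Ollama /api/generate payload with sampling options set explicitly,
# instead of trusting whatever the Modelfile defaults happen to be.
payload = {
    "model": "llama3.2-vision",  # placeholder model name
    "prompt": "Convert this page to a markdown table.",
    "stream": False,
    "options": {
        "temperature": 0,  # or 0.1 for a little wiggle room
        "seed": 42,        # a fixed seed further improves repeatability
    },
}

request_body = json.dumps(payload)
```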
I've tried multiple OCRs before, and it's hard to tell whether the output is accurate other than by comparing manually.
I created a tool to visualise the output of OCR [0] to see what’s missing and there are many cases that would be quite concerning especially when working with financial data.
This tool wouldn't work with LLMs, as they don't return per-character recognition data (to my knowledge), which makes it harder to evaluate them at scale.
If I wanted to use LLMs for this task, I would use them to help train an ML model to do OCR better, for example by generating thousands of synthetic training samples.
I have seen this odd kind of inconsistency in generating the same results, sometimes even within the same chat after it starts off fine.
I was once trying to extract handwritten dates and times from a very specific part of the page in a large PDF document, in batches of 10 pages at a time. With some documents it started by refusing, but not in other chat windows I tried with the same document. Sometimes it would say there was an error, and then it would work in a new chat window. I'm not sure why, but just starting a new chat works in these kinds of situations.
Sometimes it will start off fine with OCR, then as the task progresses it will start hallucinating. Even though the text to be extracted follows a pattern, like dates, it could not for the life of it get them right.
I'm doubtful you meant what you wrote here. Using a readymade UI or API to perform an effectively magical task (for most of us) is an entirely different paradigm to "just train your own model."
In reality, for us non-ML model training mortals, we're actually probably better off hiring a human to do basic data entry.
User: Extract x from the given scanned document. <sample_img_1>
Assistant: <sample_img_1_output>
User: Extract x from the given scanned document. <sample_img_2>
Assistant: <sample_img_2_output>
User: Extract x from the given scanned document. <query_image>
In my experience, this seems to make the model significantly more consistent.
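For what it's worth, the transcript pattern above maps onto a chat-completions message array like this (OpenAI-style content parts; the URLs, outputs, and instruction text are placeholders):

```python
def build_fewshot_messages(examples, query_image_url, instruction):
    """Build a few-shot chat history: each example is an
    (image_url, expected_output) pair, followed by the real query image."""
    messages = []
    for image_url, expected in examples:
        messages.append({
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        })
        # The assistant turn shows the model what a correct answer looks like
        messages.append({"role": "assistant", "content": expected})
    # Finally, the query image with no answer attached
    messages.append({
        "role": "user",
        "content": [
            {"type": "text", "text": instruction},
            {"type": "image_url", "image_url": {"url": query_image_url}},
        ],
    })
    return messages
```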
Super frustrating when really trying to accomplish something!
The hard part is preventing the model from ignoring parts of the page and from hallucinating (see some of the GPT-4o samples here, like the Xanax notice: https://www.llamaindex.ai/blog/introducing-llamaparse-premiu...)
However, these models will keep getting better, and we may soon have a good PDF-to-Markdown model.
- VLMs are way better at handling layout and context where OCR systems fail miserably
- VLMs read documents like humans do, which makes dealing with special layouts like bullets, tables, charts, and footnotes much more tractable with a single approach, rather than having to special-case a whole bunch of OCR + post-processing
- VLMs are definitely more expensive, but can be specialized and distilled for accurate and cost effective inference
In general, I think vision + LLMs can be trained explicitly to "extract" information and avoid reasoning/hallucinating about the text. The reasoning can be another module altogether.
If your old-school OCR output contains text that is absent from the visual model's output but is coherent (e.g. English sentences), you could take it and slot it into the missing place in the visual output.
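One way to sketch that reconciliation is a line-level diff: keep the VLM output, and re-insert lines that only classical OCR produced. This is a minimal illustration using `difflib`, not a production merger (a real system would also check that the re-inserted text is coherent before trusting it):

```python
import difflib

def patch_missing_lines(ocr_lines, vlm_lines):
    """Insert lines that classical OCR found but the VLM output dropped."""
    sm = difflib.SequenceMatcher(a=ocr_lines, b=vlm_lines, autojunk=False)
    merged = []
    for tag, a1, a2, b1, b2 in sm.get_opcodes():
        if tag == "delete":
            # Present in OCR, absent from the VLM: slot it back in
            merged.extend(ocr_lines[a1:a2])
        else:
            # Otherwise prefer the VLM's version of the text
            merged.extend(vlm_lines[b1:b2])
    return merged
```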
As others mentioned, accuracy is only one part of the solution criteria; others include how the preprocessing engine scales/performs at large scale, and how it handles very complex documents like bank loan forms with checkboxes, IRS tax forms with multi-layered nested tables, etc.
https://unstract.com/llmwhisperer/
LLMWhisperer is a part of Unstract - An open-source tool for unstructured document ETL.
That converted NASA doc should be included in the repo and linked in the readme, if you haven't already.
We're not talking about some hardcore archiving system for the Library of Congress here. The goal is to boost consistency whenever you're feeding PDF context into an LLM-powered tool. Appreciate the feedback, I'll be sure to add that in.
> The goal is to boost consistency whenever you're feeding PDF context into an LLM-powered tool.
These two assertions are contradictory.
There are no "solid prompts" which obviate anthropomorphic "LLM hallucinations." Also, there is no deterministic consistency when "feeding PDF context" into an intrinsically non-deterministic algorithm, as any "LLM-powered tool" is by definition.
This is so wrong. It sounds as if you have not used LLMs to do any real work.
I had previously done this manually with regex, and was surprised by the quality of GPT's end results, despite many failed iterations along the way. The work was done in two steps: first with pdf2text, then Python.
I'm still trying to create a script to extract the latest numbers from the FL website and append them to a CSV list, without re-running the stripping script on the whole PDF every time. Why? I want people to be able to freely search the entire history of winning numbers, which their web-hosted search function limits to only two of 30+ years.
I know there's a more efficient method, but I don't know more than that.
I'm surprised an LLM actually works for that purpose. It has been my experience with GPT reading PDFs that it'll get the first few entries from a PDF correct, then just start making up numbers.
I’ve tried a few times having gpt4 analyze a credit card statement and it adds random purchases and leaves out others. And that’s with a “clean” PDF. I wouldn’t trust an llm at all on an obfuscated pdf, at least not without thorough double checking.
Absolutely! It's a fucking criminal in that regard. But that's why everything is done with hard python code and the results are tested multiple times. As an assistant, gpt can be fabulous, but the user must run the necessary scripts on their own and be ever ready for a knife in the back at any moment.
Edit: below is an example of what it generated after a lot of debugging and hassle:
import re
import csv
from datetime import datetime

def clean_and_structure_data(text):
    """Cleans and structures the extracted text data."""
    # Regular expression pattern to match the lottery data
    pattern = r'(\d{2}/\d{2}/\d{2})\s+(E|M)\s+(\d{1})\s-\s(\d{1})\s-\s(\d{1})\s-\s(\d{1})(?:\s+FB\s+(\d))?'
    matches = re.findall(pattern, text)

    structured_data = []
    for match in matches:
        date, draw_type, n1, n2, n3, n4, fireball = match
        # Format the date to include the full year
        date = datetime.strptime(date, '%m/%d/%y').strftime('%m/%d/%Y')
        # Concatenate the numbers, ensuring leading zeros are preserved, and enclose in quotes
        numbers = f'"{n1}{n2}{n3}{n4}"'
        structured_data.append({
            'Date': date,
            'Draw': draw_type,
            'Numbers': numbers,
            'Fireball': fireball or ''  # Use empty string if Fireball is None
        })
    return structured_data

def save_to_csv(data, output_path):
    """Saves the structured data to a CSV file."""
    # Sort data by date in descending order
    sorted_data = sorted(data, key=lambda x: datetime.strptime(x['Date'], '%m/%d/%Y'), reverse=True)
    with open(output_path, 'w', newline='') as csvfile:
        fieldnames = ['Date', 'Draw', 'Numbers', 'Fireball']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in sorted_data:
            writer.writerow(row)

def main():
    txt_path = 'PICK4.txt'  # Ensure this path points to your actual text file
    output_csv_path = 'output.csv'  # Ensure this path is where you want the CSV file saved
    try:
        with open(txt_path, 'r') as file:
            text = file.read()
        cleaned_data = clean_and_structure_data(text)
        save_to_csv(cleaned_data, output_csv_path)
        print(f"Data successfully extracted and saved to {output_csv_path}")
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    main()

Unsearchable, weird characters behind the curtain, etc.
But I don't blame deliberate obfuscation (or any other deliberate attempt to hide information) at all.
Instead, I simply blame incompetence.
(There's a ton of shitty PDFs in the world; this is just an example that I've encountered recently.)
1) I'm a rebel
2) I am irritated by deliberate obfuscations of public data, especially by a source that I suspect is corrupt. Although my extensive analysis has not yet revealed any significant pattern anomalies in their numbers.
3) It's kind of my re-intro into python, which I never made significant progress in but always wanted to.
4) It's literally the real history of all winning numbers since inception. Individuals may have various reasons for accessing this data, but I've been using it to test for manipulation. I presume for most folks it would be curiosity, or gambler's fallacy type stuff. Regardless, it shouldn't be obfuscated.
Do you think the officially published data would be 100% correct if they were trying to hide something?
I've also compiled a list of all numbers that have never occurred, counts of each occurrence, and a lot more. My anomaly analytics have included everything I, as an ignoramus, can throw at it: chi-squared, isolation forest, time series, and a lot of stuff I don't properly understand. Most anomalies found have been, if narrowly, within expected randomness, but I intend to fortify my proddings eventually. Although I'm actually confident I'm barking up the wrong tree, the data obfuscation is objectively dubious, whatever the reason.
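For the chi-squared part, here's a minimal pure-Python version of the digit-uniformity check I mean (the 16.92 critical value is the standard one for 9 degrees of freedom at alpha = 0.05):

```python
from collections import Counter

def digit_chisquare_stat(draws):
    """Chi-squared statistic for uniformity of digits 0-9 across all draws.

    `draws` is a list of strings like "1234" (one Pick 4 result each)."""
    counts = Counter(d for draw in draws for d in draw)
    total = sum(counts.values())
    expected = total / 10  # uniform expectation for each of the 10 digits
    return sum((counts.get(str(i), 0) - expected) ** 2 / expected
               for i in range(10))

CRITICAL_9DF_05 = 16.92  # chi-squared critical value, df=9, alpha=0.05

def looks_uniform(draws):
    """True if the digit frequencies are within expected randomness."""
    return digit_chisquare_stat(draws) < CRITICAL_9DF_05
```

This only tests marginal digit frequencies; it would not catch, say, positional or sequential patterns, which need separate tests.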
I appreciate your work, your intent, and your sharing it. It's very important to understand what you're doing, and its context, when you share it.
At that point, you are responsible for it, and the choices you make when communicating about it reflect on you.
I've been testing it out on pitch decks made in Figma and saved as JPGs. Surprisingly, the LLM OCR outperformed top dogs like SolidDocuments and PDFtron. Since I'm mainly after getting good context for the LLM from PDFs, I've been using this hybrid setup, bringing in the LLM OCR for pages that need it. In my book, this API is perfect for these kinds of situations.
I know this was an issue when GPT 4 vision initially came out due to training, not sure if it's a solved problem or if your tool handles this.
Let's run some numbers:
- Average token usage per image: ~1200
- Total tokens per page (including prompt): ~1500
- [GPT4o] Input token cost: $5 per million tokens
- [GPT4o] Output token cost: $15 per million tokens

For 1000 documents:
- Estimated total cost: $15
This represents excellent value considering the consistency and flexibility provided. For further cost optimization, consider:
1. Utilizing GPT-4o mini: reduces cost to approximately $8 per 1000 documents
2. Implementing the batch API: further reduces cost to around $4 per 1000 documents
I think it offers an optimal balance of affordability & reliability.
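As a sanity check on that arithmetic, here's a tiny helper (the 500 output tokens per page is my own guess; it's the figure that reproduces the ~$15 estimate at the quoted GPT4o prices):

```python
def batch_cost(pages, input_tokens_per_page, output_tokens_per_page,
               input_price_per_m, output_price_per_m):
    """Dollar cost for a batch of pages at per-million-token pricing."""
    total_in = pages * input_tokens_per_page
    total_out = pages * output_tokens_per_page
    return (total_in * input_price_per_m
            + total_out * output_price_per_m) / 1_000_000

# 1000 pages at ~1500 input tokens (image + prompt), ~500 output tokens,
# at GPT4o's $5/M input and $15/M output pricing:
cost = batch_cost(1000, 1500, 500, 5.0, 15.0)  # -> 15.0 dollars
```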
PS: For comparison, one of the most affordable solutions on the market, CloudConvert, charges ~$30 for 1K documents (PDFTron mode requires 4 credits).
It is hard to trust "you" when ChatGPT wrote that text. You never know which part of the answer is genuine and which part was made up by ChatGPT.
To actually answer that question: Pricing varies quite a bit depending on what exactly you want to do with a document.
Text detection generally costs $1.5 per 1k pages:
https://cloud.google.com/vision/pricing
https://aws.amazon.com/textract/pricing/
https://azure.microsoft.com/en-us/pricing/details/ai-documen...
Oh, and if you throw in a line about LaTeX, it'll make things even more consistent. Just add it to that markdown definition part I set up. Honestly, it'll probably work pretty well as is - should be way better than those clunky old OCR systems.
Disclaimer: I'm the founder.
The reason is that these multimodal LLMs can give you descriptions/OCR/etc., but they cannot give you quantifiable information about placement.
So if there was a picture of a tiger in the middle of a page converted to a bitmap, you couldn't get the LLM to give you something like: "Image detected at pixel position (120, 200) - (240, 500)" - and that's really what you want.
You almost need segmentation-system middleware that the LLM can forward to, which can cut out these images for use in markdown syntax:
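Something like this hypothetical glue, where a segmentation model (assumed, not shown) supplies pixel boxes and we emit markdown image references for the cropped files (cropping itself would be e.g. Pillow's `img.crop((x1, y1, x2, y2))`):

```python
def images_to_markdown(detections, page_id):
    """Turn detector output [(x1, y1, x2, y2), ...] into markdown image
    references that the LLM's text output can point at."""
    lines = []
    for i, (x1, y1, x2, y2) in enumerate(detections):
        filename = f"page{page_id}_img{i}.png"  # where the crop would be saved
        lines.append(f"![figure at ({x1}, {y1})-({x2}, {y2})]({filename})")
    return "\n".join(lines)
```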
I won't tell them :) :D >:D :|