In testing with NASA's Apollo 17 flight documents, it successfully converted complex, multi-oriented pages into well-structured Markdown.
The project is open-source and available on GitHub. Feedback is welcome.
As others have mentioned, consistency is key when parsing documents, and consistency is not something LLMs offer.
The output might look plausible, but without proper validation this is just a nice local playground that can’t make it to production.
Turns out the model needs a temperature of zero (it then seems to behave well, at least in simple tests), but that wasn't set in the model settings.
https://github.com/ollama/ollama/issues/6875#issuecomment-23...
I purposely set the temperature to 0.1, thinking the LLM might need a little wiggle room when whipping up those markdown tables. You know, just enough leeway to get creative if needed.
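For anyone hitting the same thing, here's a sketch of pinning the sampling options per-request in Ollama rather than relying on Modelfile defaults (the model name is a placeholder):

```python
import json

# Ollama /api/generate payload with sampling options set explicitly,
# instead of trusting whatever the Modelfile defaults happen to be.
payload = {
    "model": "llama3.2-vision",  # placeholder model name
    "prompt": "Convert this page to a markdown table.",
    "stream": False,
    "options": {
        "temperature": 0,  # or 0.1 for a little wiggle room
        "seed": 42,        # a fixed seed further improves repeatability
    },
}

request_body = json.dumps(payload)
```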
I've tried multiple OCRs before, and it's hard to tell whether the output is accurate other than by comparing manually.
I created a tool to visualise the output of OCR [0] to see what’s missing and there are many cases that would be quite concerning especially when working with financial data.
This tool wouldn't work with LLMs, as they don't return per-character recognition data (to my knowledge), which makes it harder to evaluate them at scale.
If I wanted to use LLMs for this task, I would use them to help train an ML model to do OCR better, for example by generating thousands of synthetic training samples.
I have seen this odd kind of inconsistency in generating the same results, sometimes even within the same chat after it starts off fine.
I was once trying to extract handwritten dates and times from a very specific part of the page in a large PDF document, in batches of 10 pages at a time. With some documents it started by refusing, but not in other chat windows I tried with the same document. Sometimes it would say there was an error, and then it would work in a new chat window. I'm not sure why, but just starting a new chat works in these kinds of situations.
Sometimes it will start off fine with OCR, then as the task progresses it will start hallucinating. Even though the text to be extracted follows a pattern, like dates, it could not for the life of it get them right.
I'm doubtful you meant what you wrote here. Using a readymade UI or API to perform an effectively magical task (for most of us) is an entirely different paradigm to "just train your own model."
In reality, for us non-ML model training mortals, we're actually probably better off hiring a human to do basic data entry.
User: Extract x from the given scanned document. <sample_img_1>
Assistant: <sample_img_1_output>
User: Extract x from the given scanned document. <sample_img_2>
Assistant: <sample_img_2_output>
User: Extract x from the given scanned document. <query_image>
In my experience, this seems to make the model significantly more consistent.
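For what it's worth, the transcript pattern above maps onto a chat-completions message array like this (OpenAI-style content parts; the URLs, outputs, and instruction text are placeholders):

```python
def build_fewshot_messages(examples, query_image_url, instruction):
    """Build a few-shot chat history: each example is an
    (image_url, expected_output) pair, followed by the real query image."""
    messages = []
    for image_url, expected in examples:
        messages.append({
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        })
        # The assistant turn shows the model what a correct answer looks like
        messages.append({"role": "assistant", "content": expected})
    # Finally, the query image with no answer attached
    messages.append({
        "role": "user",
        "content": [
            {"type": "text", "text": instruction},
            {"type": "image_url", "image_url": {"url": query_image_url}},
        ],
    })
    return messages
```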
Super frustrating when really trying to accomplish something!
The hard part is preventing the model from ignoring parts of the page and from hallucinating (see some of the GPT-4o samples here, like the Xanax notice: https://www.llamaindex.ai/blog/introducing-llamaparse-premiu...)
However, these models will keep getting better, and we may soon have a good PDF-to-Markdown model.
- VLMs are way better at handling layout and context where OCR systems fail miserably
- VLMs read documents like humans do, which makes dealing with special layouts like bullets, tables, charts, and footnotes much more tractable with a single approach, rather than having to special-case a whole bunch of OCR + post-processing
- VLMs are definitely more expensive, but can be specialized and distilled for accurate and cost effective inference
In general, I think vision + LLMs can be trained explicitly to "extract" information and avoid reasoning/hallucinating about the text. The reasoning can be another module altogether.
If your old-school OCR output contains text that is absent from the visual model's output but is coherent (e.g. English sentences), you could take it and slot it into the missing place in the visual output.
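One way to sketch that reconciliation is a line-level diff: keep the VLM output, and re-insert lines that only classical OCR produced. This is a minimal illustration using `difflib`, not a production merger (a real system would also check that the re-inserted text is coherent before trusting it):

```python
import difflib

def patch_missing_lines(ocr_lines, vlm_lines):
    """Insert lines that classical OCR found but the VLM output dropped."""
    sm = difflib.SequenceMatcher(a=ocr_lines, b=vlm_lines, autojunk=False)
    merged = []
    for tag, a1, a2, b1, b2 in sm.get_opcodes():
        if tag == "delete":
            # Present in OCR, absent from the VLM: slot it back in
            merged.extend(ocr_lines[a1:a2])
        else:
            # Otherwise prefer the VLM's version of the text
            merged.extend(vlm_lines[b1:b2])
    return merged
```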
As others mentioned, accuracy is only one part of the solution criteria; others include how the preprocessing engine scales/performs at large scale, and how it handles very complex documents like bank loan forms with checkboxes, IRS tax forms with multi-layered nested tables, etc.
https://unstract.com/llmwhisperer/
LLMWhisperer is a part of Unstract - An open-source tool for unstructured document ETL.
That converted NASA doc should be included in the repo and linked in the readme, if you haven't already.
We're not talking about some hardcore archiving system for the Library of Congress here. The goal is to boost consistency whenever you're feeding PDF context into an LLM-powered tool. Appreciate the feedback, I'll be sure to add that in.
> The goal is to boost consistency whenever you're feeding PDF context into an LLM-powered tool.
These two assertions are contradictory.
There are no "solid prompts" which obviate anthropomorphic "LLM hallucinations." Also, there is no deterministic consistency when "feeding PDF context" into an intrinsically non-deterministic algorithm, as any "LLM-powered tool" is by definition.
This is so wrong. It sounds as if you have not used LLMs to do any real work.
I had previously done this manually with regex, and was surprised by the quality of GPT's end results, despite many failed iterations along the way. The work was done in two steps: first with pdf2text, then Python.
I'm still trying to create a script to extract the latest numbers from the FL website and append them to a CSV list, without re-running the stripping script on the whole PDF every time. Why? I want people to be able to freely search the entire history of winning numbers, which their web-hosted search function limits to only two of 30+ years.
I know there's a more efficient method, but I don't know more than that.
I'm surprised an LLM actually works for that purpose. It has been my experience with GPT reading PDFs that it'll get the first few entries from a PDF correct, then just start making up numbers.
I’ve tried a few times having gpt4 analyze a credit card statement and it adds random purchases and leaves out others. And that’s with a “clean” PDF. I wouldn’t trust an llm at all on an obfuscated pdf, at least not without thorough double checking.
Absolutely! It's a fucking criminal in that regard. But that's why everything is done with hard python code and the results are tested multiple times. As an assistant, gpt can be fabulous, but the user must run the necessary scripts on their own and be ever ready for a knife in the back at any moment.
Edit: below is an example of what it generated after a lot of debugging and hassle:
import re
import csv
from datetime import datetime

def clean_and_structure_data(text):
    """Cleans and structures the extracted text data."""
    # Regular expression pattern to match the lottery data
    pattern = r'(\d{2}/\d{2}/\d{2})\s+(E|M)\s+(\d{1})\s-\s(\d{1})\s-\s(\d{1})\s-\s(\d{1})(?:\s+FB\s+(\d))?'
    matches = re.findall(pattern, text)

    structured_data = []
    for match in matches:
        date, draw_type, n1, n2, n3, n4, fireball = match
        # Format the date to include the full year
        date = datetime.strptime(date, '%m/%d/%y').strftime('%m/%d/%Y')
        # Concatenate the numbers, ensuring leading zeros are preserved, and enclose in quotes
        numbers = f'"{n1}{n2}{n3}{n4}"'
        structured_data.append({
            'Date': date,
            'Draw': draw_type,
            'Numbers': numbers,
            'Fireball': fireball or ''  # Use empty string if Fireball is None
        })
    return structured_data

def save_to_csv(data, output_path):
    """Saves the structured data to a CSV file."""
    # Sort data by date in descending order
    sorted_data = sorted(data, key=lambda x: datetime.strptime(x['Date'], '%m/%d/%Y'), reverse=True)
    with open(output_path, 'w', newline='') as csvfile:
        fieldnames = ['Date', 'Draw', 'Numbers', 'Fireball']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in sorted_data:
            writer.writerow(row)

def main():
    txt_path = 'PICK4.txt'  # Ensure this path points to your actual text file
    output_csv_path = 'output.csv'  # Ensure this path is where you want the CSV file saved
    try:
        with open(txt_path, 'r') as file:
            text = file.read()
        cleaned_data = clean_and_structure_data(text)
        save_to_csv(cleaned_data, output_csv_path)
        print(f"Data successfully extracted and saved to {output_csv_path}")
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    main()

Unsearchable, weird characters behind the curtain, etc.
But I don't blame deliberate obfuscation (or any other deliberate attempt to hide information) at all.
Instead, I simply blame incompetence.
(There's a ton of shitty PDFs in the world; this is just an example that I've encountered recently.)
1) I'm a rebel
2) I am irritated by deliberate obfuscations of public data, especially by a source that I suspect is corrupt. Although my extensive analysis has not yet revealed any significant pattern anomalies in their numbers.
3) It's kind of my re-intro into python, which I never made significant progress in but always wanted to.
4) It's literally the real history of all winning numbers since inception. Individuals may have various reasons for accessing this data, but I've been using it to test for manipulation. I presume for most folks it would be curiosity, or gambler's fallacy type stuff. Regardless, it shouldn't be obfuscated.
Do you think the officially published data would be 100% correct if they were trying to hide something?
I've also compiled a list of all numbers that have never occurred, counts of each occurrence, and a lot more. My anomaly analytics have included everything I, as an ignoramus, can throw at it: chi-squared, isolation forest, time series, and a lot of stuff I don't properly understand. Most anomalies found have been, if narrowly, within expected randomness, but I intend to fortify my proddings eventually. Although I'm actually confident I'm barking up the wrong tree, the data obfuscation is objectively dubious, whatever the reason.
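For the chi-squared part, here's a minimal pure-Python version of the digit-uniformity check I mean (the 16.92 critical value is the standard one for 9 degrees of freedom at alpha = 0.05):

```python
from collections import Counter

def digit_chisquare_stat(draws):
    """Chi-squared statistic for uniformity of digits 0-9 across all draws.

    `draws` is a list of strings like "1234" (one Pick 4 result each)."""
    counts = Counter(d for draw in draws for d in draw)
    total = sum(counts.values())
    expected = total / 10  # uniform expectation for each of the 10 digits
    return sum((counts.get(str(i), 0) - expected) ** 2 / expected
               for i in range(10))

CRITICAL_9DF_05 = 16.92  # chi-squared critical value, df=9, alpha=0.05

def looks_uniform(draws):
    """True if the digit frequencies are within expected randomness."""
    return digit_chisquare_stat(draws) < CRITICAL_9DF_05
```

This only tests marginal digit frequencies; it would not catch, say, positional or sequential patterns, which need separate tests.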
I appreciate your work, your intent, and your sharing it. It's very important to understand what you're doing, and its context, when you share it.
At that point, you are responsible for it, and the choices you make when communicating about it reflect on you.
I've been testing it out on pitch decks made in Figma and saved as JPGs. Surprisingly, the LLM OCR outperformed top dogs like SolidDocuments and PDFtron. Since I'm mainly after getting good context for the LLM from PDFs, I've been using this hybrid setup, bringing in the LLM OCR for pages that need it. In my book, this API is perfect for these kinds of situations.
I know this was an issue when GPT 4 vision initially came out due to training, not sure if it's a solved problem or if your tool handles this.
Let's run some numbers:
- Average token usage per image: ~1200
- Total tokens per page (including prompt): ~1500
- [GPT4o] Input token cost: $5 per million tokens
- [GPT4o] Output token cost: $15 per million tokens

For 1000 documents:
- Estimated total cost: $15
This represents excellent value considering the consistency and flexibility provided. For further cost optimization, consider:
1. Utilizing GPT-4o mini: reduces cost to approximately $8 per 1000 documents
2. Implementing the batch API: further reduces cost to around $4 per 1000 documents
I think it offers an optimal balance of affordability & reliability.
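As a sanity check on that arithmetic, here's a tiny helper (the 500 output tokens per page is my own guess; it's the figure that reproduces the ~$15 estimate at the quoted GPT4o prices):

```python
def batch_cost(pages, input_tokens_per_page, output_tokens_per_page,
               input_price_per_m, output_price_per_m):
    """Dollar cost for a batch of pages at per-million-token pricing."""
    total_in = pages * input_tokens_per_page
    total_out = pages * output_tokens_per_page
    return (total_in * input_price_per_m
            + total_out * output_price_per_m) / 1_000_000

# 1000 pages at ~1500 input tokens (image + prompt), ~500 output tokens,
# at GPT4o's $5/M input and $15/M output pricing:
cost = batch_cost(1000, 1500, 500, 5.0, 15.0)  # -> 15.0 dollars
```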
PS: For comparison, one of the most affordable solutions on the market, CloudConvert, charges ~$30 for 1K documents (PDFTron mode requires 4 credits).
It is hard to trust "you" when ChatGPT wrote that text. You never know which part of the answer is genuine and which part was made up by ChatGPT.
To actually answer that question: Pricing varies quite a bit depending on what exactly you want to do with a document.
Text detection generally costs $1.5 per 1k pages:
https://cloud.google.com/vision/pricing
https://aws.amazon.com/textract/pricing/
https://azure.microsoft.com/en-us/pricing/details/ai-documen...
Oh, and if you throw in a line about LaTeX, it'll make things even more consistent. Just add it to that markdown definition part I set up. Honestly, it'll probably work pretty well as is - should be way better than those clunky old OCR systems.
Disclaimer: I'm the founder.
The reason is that these multimodal LLMs can give you descriptions/OCR/etc., but they cannot give you quantifiable information about placement.
So if there was a picture of a tiger in the middle of a page converted to a bitmap, you couldn't get the LLM to give you something like: "Image detected at pixel position (120, 200) - (240, 500)" - and that's really what you want.
You almost need segmentation-system middleware that the LLM can forward to, which can cut out these images for use in markdown syntax:
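Something like this hypothetical glue, where a segmentation model (assumed, not shown) supplies pixel boxes and we emit markdown image references for the cropped files (cropping itself would be e.g. Pillow's `img.crop((x1, y1, x2, y2))`):

```python
def images_to_markdown(detections, page_id):
    """Turn detector output [(x1, y1, x2, y2), ...] into markdown image
    references that the LLM's text output can point at."""
    lines = []
    for i, (x1, y1, x2, y2) in enumerate(detections):
        filename = f"page{page_id}_img{i}.png"  # where the crop would be saved
        lines.append(f"![figure at ({x1}, {y1})-({x2}, {y2})]({filename})")
    return "\n".join(lines)
```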
I won't tell them :) :D >:D :|