I think a much better test would be something like "List of elements by atomic properties" [1], which has a lot of adjacent numbers in a similar range and overlapping first/last column types. However, the danger with that table is that the LLM might infer the values from the element names alone, since they're well-known physical constants. The list of countries by population density might be less predictable [2], or the list of largest cities [3].
The test should be repeated with every available sorting function too, to see if that causes any new errors.
[1] https://en.wikipedia.org/wiki/List_of_elements_by_atomic_pro...
[2] https://en.wikipedia.org/wiki/List_of_countries_and_dependen...
[3] https://en.wikipedia.org/wiki/List_of_largest_cities#List
Instead, using some random, messy, scattered-with-spam site would be a much more realistic test environment.
This isn't a good idea, if you want a fair test. See https://gwern.net/doc/reinforcement-learning/safe/2023-krako..., specifically https://arxiv.org/abs/1712.02950.
To be sure - shouldn't you be asking questions based on data that is guaranteed not to be in its training set?
They say "LLMs are trained on the web" - are the web pages converted from HTML into Markdown before being fed into training?
If I were not a human but some other kind of being suspended above this situation, with no skin in the game so to speak, it would all seem so terribly inefficient... But as a fleshy mortal I do understand how we got here.
If you want the AI to be able to select stuff, give it cheerio or jQuery access to navigate through the HTML document.
If you need to give tags, classes, and IDs to the LLM, I use an HTML-to-pug converter like https://www.npmjs.com/package/html2pug, which strips a lot of text and cuts costs. I don't think LLMs are particularly trained on pug content though, so take this with a grain of salt.
Shouldn't take more than 5 minutes to put together w/ Claude tbh
I use more or less this code as a starting point for a variety of use cases, and it seems to work just fine for them (scraping and processing travel blogs, which tend to have pretty consistent layouts/structures).
Some variations can make this better by adding logic to look for the `main` content and ignore `nav` and `footer` (or variants thereof whether using semantic tags or CSS selectors) and taking only the `innerText` from the main container.
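The main-content variation described above can be sketched with nothing but the standard library. This is a rough illustration (the class and function names are mine, not from any of the tools in this thread): keep text inside `main`, skip anything inside `nav`, `footer`, `script`, or `style`; a real pipeline would also fall back to CSS selectors for pages without semantic tags.

```python
from html.parser import HTMLParser

# Tags whose contents we discard entirely (illustrative choice).
SKIP = {"nav", "footer", "script", "style"}

class MainTextExtractor(HTMLParser):
    """Collect text nodes that are inside <main> but not inside SKIP tags."""

    def __init__(self):
        super().__init__()
        self.in_main = 0       # depth of nested <main> elements
        self.skip_depth = 0    # depth of nested SKIP elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.in_main += 1
        elif tag in SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.in_main:
            self.in_main -= 1
        elif tag in SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_main and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def main_inner_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

For example, `main_inner_text('<nav>menu</nav><main><p>Hello</p></main>')` keeps only the paragraph text.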
We ingest your data wherever you point our crawlers and then clean it for use in RAGs or chained LLMs.
One library we like a lot is Trafilatura [1]. It does a great job of taking the full HTML page and returning the most semantically relevant parts.
It works well for LLM work as well as generating embeddings for vectors and downstream things.
I use it nearly hourly for my HN summarizer HackYourNews (https://hackyournews.com).
There are a few optimizations we can make:
- strip all content in <script/> and <style/>
- use Readability.js for articles
- extract structured content from oEmbed
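The first optimization, stripping script and style blocks, can be sketched as a cheap regex pre-filter. To be clear, this is my own illustrative snippet, not code from the thread; regex is fragile on pathological HTML, so a real pipeline would use a proper HTML parser or Readability.js for this.

```python
import re

def strip_script_style(html: str) -> str:
    """Drop <script> and <style> blocks, including their contents.

    A first-pass sketch before sending a page to an LLM; proper
    parsing is more robust, but this removes the bulk of the noise.
    """
    return re.sub(
        r"<(script|style)\b[^>]*>.*?</\1\s*>",
        "",
        html,
        flags=re.IGNORECASE | re.DOTALL,
    )
```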
It works surprisingly well for me, even with gpt-4o-mini
name=john,age=23
name=anna,age=26
Rather than this:
{
  matches: [
    { name: "john", age: 23 },
    { name: "anna", age: 26 }
  ]
}
https://github.com/pugjs/pug?tab=readme-ov-file#syntax
It is whitespace sensitive, but essentially it looks like that. I doubt pug is the only template engine with a syntax like this, though.
https://news.ycombinator.com/item?id=41428274
Edit: looks like it's actually the same author
1. Pretrain models with any legal, scraped content. That includes updating existing models with recent data.
2. Have our own private collection of pages we’ve looked at. Then, we can search them with a local engine.
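Point 2 above, searching a private archive of visited pages with a local engine, can be illustrated with a toy in-memory inverted index. This is only a sketch of the idea (the class name is mine); a real setup would use something like SQLite FTS5 or a dedicated search library.

```python
from collections import defaultdict

class LocalPageIndex:
    """Toy local search over saved pages: token -> set of URLs."""

    def __init__(self):
        self.pages = {}                # url -> full text
        self.index = defaultdict(set)  # token -> {url, ...}

    def add(self, url: str, text: str):
        self.pages[url] = text
        for token in text.lower().split():
            self.index[token].add(url)

    def search(self, query: str) -> list[str]:
        # Return URLs containing every query token (simple AND search).
        sets = [self.index[t] for t in query.lower().split()]
        if not sets:
            return []
        return sorted(set.intersection(*sets))
```

Usage: index each page as you browse it, then `search("local engine")` returns the matching URLs without touching the network.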
Chain of thought or some similar strategies (I hate that they have their own name and like a paper and authors, lol) can help you push that 0.9 to a 0.95-0.99.
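One of those named strategies, self-consistency, amounts to sampling the model several times and majority-voting the answers, which is roughly how you buy a few extra points of accuracy. A minimal sketch, where `sample_fn` stands in for an LLM call and is purely illustrative:

```python
from collections import Counter

def majority_vote(sample_fn, n: int = 5):
    """Call sample_fn n times and return the most common answer.

    sample_fn is a stand-in for one LLM extraction attempt; with an
    extractor that is right ~90% of the time, voting over several
    samples pushes the aggregate accuracy higher.
    """
    answers = [sample_fn() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```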
Certainly good enough for gpt input, it's quite good.
By default it will strip all HTML tags and return just the text:
curl 'https://simonwillison.net/' | strip-tags
But you can also tell it you just want to get back the area of a page identified by one or more CSS selectors:
curl 'https://simonwillison.net/' | strip-tags .quote
Or you can ask it to keep specific tags if you think those might help provide extra context to the LLM:
curl 'https://simonwillison.net/' | strip-tags .quote -t div -t blockquote
Add "-m" to minify the output (basically stripping most whitespace). Running this command:
curl 'https://simonwillison.net/' | strip-tags .quote -t div -t blockquote -m
Gives me back output that starts like this:
<div class="quote segment"> <blockquote>history | tail -n
2000 | llm -s "Write aliases for my zshrc based on my
terminal history. Only do this for most common features.
Don't use any specific files or directories."</blockquote> —
anjor #
3:01 pm
/ ai, generative-ai, llms, llm </div>
<div class="quote segment"> <blockquote>Art is notoriously
hard to define, and so are the differences between good art
and bad art. But let me offer a generalization: art is
something that results from making a lot of choices. […] to
oversimplify, we can imagine that a ten-thousand-word short
story requires something on the order of ten thousand
choices. When you give a generative-A.I. program a prompt,
you are making very few choices; if you supply a hundred-word
prompt, you have made on the order of a hundred choices. If
an A.I. generates a ten-thousand-word story based on your
prompt, it has to fill in for all of the choices that you are
not making.</blockquote> — Ted Chiang #
10:09 pm
/ art, new-yorker, ai, generative-ai, ted-chiang </div>
I also often use the https://r.jina.ai/ proxy - add a URL to that and it extracts the key content (using Puppeteer) and returns it converted to Markdown, e.g. https://r.jina.ai/https://simonwillison.net/2024/Sep/2/anato...
This results in a kind of innerText you get in browsers, great and light to pass into LLMs.
defp extract_inner_text(html) do
  html
  |> Floki.parse_document!()
  |> Floki.find("body")
  # Drop <script> and <style> nodes entirely; keep everything else.
  |> Floki.traverse_and_update(fn
    {tag, _attrs, _children} when tag in ["script", "style"] ->
      nil

    node ->
      node
  end)
  # Join the remaining text nodes, then collapse runs of whitespace.
  |> Floki.text(sep: " ")
  |> String.trim()
  |> String.replace(~r/\s+/, " ")
end