I think a much better test would be something like "List of elements by atomic properties" [1], which has a lot of adjacent numbers in a similar range and overlapping first/last column types. However, the danger with that table is that the LLM might infer the values from the element names alone, since they're well-known physical constants. The list of countries by population density might be less predictable [2], or the list of largest cities [3].
The test should be repeated with every available sorting function too, to see if that causes any new errors.
[1] https://en.wikipedia.org/wiki/List_of_elements_by_atomic_pro...
[2] https://en.wikipedia.org/wiki/List_of_countries_and_dependen...
[3] https://en.wikipedia.org/wiki/List_of_largest_cities#List
Instead, using some random, messy, scattered-with-spam site would be a much more realistic test environment.
This isn't a good idea, if you want a fair test. See https://gwern.net/doc/reinforcement-learning/safe/2023-krako..., specifically https://arxiv.org/abs/1712.02950.
To be sure - shouldn't you be asking questions based on data that is guaranteed not to be in its training set?
They say "LLMs are trained on the web" - are the web pages converted from HTML into Markdown before being fed into training?
If I were not a human but some other kind of being suspended above this situation, with no skin in the game so to speak, it would all seem so terribly inefficient... But as a fleshy mortal I do understand how we got here.
If you want the AI to be able to select stuff, give it cheerio or jQuery access to navigate through the HTML document.
If you need to give tags, classes, and IDs to the LLM, I use an HTML-to-pug converter like https://www.npmjs.com/package/html2pug, which strips a lot of text and cuts costs. I don't think LLMs are particularly trained on pug content though, so take this with a grain of salt.
Shouldn't take more than 5 minutes to put together w/ Claude tbh
I use more or less this code as a starting point for a variety of use cases, and it seems to work just fine for them (scraping and processing travel blogs, which tend to have pretty consistent layouts/structures).
Some variations can make this better by adding logic to look for the `main` content and ignore `nav` and `footer` (or variants thereof whether using semantic tags or CSS selectors) and taking only the `innerText` from the main container.
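The main-content variation described above can be sketched with nothing but the standard library. This is a rough illustration (the class and function names are mine, not from any of the tools in this thread): keep text inside `main`, skip anything inside `nav`, `footer`, `script`, or `style`; a real pipeline would also fall back to CSS selectors for pages without semantic tags.

```python
from html.parser import HTMLParser

# Tags whose contents we discard entirely (illustrative choice).
SKIP = {"nav", "footer", "script", "style"}

class MainTextExtractor(HTMLParser):
    """Collect text nodes that are inside <main> but not inside SKIP tags."""

    def __init__(self):
        super().__init__()
        self.in_main = 0       # depth of nested <main> elements
        self.skip_depth = 0    # depth of nested SKIP elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.in_main += 1
        elif tag in SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.in_main:
            self.in_main -= 1
        elif tag in SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_main and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def main_inner_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

For example, `main_inner_text('<nav>menu</nav><main><p>Hello</p></main>')` keeps only the paragraph text.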
We ingest your data wherever you point our crawlers and then clean it for use in RAGs or chained LLMs.
One library we like a lot is Trafilatura [1]. It does a great job of taking the full HTML page and returning the most semantically relevant parts.
It works well for LLM work as well as generating embeddings for vectors and downstream things.
I use it nearly hourly for my HN summarizer HackYourNews (https://hackyournews.com).
There are a few optimizations we can make:
- strip all content in <script/> and <style/>
- use Readability.js for articles
- extract structured content from oEmbed
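The first optimization, stripping script and style blocks, can be sketched as a cheap regex pre-filter. To be clear, this is my own illustrative snippet, not code from the thread; regex is fragile on pathological HTML, so a real pipeline would use a proper HTML parser or Readability.js for this.

```python
import re

def strip_script_style(html: str) -> str:
    """Drop <script> and <style> blocks, including their contents.

    A first-pass sketch before sending a page to an LLM; proper
    parsing is more robust, but this removes the bulk of the noise.
    """
    return re.sub(
        r"<(script|style)\b[^>]*>.*?</\1\s*>",
        "",
        html,
        flags=re.IGNORECASE | re.DOTALL,
    )
```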
It works surprisingly well for me, even with gpt-4o-mini
name=john,age=23
name=anna,age=26
Rather than this:
{
  matches: [
    { name: "john", age: 23 },
    { name: "anna", age: 26 }
  ]
}
https://github.com/pugjs/pug?tab=readme-ov-file#syntax
It is whitespace sensitive, but essentially it looks like that. I doubt pug is the only template engine with a syntax like this, though.
https://news.ycombinator.com/item?id=41428274
Edit: looks like it's actually the same author
1. Pretrain models with any legal, scraped content. That includes updating existing models with recent data.
2. Have our own private collection of pages we’ve looked at. Then, we can search them with a local engine.
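Point 2 above, searching a private archive of visited pages with a local engine, can be illustrated with a toy in-memory inverted index. This is only a sketch of the idea (the class name is mine); a real setup would use something like SQLite FTS5 or a dedicated search library.

```python
from collections import defaultdict

class LocalPageIndex:
    """Toy local search over saved pages: token -> set of URLs."""

    def __init__(self):
        self.pages = {}                # url -> full text
        self.index = defaultdict(set)  # token -> {url, ...}

    def add(self, url: str, text: str):
        self.pages[url] = text
        for token in text.lower().split():
            self.index[token].add(url)

    def search(self, query: str) -> list[str]:
        # Return URLs containing every query token (simple AND search).
        sets = [self.index[t] for t in query.lower().split()]
        if not sets:
            return []
        return sorted(set.intersection(*sets))
```

Usage: index each page as you browse it, then `search("local engine")` returns the matching URLs without touching the network.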
Chain of thought or some similar strategies (I hate that they have their own name and like a paper and authors, lol) can help you push that 0.9 to a 0.95-0.99.
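One of those named strategies, self-consistency, amounts to sampling the model several times and majority-voting the answers, which is roughly how you buy a few extra points of accuracy. A minimal sketch, where `sample_fn` stands in for an LLM call and is purely illustrative:

```python
from collections import Counter

def majority_vote(sample_fn, n: int = 5):
    """Call sample_fn n times and return the most common answer.

    sample_fn is a stand-in for one LLM extraction attempt; with an
    extractor that is right ~90% of the time, voting over several
    samples pushes the aggregate accuracy higher.
    """
    answers = [sample_fn() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```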
Certainly good enough for gpt input, it's quite good.
By default it will strip all HTML tags and return just the text:
curl 'https://simonwillison.net/' | strip-tags
But you can also tell it you just want to get back the area of a page identified by one or more CSS selectors:
curl 'https://simonwillison.net/' | strip-tags .quote
Or you can ask it to keep specific tags if you think those might help provide extra context to the LLM:
curl 'https://simonwillison.net/' | strip-tags .quote -t div -t blockquote
Add "-m" to minify the output (basically stripping most whitespace). Running this command:
curl 'https://simonwillison.net/' | strip-tags .quote -t div -t blockquote -m
Gives me back output that starts like this:
<div class="quote segment"> <blockquote>history | tail -n
2000 | llm -s "Write aliases for my zshrc based on my
terminal history. Only do this for most common features.
Don't use any specific files or directories."</blockquote> —
anjor #
3:01 pm
/ ai, generative-ai, llms, llm </div>
<div class="quote segment"> <blockquote>Art is notoriously
hard to define, and so are the differences between good art
and bad art. But let me offer a generalization: art is
something that results from making a lot of choices. […] to
oversimplify, we can imagine that a ten-thousand-word short
story requires something on the order of ten thousand
choices. When you give a generative-A.I. program a prompt,
you are making very few choices; if you supply a hundred-word
prompt, you have made on the order of a hundred choices. If
an A.I. generates a ten-thousand-word story based on your
prompt, it has to fill in for all of the choices that you are
not making.</blockquote> — Ted Chiang #
10:09 pm
/ art, new-yorker, ai, generative-ai, ted-chiang </div>
I also often use the https://r.jina.ai/ proxy - add a URL to that and it extracts the key content (using Puppeteer) and returns it converted to Markdown, e.g. https://r.jina.ai/https://simonwillison.net/2024/Sep/2/anato...
This results in a kind of innerText you get in browsers, great and light to pass into LLMs.
defp extract_inner_text(html) do
  html
  |> Floki.parse_document!()
  |> Floki.find("body")
  # Drop <script> and <style> nodes entirely; keep everything else.
  |> Floki.traverse_and_update(fn
    {tag, _attrs, _children} when tag in ["script", "style"] ->
      nil

    node ->
      node
  end)
  # Join the remaining text nodes, then collapse runs of whitespace.
  |> Floki.text(sep: " ")
  |> String.trim()
  |> String.replace(~r/\s+/, " ")
end