Show HN: HtmlWasher – An HTML cleanup tool (opens in new tab)

(htmlwasher.com)

75 pointsseky8y ago46 comments

46 comments

The washer makes an XHR to /ajax/paste to do the 'washing'.

Seems like this could be done in JavaScript without an XHR, and not send your info to them.

However, https://www.htmlwasher.com/privacy/:

"The Operator may collect the personal data, such as, without limitation, (i) name; (ii) age; (iii) sex; (iv) address; (v) homepage URL address; (vi) telephone number; (vii) email address; (viii) bank account number; as well as (ix) any information relating and relevant to the Services, including, without limitation, opening and administering the Account, or getting feedback for improving the Services."

" In the event that the Operator is involved in a bankruptcy, merger, acquisition, reorganization or sale of assets, your personal data may be sold or transferred as part of that transaction."

strictnein8y ago

I'm guessing that's just a copy/paste from somewhere else.

It does make me wonder what the owners of the top Google results for JSON and XML prettifiers do with that data. The amount of passwords and other private info that gets pasted into those is probably pretty high.

devopsproject8y ago

> I'm guessing that's just a copy/paste from somewhere else.

In then event of a sell off, the new owners won't be asking about the original "intent" of the creator. They will be looking at the contract for ways to make money. The fact that you are paying nothing for this product makes it doubly suspicious.

goodgood8y ago

Pandoc can do this:

  cat tea-dance.html | pandoc --from=html --to=markdown | pandoc --from=markdown --to=html

I learned that from vimcasts.org: http://vimcasts.org/episodes/using-external-filter-commands-...

1 more reply

burnbabyburn8y ago

of this matter I really like https://github.com/mozilla/bleach .

is your project any different aside from the "service oriented" nature? (also I don't see any usage method, if not from the browser)

tyingq8y ago

Htmltidy is popular as well: https://github.com/htacg/tidy-html5

Though, on second glance, it doesn't do what HtmlWasher is doing here...stripping out classes, etc. It just cleans it up, unmatched tags and so forth.

sekyOP8y ago

yes - but this is not an online service - I wanted to build something handy you can use when you are stressed at the office need to reuse some other HTML doc quickly no time to install anything :-)

I also realized this tool/lib exists after doing Html Washer - I am considering to use their lib as an underlying lib for my project

jtraffic8y ago

Bleach also gives at least one use case for sanitizing html:

"Bleach is intended for sanitizing text from untrusted sources."

I think HtmlWasher should have something on the About tab.

sekyOP8y ago

there is link to my company where is my identity - so it is not anonymous :-) but yes - why not, will fill there something more

1 more reply

fredley8y ago

I prefer my files much cleaner: http://search.cpan.org/~dconway/Acme-Bleach-1.150/lib/Acme/B...

sekyOP8y ago

my project does not want to compete with any lib or commandline or other downloadable tool - yes - it's main purpose is convenience

thanks, I will consider using Bleach as an underlying lib / part of my service

tshadwell8y ago

from experience, I wouldn't recommend other than context-aware safe templating systems for html safety in this day and age.

to an even greater extent than templating systems, sanitization systems of this type need to be built by an expert and align perfectly with how browsers parse tags, which is no small feat.

to give more concrete examples, from a few minutes of testing:

<a href="javascript://%0Aalert`xss`">1</a> <- xss on click

<img src=javascript:alert(2)> <- XSS in Opera Mobile, Opera 10, early versions of IE

<img src="/logout"> <- csrf which affects nearly everything built without security knowhow

1 more reply

Continuous8y ago

This is brilliant!

I wrote an HTML file in Microsoft Word. Then uploaded that .html file which had 800 lines. HtmlWasher cleaned up all the file content, the endless meta tags, non sense IE style tags, etc.

sprremix8y ago

>I wrote an HTML file in Microsoft Word

Explain yourself

rbg2468y ago

My experience has been the company lawyer has written the contract in word, then it has to be exported to html and cleaned of all the cruft that gets saved with word docs.

dugluak8y ago

he probably means, saved word document as HTML.

sekyOP8y ago

thanks :-) I even didn't test it on Word's HTML

bluetidepro8y ago

This would be really useful as a service. Send a glob of html to their endpoint, and return what this site does (the cleaned/washed html). As a service, it could be more efficient than doing 1 file at a time on their site. Or better yet, it would be awesome to open source the way this cleans the html. Regardless, awesome site. I could see the use for various scenarios.

jacurtis8y ago

Take a look at HTMLTidy which is an open source library that does exactly this, but without having to send it off as a request, making it faster, cheaper, and more secure.

sekyOP8y ago

thanks for your comment :-) I think rather than using a web service people would use some library like the HTML Tidy. Regarding to opensourcing - I think I will rather use some opensource as a core lib for my project.

paradite8y ago

The first one is a bad idea in terms of security (hijacking and xss), scalability and reliability (see what happened to MathJax CDN and npm left-pad).

donald1238y ago

the idea itself is not bad, but the actual implementation is challenging to be reliable and secure.

DvdGiessen8y ago

Reminds me of a cleaner tool I wrote about 10+ years ago, a huge single God-class which would parse an HTML string, allowed me to do various transformations on the object tree, and rerendered the entire source code in correct and nicely indented XHTML. Back then I had unused server capacity, so I often used it to do compression of dynamically rendered pages from for example message boards. Also allowed me to place a badge bragging about my 100% W3C validator score, since the original software packages often did not produce such clean HTML. :p The code is actually still being run on every pageload for some old sites I never updated much since.

It has a tiny little webinterface a which remains online today on some underpowered server. Doesn't work well with anything except XHTML though. http://htmlcleaner.blackholestudios.nl/

tannhaeuser8y ago

If you're serious about HTML checking and cleanup consider using SGML and my (inofficial) HTML 5.1 DTD [1].

It doesn't do magic (like indentation or removing/simplifying CSS) if that's what you're after, but it gives you straightforward capabilities to filter out script elements, check/suppress event handler attributes and other places where JavaScript can occur maliciously in HTML, enforce presence of HTML elements, etc. Since it's entirely driven by an SGML DTD grammar for HTML it can be customized to death really (for context-dependent filtering, injection prevention, whatever).

[1]: http://sgmljs.net/blog/blog1701.html

richardwhiuk8y ago

"Reduces a HTML document (or fragment) to basic HTML tags and attributes" - meaning what exactly? What counts as a basic attribute?

sekyOP8y ago

Yes, that could be more specific (and I will make it probably configurable), currently it is: <a href>, <body>, <br>, <div>, <h1>, <h2>, <h3>, <h4>, <h5>, <h6>, <head>, <hr>, <html>, <i>, <img src width height>, <li>, <ol>, <p>, <ruby>, <strong>, <table>, <tbody>, <td colspan rowspan>, <th colspan rowspan>, <title>, <tr>

oneeyedpigeon8y ago

It also converts e.g. b to strong.

1 more reply

egfx8y ago

This should be a library or an API, otherwise I don't really see a use for this. Also seems overly aggressive, and there should be some options on what to keep. I see a need to remove JavaScript from HTML but keep events for example.

sekyOP8y ago

ok, yes why not make it customizable

kazinator8y ago

Here is one in C, with flex-generated lexing, for back-end use:

http://www.kylheku.com/cgit/hc/tree/

I used this for allowing HTML in a mailing list e-mails to be incorporated into the web archive. (The archiver is a modified version of Lurker.)

P.S. "wl" stands for "whitelist": what elements are allowed to pass through, and of those, which attributes are allowed to pass through. The condensed "wl" config file is translated into compiled-in static tables by the wl.txr script. No run-time config.

sekyOP8y ago

thanks :-)

kazinator8y ago

I've noticed that it lacks any sort of licensing information. If you use it, let me know; I will slap a BSD two-clause on it.

JustSomeNobody8y ago

What is the use case for this?

paradite8y ago

This is a huge pain in any web apps that deals with arbitrary pasting by users, especially in WYSIWYG rich text editors:

https://github.com/yabwe/medium-editor/blob/master/spec/past...

cbaleanu8y ago

One obvious use case is cleaning framework generated html from some random site.

sekyOP8y ago

e.g. when you are migrating an old website to a new one - that was when I missed such tool in the past

tyingq8y ago

I suppose it might also be useful for tools that generate very dirty html with tons of span elements. Word used to be that way, or ugh.."Frontpage".

hsivonen8y ago

This doesn't appear to use a spec-compliant HTML parser as the first step of the processing. Any tool of this nature created this day and age really should.

sekyOP8y ago

you are right - I will rewrite it to use HTML Tidy lib or something like that and it will be compliant

hsivonen8y ago

Is HTML Tidy spec-compliant these days? It predates the spec. A cursory look at its GitHub repo doesn’t convince me of it now being spec-compliant. E.g. lexer.c doesn’t have indications of being based on the tokenization section of the spec.

redxblood8y ago

But.. why?

sctb8y ago

https://news.ycombinator.com/showhn.html

j / k navigate · click thread line to collapse

46 comments

ElijahLynn8y ago

The washer makes an XHR to /ajax/paste to do the 'washing'.

Seems like this could be done in JavaScript without an XHR, and not send your info to them.

However, https://www.htmlwasher.com/privacy/:

" In the event that the Operator is involved in a bankruptcy, merger, acquisition, reorganization or sale of assets, your personal data may be sold or transferred as part of that transaction."

strictnein8y ago

I'm guessing that's just a copy/paste from somewhere else.

devopsproject8y ago

> I'm guessing that's just a copy/paste from somewhere else.

goodgood8y ago

Pandoc can do this:

  cat tea-dance.html | pandoc --from=html --to=markdown | pandoc --from=markdown --to=html

I learned that from vimcasts.org: http://vimcasts.org/episodes/using-external-filter-commands-...

1 more reply

burnbabyburn8y ago

of this matter I really like https://github.com/mozilla/bleach .

is your project any different aside from the "service oriented" nature? (also I don't see any usage method, if not from the browser)

tyingq8y ago

Htmltidy is popular as well: https://github.com/htacg/tidy-html5

Though, on second glance, it doesn't do what HtmlWasher is doing here...stripping out classes, etc. It just cleans it up, unmatched tags and so forth.

sekyOP8y ago

yes - but this is not an online service - I wanted to build something handy you can use when you are stressed at the office need to reuse some other HTML doc quickly no time to install anything :-)

I also realized this tool/lib exists after doing Html Washer - I am considering to use their lib as an underlying lib for my project

jtraffic8y ago

Bleach also gives at least one use case for sanitizing html:

"Bleach is intended for sanitizing text from untrusted sources."

I think HtmlWasher should have something on the About tab.

sekyOP8y ago

there is link to my company where is my identity - so it is not anonymous :-) but yes - why not, will fill there something more

1 more reply

fredley8y ago

I prefer my files much cleaner: http://search.cpan.org/~dconway/Acme-Bleach-1.150/lib/Acme/B...

sekyOP8y ago

my project does not want to compete with any lib or commandline or other downloadable tool - yes - it's main purpose is convenience

thanks, I will consider using Bleach as an underlying lib / part of my service

tshadwell8y ago

from experience, I wouldn't recommend other than context-aware safe templating systems for html safety in this day and age.

to an even greater extent than templating systems, sanitization systems of this type need to be built by an expert and align perfectly with how browsers parse tags, which is no small feat.

to give more concrete examples, from a few minutes of testing:

<a href="javascript://%0Aalert`xss`">1</a> <- xss on click

<img src=javascript:alert(2)> <- XSS in Opera Mobile, Opera 10, early versions of IE

<img src="/logout"> <- csrf which affects nearly everything built without security knowhow

1 more reply

Continuous8y ago

This is brilliant!

I wrote an HTML file in Microsoft Word. Then uploaded that .html file which had 800 lines. HtmlWasher cleaned up all the file content, the endless meta tags, non sense IE style tags, etc.

sprremix8y ago

>I wrote an HTML file in Microsoft Word

Explain yourself

rbg2468y ago

My experience has been the company lawyer has written the contract in word, then it has to be exported to html and cleaned of all the cruft that gets saved with word docs.

dugluak8y ago

he probably means, saved word document as HTML.

sekyOP8y ago

thanks :-) I even didn't test it on Word's HTML

bluetidepro8y ago

jacurtis8y ago

Take a look at HTMLTidy which is an open source library that does exactly this, but without having to send it off as a request, making it faster, cheaper, and more secure.

sekyOP8y ago

paradite8y ago

The first one is a bad idea in terms of security (hijacking and xss), scalability and reliability (see what happened to MathJax CDN and npm left-pad).

donald1238y ago

the idea itself is not bad, but the actual implementation is challenging to be reliable and secure.

DvdGiessen8y ago

It has a tiny little webinterface a which remains online today on some underpowered server. Doesn't work well with anything except XHTML though. http://htmlcleaner.blackholestudios.nl/

tannhaeuser8y ago

If you're serious about HTML checking and cleanup consider using SGML and my (inofficial) HTML 5.1 DTD [1].

[1]: http://sgmljs.net/blog/blog1701.html

richardwhiuk8y ago

"Reduces a HTML document (or fragment) to basic HTML tags and attributes" - meaning what exactly? What counts as a basic attribute?

sekyOP8y ago

oneeyedpigeon8y ago

It also converts e.g. b to strong.

1 more reply

egfx8y ago

sekyOP8y ago

ok, yes why not make it customizable

kazinator8y ago

Here is one in C, with flex-generated lexing, for back-end use:

http://www.kylheku.com/cgit/hc/tree/

I used this for allowing HTML in a mailing list e-mails to be incorporated into the web archive. (The archiver is a modified version of Lurker.)

sekyOP8y ago

thanks :-)

kazinator8y ago

I've noticed that it lacks any sort of licensing information. If you use it, let me know; I will slap a BSD two-clause on it.

JustSomeNobody8y ago

What is the use case for this?

paradite8y ago

This is a huge pain in any web apps that deals with arbitrary pasting by users, especially in WYSIWYG rich text editors:

https://github.com/yabwe/medium-editor/blob/master/spec/past...

cbaleanu8y ago

One obvious use case is cleaning framework generated html from some random site.

sekyOP8y ago

e.g. when you are migrating an old website to a new one - that was when I missed such tool in the past

tyingq8y ago

I suppose it might also be useful for tools that generate very dirty html with tons of span elements. Word used to be that way, or ugh.."Frontpage".

hsivonen8y ago

This doesn't appear to use a spec-compliant HTML parser as the first step of the processing. Any tool of this nature created this day and age really should.

sekyOP8y ago

you are right - I will rewrite it to use HTML Tidy lib or something like that and it will be compliant

hsivonen8y ago

redxblood8y ago

But.. why?

sctb8y ago

https://news.ycombinator.com/showhn.html

j / k navigate · click thread line to collapse