After some investigating, I figured out how he obtained the data.
He was one of the first 100 users: he set one of his profile fields to an XSS Hunter payload and slept on it.
Two years later, a developer had a dump of data to test some things on. He loaded it into a SQL development tool on his Mac and, out of VS Code muscle memory, hit Command+Shift+P to open the command bar. In that SQL editor, the shortcut opened "Print Preview" instead, and the software rendered the current table view into a webview to ease printing. The XSS payload executed there, and the page content was sent to the researcher.
Escape input; you never know where it will be rendered.
You could as well have triggered a bug in some LaTeX engine that happened to be configured to allow arbitrary shell command execution.
Another strategy to defend against the issue you describe would be to not let developers access raw production data in the first place: always anonymize it first, or remove internet access from machines that touch production data. (How sensitive is the data in your users table? Could a developer's test script accidentally send emails to your live users?)
I’ve seen HTML used for rich text user input and it was an absolute mess: old data that wasn’t properly sanitized, the sanitization library itself getting outdated, someone putting potentially unsafe content in from another system, and so on. Meanwhile, people would sometimes bikeshed about breaking old style classes or how the data displayed across multiple systems instead of addressing just how serious the potential risks are.
Not all of the details here might be accurate, but honestly just use Markdown or something like that for user input, disallow HTML altogether and never use the raw input.
This idea of escaping input is worse than sanitizing input (which is what the article says not to do).
It's worth emphasizing that there's still plenty of scope for sensible input validation. If a field is a number, or one of a known list of items (US States for example) then obviously you should reject invalid data.
But... most web apps end up with some level of free-form text. A comment on Hacker News. A user's bio field. A feedback form.
Filtering those is where things go wrong. You don't want to accidentally create a web development discussion forum where people can't talk about HTML because it gets stripped out of their comments!
It's simple: HTML/XML/JavaScript/JSON/URLs are not plain text. You render them with the proper tools, and those tools happen not to be string concatenation. Rendering XML? Use a DOM, XSLT, etc. HTML: same story, use whatever templating engine you wish. JSON: build your own model and serialize it to JSON. SQL: prepared statements.
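A quick Python illustration of "render with the right tool, not concat" (the field and element names here are invented for the example): let the serializer own the quoting.

```python
import json
import xml.etree.ElementTree as ET

user_input = '"; DROP TABLE users; -- <script>alert(1)</script>'

# JSON: build a model and let the library serialize it.
payload = json.dumps({"comment": user_input})

# XML: build a tree and let the serializer handle escaping.
root = ET.Element("comment")
root.text = user_input
xml_out = ET.tostring(root, encoding="unicode")
```

Both outputs round-trip back to the original string, and neither contains a live `<script>` tag.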
https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
Say one was updating TeX to take advantage of this --- all the normal Unicode code points would then have catcodes set to make them appropriate to process as text (or a matching special character), while "processing-marked-up" characters would then be set up so that for example:
- \ (processing-marked-up variant) would work to begin TeX commands
- # (processing-marked-up variant) would work to enumerate macro command arguments
- & (processing-marked-up variant) would work to delineate table columns
&c.
and the matching "normal" characters, when encountered, would simply be typeset.
I know a person who uses two spaces between his first and last name, because his culture uses a second given name yet he has none. So, one space between the first given name and the second (nonexistent) given name, then another space between the second (nonexistent) given name and the family name.
You might think it is weird or unnecessary, but that is his identity. One could counter with far weirder or seemingly unnecessary things we accept regarding peoples' identity today.
That way people can still discuss XSS exploits without your sanitizer deleting a bunch of the text they entered on purpose.
Of course, once the product is in production you can swim in one direction but not fight the current going in the other. You can always move to escaping output, but retroactively sanitizing input is a giant pain in the ass.
But the problem comes in with your architecture, and whether you can discern data you generated from data the customers generated. Choose the wrong metaphors and you end up with partially formatted data existing halfway up your call stack instead of only at the view layer. And now you really are fucked.
Rails has a cheat for this. It sets a single boolean value on the strings which is meant to indicate the provenance of the string content. If it has already been escaped, it is not escaped again. If you are combining escaped and unescaped data, you have to write your own templating function that is responsible for escaping the unescaped data (or it can lie and create security vulnerabilities. "It's fine! This data will always be clean!" Oh foolish man.)
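A toy Python sketch of that tainting idea (this is not Rails' actual SafeBuffer, just the shape of the mechanism):

```python
from html import escape

class SafeString(str):
    """Marks a string as already HTML-escaped."""

def h(value) -> SafeString:
    # The single bit of provenance: a SafeString is never escaped again.
    if isinstance(value, SafeString):
        return value
    return SafeString(escape(str(value), quote=True))

user_bio = '<script>alert("xss")</script>'
once = h(user_bio)   # escaped
twice = h(once)      # no-op: already marked safe
```

The danger is exactly the one described above: anything that constructs a `SafeString` by hand is vouching for the content, and a lie there becomes a vulnerability.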
The better solution is to push the formatting down the stack. But this is a rule that Expediency is particularly fond of breaking.
I think the big problem with just escaping output is that you can accidentally change what the output will actually be in ways that your users can't predict. If I am explaining some HTML in a field and drop `<i>...</i>` in there today, your escaper may escape this properly. But next month when you decide to change your output to actually allow an `<i>` tag, then all of a sudden my comment looks like some italicized dots, which broke it.
Instead if you structure it, and store it in your datastore as a tree of nodes and tags, then next month when you want to support `<i>` you update the input reader to generate the new structure, and the output writer to handle the new tags. You preserve old values while sanitizing or escaping things properly for each platform.
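A sketch of that structured approach in Python (the node format and tag list are invented for the example): store a tree, and let each output writer decide which tags it supports.

```python
from html import escape

# Stored form: a (tag, children) tree; plain strings are text leaves.
doc = ("p", ["Some ", ("i", ["italic"]), " text with <tags> in it"])

SUPPORTED = {"p", "b"}   # next month: add "i" and old documents light up

def render(node) -> str:
    if isinstance(node, str):
        return escape(node)                     # text is always escaped
    tag, children = node
    body = "".join(render(child) for child in children)
    if tag in SUPPORTED:
        return f"<{tag}>{body}</{tag}>"
    return body                                 # unsupported tag: keep the text
```

Because the stored value is structure rather than escaped text, adding `"i"` to `SUPPORTED` changes how old comments render without rewriting any stored data.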
However, on the SQL side: you can use SQL host parameters (usually written as question marks) if the database system you use supports them, which avoids SQL injection problems.
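In Python's stdlib `sqlite3`, for instance, the `?` placeholder sends the value out-of-band; it is never spliced into the SQL text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

evil = "x'); DROP TABLE users; --"
# Host parameter: the driver passes the value separately from the SQL.
conn.execute("INSERT INTO users (name) VALUES (?)", (evil,))

row = conn.execute("SELECT name FROM users WHERE name = ?", (evil,)).fetchone()
```

The hostile string survives the round-trip as plain data, and the table is still there.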
If you deliberately allow the user to enter SQL queries, there are some better ways to handle this. If you use a database system that allows restricting SQL queries (like the authorizer callback and several other functions in SQLite which can be used for this purpose), then you might use that; I think it is better than trying to write a parser for the SQL code which is independent of the database, and expecting it to work. Another alternative is to allow the database (in CSV or SQLite format) to be downloaded (and if the MIME type is set correctly, then it is possible that a browser or browser extension will allow the user to do so using their own user interface if they wish to do so; otherwise, an external program can be used).
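A sketch of that authorizer approach using Python's stdlib `sqlite3` bindings (the set of allowed action codes here is a minimal read-only choice, not an exhaustive policy):

```python
import sqlite3

# Allow read-only operations; deny everything else (writes, schema changes...).
READ_ONLY_OPS = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ, sqlite3.SQLITE_FUNCTION}

def authorizer(action, arg1, arg2, db_name, trigger):
    return sqlite3.SQLITE_OK if action in READ_ONLY_OPS else sqlite3.SQLITE_DENY

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x)")
conn.execute("INSERT INTO t VALUES (1)")
conn.set_authorizer(authorizer)   # from here on, user-supplied SQL is fenced in

conn.execute("SELECT x FROM t")   # fine
try:
    conn.execute("DROP TABLE t")  # denied at prepare time
except sqlite3.DatabaseError as e:
    print("blocked:", e)
```

The key property is that the check happens inside the database engine while it compiles the statement, so there is no separate SQL parser to keep in sync.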
Some of the other problems mentioned, and the complexity involved, are due to problems with the messy complexity of HTML and WWW, in general.
For validation, you should of course validate on the back end, and you may do so on the front end too (especially if the data needed for validation is small and intended to be publicly known). However, if JavaScript is disabled, the form should still submit and the server should reply with an error message if validation fails; if JavaScript is enabled, it can catch the error before sending anything to the server. That way it works either way.
http://www.ranum.com/security/computer_security/editorials/d...
The idea is that you don't want to store text in your database in a form that is safe when rendered as HTML, JS, JSON, SQL, etc. That would be "enumerating badness". Instead, at the moment you render the text as HTML, you encode the text into an HTML-friendly form (via escape characters). If you want to embed the text into a SQL query, have your SQL library add sql-specific escape characters where needed in the text. Same for your JSON library, and so on.
It's the responsibility of an encoding library to encode and decode text in the appropriate way. A JSON or SQL library should be able to encode and then decode any arbitrary unicode string, even one that contains quote characters. Just as any arbitrary unicode string should be usable on a webpage, in a text field, without being able to interact with the rest of the page in any way.
Most libraries already do this if used properly. SQL libraries (using parameters) will escape text where needed. React will embed text in an html-safe way. JSON libraries escape quotes in strings. And so on.
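In Python, for instance, the stdlib already gives you both round-trips:

```python
import html
import json

raw = 'O\'Hara says "1 < 2 && 2 > 1"'

# HTML: store the raw text, encode at the moment of rendering.
rendered = html.escape(raw, quote=True)
assert html.unescape(rendered) == raw

# JSON: the library round-trips any unicode string, quotes included.
assert json.loads(json.dumps(raw)) == raw
```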
since i had a poor reputation (which i take full responsibility for), my concerns would always be dismissed as "elitist ivory tower thinking" or "toxic interactions", and rebuffed with comebacks like "the database server just handles this", etc.
if your comment is anything but solid black for the duration of folks reading it, it's just more evidence that the vast majority of developers are just shit at their jobs haha
Defining what is valid for an input field and rejecting everything else helps the user catch mistakes. It's not just for security.
Some kinds of information are tricky to sanitize. Names, addresses and such. Especially in an application or site that has global users. Do the wrong thing and you end up aggravating users, who are not able to input something legitimate.
But maybe don't allow, say, a date field to be "la la la" or even "December 47, 2023".
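Strict parsing makes that easy; a minimal Python sketch (the format string is just one possible choice):

```python
from datetime import datetime

def parse_date(text: str) -> datetime:
    # Raises ValueError for "la la la" and "December 47, 2023" alike.
    return datetime.strptime(text, "%B %d, %Y")

print(parse_date("December 4, 2023").date())  # prints 2023-12-04
```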
Limiting attributes to ["href", "src"] and tags to ["p", "br", "h1", "ul", "ol", "li", "span", "div", "img"] gets you remarkably close to rendering the safe bits of HTML - add to that list upon request.
If you want to take it further, use an `iframe srcdoc=""` with sandbox attributes set.
You need to clean that up as well to avoid e.g. javascript: links, and then there are more issues with SVG if you allow media uploads.
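A stdlib-only Python sketch of that allowlist (the tag and attribute lists come from the comment above; `SAFE_PREFIXES` is my addition to handle the `javascript:` problem). Real applications should prefer a maintained sanitizer like DOMPurify; hand-rolled filters are exactly what evasion cheat sheets defeat.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "br", "h1", "ul", "ol", "li", "span", "div", "img"}
ALLOWED_ATTRS = {"href", "src"}
SAFE_PREFIXES = ("http:", "https:", "/")  # blocks javascript: and friends

class AllowlistSanitizer(HTMLParser):
    """Keeps allowlisted tags/attributes, escapes everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself; its text content is still escaped
        parts = [tag]
        for name, value in attrs:
            if (name in ALLOWED_ATTRS and value
                    and value.lstrip().lower().startswith(SAFE_PREFIXES)):
                parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append(f"<{' '.join(parts)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))

def sanitize(dirty: str) -> str:
    p = AllowlistSanitizer()
    p.feed(dirty)
    p.close()
    return "".join(p.out)
```

Note the deliberate strictness: relative URLs like `a.png` are dropped because they don't match a safe prefix; loosening that is a policy decision, not a code tweak.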
Then you need to be very sure you’re using a proper html5 parser and your rendering is completely canonicalized or you open yourself up to filter evasions (https://cheatsheetseries.owasp.org/cheatsheets/XSS_Filter_Ev...)
And of course I assume that’s what you meant but you should not add upon request, you should evaluate the addition.
You can only fit so many characters in an exploit, often due to max field lengths, unless you can load an external script. Disabling the loading of unknown external scripts with CSP significantly reduces possible attacks, including XSS attacks, because the attacker simply doesn't have the space.
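For example, a restrictive policy along these lines (illustrative, not a drop-in; adjust the sources for your app) blocks inline and third-party scripts entirely:

```http
Content-Security-Policy: default-src 'self'; script-src 'self'; object-src 'none'; base-uri 'none'
```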
Do client-side rendering: send the HTML shell, then query the backend for the content and assign it with something like p.textContent = ... It's safe.
It's pretty much the same as what a prepared statement does in SQL: send data and code through different channels.
There are libraries in almost every language to do this for you. A quick google search found these:
JS: https://github.com/parshap/html-escape
PHP: https://www.php.net/manual/en/function.htmlentities.php
And there are many more.
It's trickier if the goal is "I want to allow <strong> and <em> tags in the string to be rendered as bold and italic, but I don't want scripts to execute". It is possible, with things like DOMPurify, but ideally you'd try to avoid this if at all possible.
1) You get your input data into the form that is meaningful in the database by validating, sanitising and transforming it, because you know what form that data should be in, and that's the only form that belongs in your database. Data isn't just output; sometimes it is processed, queried, joined upon.
2) You correctly format/transform it for your output formats. Once you know what the normalised form in the database is, you likely have a simpler job transforming it for output.
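That split can be sketched in a few lines of Python (the `Comment` type and its fields are invented for the example): parse and normalise once on input, then give each output format its own writer.

```python
import json
from dataclasses import dataclass
from html import escape

@dataclass
class Comment:
    author: str
    body: str  # canonical, markup-free text

def parse_comment(raw_author: str, raw_body: str) -> Comment:
    # Input side: validate and normalise once; store only the canonical form.
    author = raw_author.strip()
    if not author:
        raise ValueError("author is required")
    return Comment(author=author, body=raw_body.strip())

def to_html(c: Comment) -> str:
    # Output side: escaping belongs to this writer, not to the stored data.
    return f"<p><b>{escape(c.author)}</b>: {escape(c.body)}</p>"

def to_json(c: Comment) -> str:
    return json.dumps({"author": c.author, "body": c.body})
```

The stored value stays format-neutral, so adding a third output format later means adding one writer, not migrating the data.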
It's not just lazy to suggest there's a choice here, it's wrong.
If you've got specific structure requirements for the data you store, parse it into that structure.
I've seen too many forum developers spend far too much time after the fact dealing with their decision to "just use TinyMCE" ==> Oh hey, you're a server-side HTML parsing expert now anyway; wasn't that what you were trying to avoid?
Escaping/sanitizing on output takes extra cycles/energy that can be spared if the same process is done once upon submission.
Think more sustainable.
It surprises me that this seems unfamiliar these days.
Of course you'd need to measure this for your application, but without a performance measurement maybe it's better to default to security.
This post has a narrow view of attackers.