After some investigating, I figured out how he obtained the data.
He was one of the first 100 users: he set one of his profile fields to an XSS Hunter payload and slept on it.
Two years later, a developer had a dump of data to test some things on. He loaded it into a SQL development tool on his Mac and, out of VS Code muscle memory, hit Command+Shift+P to open the command bar. In that SQL editor, the shortcut opened "Print Preview" instead, and the software rendered the current table view into a webview to ease printing. The XSS payload executed there, and the page content was sent to the researcher.
Escape input; you never know where it will be rendered.
You could as well have triggered a bug in some LaTeX engine that happened to be configured to allow arbitrary shell command execution.
Another strategy to defend against the issue you describe would be to not let developers access raw production data in the first place: always anonymize it first, or remove internet access from machines that touch production data. (How sensitive is the data in your users table? Could a developer's test script accidentally send emails to your live users?)
I’ve seen HTML used for rich text user input and it was an absolute mess: old data that wasn’t properly sanitized, the sanitization library itself getting outdated, someone putting potentially unsafe content in from another system, and so on. Meanwhile, people would sometimes bikeshed about breaking old style classes or how the data displayed across multiple systems instead of addressing just how serious the potential risks are.
Not all of the details here might be accurate, but honestly just use Markdown or something like that for user input, disallow HTML altogether and never use the raw input.
This idea of escaping input is worse than sanitizing input (which is what the article says not to do).
It's worth emphasizing that there's still plenty of scope for sensible input validation. If a field is a number, or one of a known list of items (US States for example) then obviously you should reject invalid data.
But... most web apps end up with some level of free-form text. A comment on Hacker News. A user's bio field. A feedback form.
Filtering those is where things go wrong. You don't want to accidentally create a web development discussion forum where people can't talk about HTML because it gets stripped out of their comments!
It's simple: HTML/XML/JavaScript/JSON/URLs are not plain text. You render them with the proper tools, and those tools happen not to be string concatenation. Rendering XML? Use a DOM, XSLT, etc. HTML: same story, use whatever templating engine you wish. JSON: build your own model and serialize it to JSON. SQL: prepared statements.
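A quick Python illustration of "render with the right tool, not concat" (the field and element names here are invented for the example): let the serializer own the quoting.

```python
import json
import xml.etree.ElementTree as ET

user_input = '"; DROP TABLE users; -- <script>alert(1)</script>'

# JSON: build a model and let the library serialize it.
payload = json.dumps({"comment": user_input})

# XML: build a tree and let the serializer handle escaping.
root = ET.Element("comment")
root.text = user_input
xml_out = ET.tostring(root, encoding="unicode")
```

Both outputs round-trip back to the original string, and neither contains a live `<script>` tag.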
https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
Say one was updating TeX to take advantage of this --- all the normal Unicode code points would then have catcodes set to make them appropriate to process as text (or a matching special character), while "processing-marked-up" characters would then be set up so that for example:
- \ (processing-marked-up variant) would work to begin TeX commands
- # (processing-marked-up variant) would work to enumerate macro command arguments
- & (processing-marked-up variant) would work to delineate table columns
&c.
and the matching "normal" characters, when encountered, would simply be typeset.
I know a person who uses two spaces between his first and last name, because his culture uses a second given name yet he has none. So, one space between the first given name and the second (nonexistent) given name, then another space between the second (nonexistent) given name and the family name.
You might think it is weird or unnecessary, but that is his identity. One could counter with far weirder or seemingly unnecessary things we accept regarding peoples' identity today.
That way people can still discuss XSS exploits without your sanitizer deleting a bunch of the text they entered on purpose.
Of course, once the product is in production you can swim in one direction but not fight the current going in the other. You can always move to escaping output, but retroactively sanitizing input is a giant pain in the ass.
But the problem comes in with your architecture, and whether you can discern data you generated from data the customers generated. Choose the wrong metaphors and you end up with partially formatted data existing halfway up your call stack instead of only at the view layer. And now you really are fucked.
Rails has a cheat for this. It sets a single boolean value on the strings which is meant to indicate the provenance of the string content. If it has already been escaped, it is not escaped again. If you are combining escaped and unescaped data, you have to write your own templating function that is responsible for escaping the unescaped data (or it can lie and create security vulnerabilities. "It's fine! This data will always be clean!" Oh foolish man.)
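A toy Python sketch of that tainting idea (this is not Rails' actual SafeBuffer, just the shape of the mechanism):

```python
from html import escape

class SafeString(str):
    """Marks a string as already HTML-escaped."""

def h(value) -> SafeString:
    # The single bit of provenance: a SafeString is never escaped again.
    if isinstance(value, SafeString):
        return value
    return SafeString(escape(str(value), quote=True))

user_bio = '<script>alert("xss")</script>'
once = h(user_bio)   # escaped
twice = h(once)      # no-op: already marked safe
```

The danger is exactly the one described above: anything that constructs a `SafeString` by hand is vouching for the content, and a lie there becomes a vulnerability.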
The better solution is to push the formatting down the stack. But this is a rule that Expediency is particularly fond of breaking.
I think the big problem with just escaping output is that you can accidentally change what the output will actually be in ways that your users can't predict. If I am explaining some HTML in a field and drop `<i>...</i>` in there today, your escaper may escape this properly. But next month when you decide to change your output to actually allow an `<i>` tag, then all of a sudden my comment looks like some italicized dots, which broke it.
Instead if you structure it, and store it in your datastore as a tree of nodes and tags, then next month when you want to support `<i>` you update the input reader to generate the new structure, and the output writer to handle the new tags. You preserve old values while sanitizing or escaping things properly for each platform.
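A sketch of that structured approach in Python (the node format and tag list are invented for the example): store a tree, and let each output writer decide which tags it supports.

```python
from html import escape

# Stored form: a (tag, children) tree; plain strings are text leaves.
doc = ("p", ["Some ", ("i", ["italic"]), " text with <tags> in it"])

SUPPORTED = {"p", "b"}   # next month: add "i" and old documents light up

def render(node) -> str:
    if isinstance(node, str):
        return escape(node)                     # text is always escaped
    tag, children = node
    body = "".join(render(child) for child in children)
    if tag in SUPPORTED:
        return f"<{tag}>{body}</{tag}>"
    return body                                 # unsupported tag: keep the text
```

Because the stored value is structure rather than escaped text, adding `"i"` to `SUPPORTED` changes how old comments render without rewriting any stored data.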
However, on the SQL side: you can use SQL host parameters (usually written as question marks) if the database system you use supports them, which avoids SQL injection problems.
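In Python's stdlib `sqlite3`, for instance, the `?` placeholder sends the value out-of-band; it is never spliced into the SQL text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

evil = "x'); DROP TABLE users; --"
# Host parameter: the driver passes the value separately from the SQL.
conn.execute("INSERT INTO users (name) VALUES (?)", (evil,))

row = conn.execute("SELECT name FROM users WHERE name = ?", (evil,)).fetchone()
```

The hostile string survives the round-trip as plain data, and the table is still there.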
If you deliberately allow the user to enter SQL queries, there are some better ways to handle this. If you use a database system that allows restricting SQL queries (like the authorizer callback and several other functions in SQLite which can be used for this purpose), then you might use that; I think it is better than trying to write a parser for the SQL code which is independent of the database, and expecting it to work. Another alternative is to allow the database (in CSV or SQLite format) to be downloaded (and if the MIME type is set correctly, then it is possible that a browser or browser extension will allow the user to do so using their own user interface if they wish to do so; otherwise, an external program can be used).
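A sketch of that authorizer approach using Python's stdlib `sqlite3` bindings (the set of allowed action codes here is a minimal read-only choice, not an exhaustive policy):

```python
import sqlite3

# Allow read-only operations; deny everything else (writes, schema changes...).
READ_ONLY_OPS = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ, sqlite3.SQLITE_FUNCTION}

def authorizer(action, arg1, arg2, db_name, trigger):
    return sqlite3.SQLITE_OK if action in READ_ONLY_OPS else sqlite3.SQLITE_DENY

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x)")
conn.execute("INSERT INTO t VALUES (1)")
conn.set_authorizer(authorizer)   # from here on, user-supplied SQL is fenced in

conn.execute("SELECT x FROM t")   # fine
try:
    conn.execute("DROP TABLE t")  # denied at prepare time
except sqlite3.DatabaseError as e:
    print("blocked:", e)
```

The key property is that the check happens inside the database engine while it compiles the statement, so there is no separate SQL parser to keep in sync.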
Some of the other problems mentioned, and the complexity involved, are due to problems with the messy complexity of HTML and WWW, in general.
For validation, you should of course validate on the back end, and you may do so on the front end too (especially if the data needed for validation is small and intended to be publicly known). However, if JavaScript is disabled, the form should still submit and the server should reply with an error message if validation fails; if JavaScript is enabled, it can catch the error before sending anything to the server. That way it works either way.
http://www.ranum.com/security/computer_security/editorials/d...
The idea is that you don't want to store text in your database in a form that is safe when rendered as HTML, JS, JSON, SQL, etc. That would be "enumerating badness". Instead, at the moment you render the text as HTML, you encode the text into an HTML-friendly form (via escape characters). If you want to embed the text into a SQL query, have your SQL library add sql-specific escape characters where needed in the text. Same for your JSON library, and so on.
It's the responsibility of an encoding library to encode and decode text in the appropriate way. A JSON or SQL library should be able to encode and then decode any arbitrary unicode string, even one that contains quote characters. Just as any arbitrary unicode string should be usable on a webpage, in a text field, without being able to interact with the rest of the page in any way.
Most libraries already do this if used properly. SQL libraries (using parameters) will escape text where needed. React will embed text in an html-safe way. JSON libraries escape quotes in strings. And so on.
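In Python, for instance, the stdlib already gives you both round-trips:

```python
import html
import json

raw = 'O\'Hara says "1 < 2 && 2 > 1"'

# HTML: store the raw text, encode at the moment of rendering.
rendered = html.escape(raw, quote=True)
assert html.unescape(rendered) == raw

# JSON: the library round-trips any unicode string, quotes included.
assert json.loads(json.dumps(raw)) == raw
```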
since i had a poor reputation (which i take full responsibility for), my concerns would always be dismissed as "elitist ivory tower thinking" or "toxic interactions", and rebuffed with comebacks like "the database server just handles this", etc.
if your comment is anything but solid black for the duration of folks reading it, it's just more evidence that the vast majority of developers are just shit at their jobs haha
Defining what is valid for an input field and rejecting everything else helps the user catch mistakes. It's not just for security.
Some kinds of information are tricky to sanitize. Names, addresses and such. Especially in an application or site that has global users. Do the wrong thing and you end up aggravating users, who are not able to input something legitimate.
But maybe don't allow, say, a date field to be "la la la" or even "December 47, 2023".
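Strict parsing makes that easy; a minimal Python sketch (the format string is just one possible choice):

```python
from datetime import datetime

def parse_date(text: str) -> datetime:
    # Raises ValueError for "la la la" and "December 47, 2023" alike.
    return datetime.strptime(text, "%B %d, %Y")

print(parse_date("December 4, 2023").date())  # prints 2023-12-04
```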
Limiting attributes to ["href", "src"] and tags to ["p", "br", "h1", "ul", "ol", "li", "span", "div", "img"] gets you remarkably close to rendering the safe bits of HTML - add to that list upon request.
If you want to take it further, use an `iframe srcdoc=""` with sandbox attributes set.
You need to clean that up as well to avoid e.g. javascript: links, and then there are more issues with SVG if you allow media uploads.
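A stdlib-only Python sketch of that allowlist (the tag and attribute lists come from the comment above; `SAFE_PREFIXES` is my addition to handle the `javascript:` problem). Real applications should prefer a maintained sanitizer like DOMPurify; hand-rolled filters are exactly what evasion cheat sheets defeat.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "br", "h1", "ul", "ol", "li", "span", "div", "img"}
ALLOWED_ATTRS = {"href", "src"}
SAFE_PREFIXES = ("http:", "https:", "/")  # blocks javascript: and friends

class AllowlistSanitizer(HTMLParser):
    """Keeps allowlisted tags/attributes, escapes everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself; its text content is still escaped
        parts = [tag]
        for name, value in attrs:
            if (name in ALLOWED_ATTRS and value
                    and value.lstrip().lower().startswith(SAFE_PREFIXES)):
                parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append(f"<{' '.join(parts)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))

def sanitize(dirty: str) -> str:
    p = AllowlistSanitizer()
    p.feed(dirty)
    p.close()
    return "".join(p.out)
```

Note the deliberate strictness: relative URLs like `a.png` are dropped because they don't match a safe prefix; loosening that is a policy decision, not a code tweak.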
Then you need to be very sure you’re using a proper html5 parser and your rendering is completely canonicalized or you open yourself up to filter evasions (https://cheatsheetseries.owasp.org/cheatsheets/XSS_Filter_Ev...)
And of course I assume that’s what you meant but you should not add upon request, you should evaluate the addition.
You can only fit so many characters in an exploit, often due to max field lengths, unless you can load an external script. Disabling the loading of unknown external scripts with CSP significantly reduces possible attacks, including XSS attacks, because the attacker simply doesn't have the space.
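For example, a restrictive policy along these lines (illustrative, not a drop-in; adjust the sources for your app) blocks inline and third-party scripts entirely:

```http
Content-Security-Policy: default-src 'self'; script-src 'self'; object-src 'none'; base-uri 'none'
```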
Do client-side rendering: send the HTML shell, then query the backend for the content and assign it with something like p.textContent = ... It's safe.
It's pretty much the same as what a prepared statement does in SQL: send data and code through different channels.
There are libraries in almost every language to do this for you. A quick google search found these:
JS: https://github.com/parshap/html-escape
PHP: https://www.php.net/manual/en/function.htmlentities.php
And there are many more.
It's trickier if the goal is "I want to allow <strong> and <em> tags in the string to be rendered as bold and italic, but I don't want scripts to execute". It is possible, with things like DOMPurify, but ideally you'd try to avoid this if at all possible.
1) You get your input data into the form that is meaningful in the database by validating, sanitising and transforming it, because you know what form that data should be in, and that's the only form that belongs in your database. Data isn't just output; sometimes it is processed, queried, joined upon.
2) You correctly format/transform it for your output formats. Once you know what the normalised form in the database is, you likely have a simpler job transforming it for output.
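That split can be sketched in a few lines of Python (the `Comment` type and its fields are invented for the example): parse and normalise once on input, then give each output format its own writer.

```python
import json
from dataclasses import dataclass
from html import escape

@dataclass
class Comment:
    author: str
    body: str  # canonical, markup-free text

def parse_comment(raw_author: str, raw_body: str) -> Comment:
    # Input side: validate and normalise once; store only the canonical form.
    author = raw_author.strip()
    if not author:
        raise ValueError("author is required")
    return Comment(author=author, body=raw_body.strip())

def to_html(c: Comment) -> str:
    # Output side: escaping belongs to this writer, not to the stored data.
    return f"<p><b>{escape(c.author)}</b>: {escape(c.body)}</p>"

def to_json(c: Comment) -> str:
    return json.dumps({"author": c.author, "body": c.body})
```

The stored value stays format-neutral, so adding a third output format later means adding one writer, not migrating the data.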
It's not just lazy to suggest there's a choice here, it's wrong.
If you've got specific structure requirements for the data you store, parse it into that structure.
I've seen too many forum developers spend far too much time after the fact dealing with their decision to "just use TinyMCE" ==> Oh hey, you're a server-side HTML parsing expert now anyway; wasn't that what you were trying to avoid?
Escaping/sanitizing on output takes extra cycles/energy that can be spared if the same process is done once upon submission.
Think more sustainable.
It surprises me that this seems unfamiliar these days.
Of course you'd need to measure this for your application, but without a performance measurement maybe it's better to default to security.
This post has a narrow view of attackers.