How to safely escape JSON inside HTML SCRIPT elements

(sirre.al)

86 points · dmsnell · 9mo ago · 46 comments

maxbond 9mo ago

I would say avoid trying to understand arcane nuances better than the adversary. Assume they've simultaneously got more time on their hands and sat on the relevant standards committees. Adopt a strategy that's robust to having missed a small nuance in the standard or in the particular implementation by this or that browser. (That doesn't mean there isn't value in a blog post enumerating the edge cases, of course.)

Kaminsky described a very simple and nearly universal technique to deal with escaping/injection issues. Encode the embedded data as base64 and decode it on the client side. This projects arbitrary data into a fixed, known domain (generally `[a-zA-Z0-9+/]*`) which you can ensure is free from control characters. (You may need to use a particular variant to achieve this, e.g. for URLs the last two characters used are generally `-_` because both `+` and `/` are significant in that context.)

After decoding, you can pass it to JSON.parse().
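
A minimal sketch of that round trip, assuming a Node.js server (for `Buffer`) and a client that has `atob` and `TextDecoder`. The function names `encodeForEmbedding` and `decodeEmbedded` are made up for illustration, and the URL-safe alphabet without padding is one possible choice, not the only one:

```javascript
// Server side (Node.js): project arbitrary JSON into the base64url alphabet.
function encodeForEmbedding(value) {
  return Buffer.from(JSON.stringify(value), "utf8").toString("base64url");
}

// Client side: map back to standard base64, restore padding,
// decode the bytes as UTF-8, and only then parse the JSON.
function decodeEmbedded(b64url) {
  const b64 = b64url.replace(/-/g, "+").replace(/_/g, "/");
  const padded = b64 + "=".repeat((4 - (b64.length % 4)) % 4);
  const bytes = Uint8Array.from(atob(padded), (ch) => ch.charCodeAt(0));
  return JSON.parse(new TextDecoder().decode(bytes));
}

const payload = { html: "</script><!--<script>", text: "naïve ✓" };
const embedded = encodeForEmbedding(payload);
console.log(/^[A-Za-z0-9_-]*$/.test(embedded)); // true
console.log(decodeEmbedded(embedded).html === payload.html); // true
```

Going through `TextDecoder` rather than using the `atob` result directly matters as soon as the data contains non-ASCII text, because `atob` returns one character per byte.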

Dylan16807 9mo ago

To me, escaping < for web stuff is just as non-arcane and non-nuanced as base64.

And yeah use URL-safe base64 when you do use it. -_ with no padding.

maxbond 9mo ago

Yeah, that's fair, and I did forget about `=`/padding when I discussed base64. This instance is a solved problem with a simple solution, blessed by the standards body.

The advantage of the base64 technique is that it provides fewer degrees of freedom, and so is more robust to unforeseen vectors of attack. It's defensive programming. But it comes at a cost in memory/bandwidth (base64 output is roughly a third larger than its input).

comex 9mo ago

If you're evaluating JSON as JavaScript, you also need to make sure none of the objects have a key named "__proto__", or else you can end up with some strange results.

(This is related to the 'prototype pollution' attack, although searching that phrase will mostly give you information about the more-dangerous variant where two objects are being merged together with some JS library. If __proto__ is just part of a literal, the behavior is not as dangerous, but still surprising.)
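
A small sketch of the difference, using a made-up payload. `eval` stands in here for "the JSON text was pasted into a script and run as JavaScript"; it is only for demonstration:

```javascript
const text = '{"__proto__": {"isAdmin": true}, "name": "x"}';

// Evaluated as a JavaScript object literal, the "__proto__" key sets the
// object's prototype instead of creating an ordinary property.
const asLiteral = eval("(" + text + ")");

// Parsed as JSON, "__proto__" is just another own property.
const asJson = JSON.parse(text);

console.log(Object.keys(asLiteral)); // [ 'name' ]
console.log(asLiteral.isAdmin);      // true (inherited from the injected prototype)
console.log(Object.keys(asJson));    // [ '__proto__', 'name' ]
console.log(asJson.isAdmin);         // undefined
```

So the same text yields two differently shaped objects depending on which parser sees it, which is the surprising part.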

o11c 9mo ago

But note that there's also `<script type="application/json">` these days (usually only useful with `id=`) ... and `importmap` I guess.

themafia 9mo ago

It's even more general:

    type

    This attribute indicates the type of script represented. The value of this attribute will be one of the following:

    [...]

    Any other value
    
    The embedded content is treated as a data block, and won't be processed by the browser. Developers must use a valid MIME type that is not a JavaScript MIME type to denote data blocks. All of the other attributes will be ignored, including the src attribute.

That said, 'importmap' has specific functionality, as does 'speculationrules', although they operate similarly. My favorite is type="module" which competes with the higher level attribute nomodule="true". Anyways it looks like <script> has taken a lot of abuse over the years:

https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/...

masklinn 9mo ago

> My favorite is type="module" which competes with the higher level attribute nomodule="true". Anyways it looks like <script> has taken a lot of abuse over the years:

It "conflicts" in the same way noscript[1] and script "conflict", no? They're basically related features, but can't really be made exclusive because the mere act of trying to do so wouldn't work: as the link indicates, executing code in a !module browser reserves the type (requires a specific set of types), so you can't use that as a way to opt in !module browsers.

[1] another fun element with wonky parsing rules, besides

minitech 9mo ago

Yes, that option is the real “just do this”.

- escape `<` as `\u003c`

  <script id="my-json" type="application/json">{{ escaped_json }}</script>

  JSON.parse(document.getElementById('my-json').textContent)

No __proto__ issue, and no dynamic code at all, so you can use a strict CSP.
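
A rough server-side sketch of that recipe in JavaScript. The helper names `jsonForScriptElement` and `renderDataBlock` are hypothetical, and `id` is assumed to be a trusted, developer-chosen string rather than user input:

```javascript
// Serialize a value and keep every "<" out of the output, so the text can
// never contain "</script", "<!--" or "<script".
function jsonForScriptElement(value) {
  return JSON.stringify(value).replace(/</g, "\\u003c");
}

// Wrap the escaped JSON in a data block the browser will not execute.
function renderDataBlock(id, value) {
  return `<script id="${id}" type="application/json">${jsonForScriptElement(value)}</script>`;
}

console.log(renderDataBlock("my-json", { note: "</script><!--<script>" }));

// In the browser the data is then read back with:
//   JSON.parse(document.getElementById("my-json").textContent)
```

The escape is valid JSON, so `JSON.parse` turns `\u003c` back into `<` and the consumer sees the original string.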

jgalt212 9mo ago

Why does the author ignore this method? The Django docs show this as a best practice via the built-in `json_script` template filter.

pwdisswordfishz 9mo ago

Or you can use JSON.parse with a string literal on the client side, which is, surprisingly, more performant for large payloads than letting the JavaScript parser handle the same data as an object literal at compile time.

https://www.youtube.com/watch?v=ff4fgQxPaO0
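
One way to generate such an expression on the server, as a sketch only. `jsonParseExpression` is a hypothetical helper, and the extra `<` escape is an assumption added here so the result stays safe inside an inline script:

```javascript
function jsonParseExpression(value) {
  // The inner stringify produces JSON text; the outer one wraps that text
  // in a double-quoted string literal with quotes and backslashes escaped.
  const literal = JSON.stringify(JSON.stringify(value));
  // Keep "<" out of the script source. Inside a string literal the
  // JavaScript parser turns \u003c back into "<" before JSON.parse runs.
  return "JSON.parse(" + literal.replace(/</g, "\\u003c") + ")";
}

const source = jsonParseExpression({ a: "</script>", b: [1, "<!--"] });
console.log(source.includes("<")); // false
console.log(eval(source).a);       // </script>
```

The `eval` call only simulates the browser running the generated source; in a page the expression would simply be written into the script.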

pastureofplenty 9mo ago

This reminded me of how in the early 2000s I was taught to enclose the content of SCRIPT tags in HTML comments, e.g.

  <script language="JavaScript"><!--
  
  // script contents

  -->

  </script>

dmsnell OP 9mo ago

Discussing why parsing HTML SCRIPT elements is so complicated, the history of why it became the way it is, and how to safely and securely embed JSON content inside of a SCRIPT element today.

dmsnell OP 9mo ago

This was my first submission, and the above comment was what I added to the text box. It wasn’t clear to me what the purpose was, but it seemed like it would want an excerpt. I only discovered after submitting that it created this comment.

I guess people just generally don’t add those?

Still, to help me out, could someone clarify why this was down-voted? I don’t want to mess up again if I did, but I don't understand what that was.

shakna 9mo ago

> Leave url blank to submit a question for discussion. If there is no url, text will appear at the top of the thread. If there is a url, text is optional.

Most people will opt for text to be optional with a link - unless they're showing their own product (Show HN). Because there is an expectation that you will attempt to read an article, before conversing about it.

bawolff 9mo ago

I think it's just because as a comment it looks pretty random and somewhat off topic, since it's a summary of the article instead of an opinion on it.

I think most of the time people don't add a comment to submissions, but if they do it's more of the form: I found X interesting because of [insert non-obvious reason why X is interesting] or some additional non-obvious context needed.

In any case, I don't think there is any reason to worry too much. There was no ill intent and at the end of the day it's all just fake internet points.

flomo 9mo ago

I don't know, but I see early posts which look like AI bot summaries (presumably to collect karma). Probably not necessary for a link.

westurner 9mo ago

What about CDATA, which XML and XHTML support? HTML5 does not support CDATA (outside of foreign content such as SVG and MathML).

CDATA: https://en.wikipedia.org/wiki/CDATA

  <![CDATA[
  ]]>

This would work for XHTML but not HTML5 IIUC:

  <script>
  <![CDATA[
  x = {"<!--":""};
  ]]>

  <![CDATA[
  {{json.dumps(["<!--"])}}
  ]]>
  </script>

dullcrisp 9mo ago

Wait, can someone explain why a script tag inside a comment inside a script tag needs to be closed, while a script tag inside a script tag without a comment does not? They explained why comments inside script tags are a thing, but nothing further than that.

dmsnell OP 9mo ago

The other comment explains this, but I think it can also be viewed differently.

It’s helpful to recognize that the inner script tags are not actual script tags. Yes, once entering a script element, the browser switches parsers and wants to skip everything until a closing script tag appears. The STYLE element, TITLE, TEXTAREA, and a few others do this. Once they chop up the HTML like this they send the contents to the separate inner parser (in this case, the JS engine). SCRIPT is unique due to the legacy behavior^1.

HTML5 specifies these “inner” tags as transitions into escape modes. The entire goal is to allow JavaScript to contain the string “</script>” without it leaking to the outer parser. The early pattern of hiding inside an HTML comment is what determined the escaping mechanism rather than making some special syntax (which today does exist as noted in the post).

The opening script tag inside the comment is actually what triggers the escaping mode, and so it’s less an HTML tag and more some kind of pseudo JS syntax. The inner closing tag is therefore the escaped string value and simultaneously closes the escaped mode.

Consider the use of double quotes inside a string. We have to close the outer quote, but if the inner quote is escaped like `\"` then we don’t have to close it; it’s merely data and not syntax.

There is only one level of nesting, and eight opening tags would still be “closed” by the single closing tag.

^1: (edit) This is one reason HTML and XML (XHTML) are incompatible. The content of SCRIPT and STYLE elements is essentially just bytes. In XML they must be well-formed markup. XML parsers cannot parse HTML.
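
The escape modes described above can be approximated with a small scanner. This is a simplified sketch, not the tokenizer from the HTML standard: it tracks only three states, the name `findScriptEnd` is made up, and it takes the text that follows an opening script tag and reports where the element would end:

```javascript
// Returns the index of the "</script" that ends the element, or -1 if none does.
function findScriptEnd(content) {
  const closeTag = /<\/script[\t\n\f\r \/>]/iy;
  const openTag = /<script[\t\n\f\r \/>]/iy;
  const matchesAt = (re, i) => {
    re.lastIndex = i; // sticky regex: match only at position i
    return re.test(content);
  };

  let state = "data"; // "data" | "escaped" | "double-escaped"
  for (let i = 0; i < content.length; i++) {
    if (state !== "data" && content.startsWith("-->", i)) {
      state = "data"; // a comment close leaves either escaped state
      i += 2;
    } else if (state === "data" && content.startsWith("<!--", i)) {
      state = "escaped";
      i += 1; // keep the dashes available so "<!-->" exits right away
    } else if (state !== "double-escaped" && matchesAt(closeTag, i)) {
      return i; // a real end tag
    } else if (state === "escaped" && matchesAt(openTag, i)) {
      state = "double-escaped"; // the inner opening "tag" only flips the mode
    } else if (state === "double-escaped" && matchesAt(closeTag, i)) {
      state = "escaped"; // this closing tag is swallowed as data
    }
  }
  return -1;
}

// The intended end tag is swallowed, so the element stays open:
console.log(findScriptEnd('{"a":"<!--<script>"}</script>')); // -1
```

With `<` escaped as `\u003c` in the JSON, the scanner never leaves the plain "data" state and the first closing tag ends the element as intended.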

tannhaeuser 9mo ago

Whoever the idiot was who came up with piling inline CSS and JS into the already heavy SGML syntax of HTML should've reconsidered his career choices. It would've been perfectly adequate to require script and CSS to be put into external "resources" linked via src/href, especially since the spec proposals operated under the assumption there would be multiple script and styling languages going forward (like, hey, if we have one markup and styling language, why not have two or multiple?). When in fact the rules were quite simple: in SGML, text rendered to the reader goes into content; everything else, including formatting properties, goes into attributes.

The reason for introducing this inlining misfeature was probably the desire to avoid a network roundtrip, which would've later been made bogusly obsolete by Google's withdrawn HTTP/2 push spec, but also the bizarre idea that anyone except webdev bloggers would be editing HTML+CSS by hand.

To think there was a committee overseeing such blunders as "W3C recommendations" - actually, they screwed up again with CSS when they allowed unencoded inline data URLs such as those used for SVG backgrounds and the like. The alarm bells should've been ringing at the latest the moment they seriously considered storing markup within CSS, as with the abovementioned misfeature but also with the "content:" CSS property. You know, "recommendation" as in what W3C final-stage specs were called.

socalgal2 9mo ago

All of those are features, not bugs and I'm glad they are there. Uploading and dealing with 1 file is much nicer than dealing with several.

robocat 9mo ago

> It would've been perfectly adequate to require script and CSS to be put into external "resources" linked via src/href

Bullshit - Navigator and IE didn't have HTTP/2. I'm guessing you didn't use dialup, where your external CSS or JavaScript regularly failed to load. You didn't add extra dependencies because IE would only have two concurrent connections to load files.

It's easy to criticize past mistakes from your armchair: but I suggest you try and be a little more fair towards the people that made decisions especially when overall HTML has been a resounding success.

dullcrisp 9mo ago

Huh, it’s still confusing to me why they would have this double-escaping behavior only inside an HTML comment. Why not have it always behave one way or the other? At what point did the parsing behavior inside and outside HTML comments split and why?

dmsnell OP 9mo ago

At some point I think I read a more complete justification, but I can’t find it now. There is evidence that it came about as a byproduct of the interaction of the HTML parser and JS parsers in early browsers.

In this link we can see the expectation that the HTML comment surrounds a call to document.write() which inserts a new SCRIPT element. The tags are balanced.

https://stackoverflow.com/questions/236073/why-split-the-scr...

In this HTML 4.01 spec, it’s noted to use HTML comments to hide the script contents from render, which is where we start to get the notion of using these to hide markup from display.

https://www.w3.org/TR/html401/interact/scripts.html

Some drafts of the HTML standard attempted to escape differently and didn’t have the double escape state.

https://www.w3.org/TR/2016/WD-html52-20161206/semantics-scri...

My guess is that at some point the parsers looked for balanced tags, as evidenced in the note in the last link above, but then practical issues with improperly-generated scripts led to the idea that a single SCRIPT closing tag ends the escaping. Maybe people were attempting to concatenate script contents wrong and getting stacks of opening tags that were never closed. I don’t know, but I suppose it’s recorded somewhere.

Many things in today’s HTML arose because of widespread issues with how people generated the content. The same is true of XML and XHTML, by the way. Early XML mailing lists were full of people parsing XML with naive Perl regular expressions and suggesting that when someone wants to “fix” broken markup, they do it with string-based find-and-replace.

The main difference is that the HTML spec went in the direction of saying, _if we can agree how to handle these errors then in the face of some errors we can display some content_ and we can all do it in the same way. XML is worse in some regards: certain kinds of errors are still ambiguous and up to the parser to determine how to handle, whether they are non-recoverable or recoverable. For those non-recoverable, the presence of a single error destroys the entire document, like being refused a withdrawal at the bank because you didn’t cross a 7.

At least with HTML5, it’s agreed upon what to do when errors are present and all parsers can produce the same output document; XML parsers routinely handle malformed content and do so in different ways (though most at least provide or default to a strict mode). It’s better than the early web, but not that much better.

AdieuToLogic 9mo ago

From the post:

  Everything until the tag closer </script> is inside
  the script element.

And:

  In fact, script tags can contain any language (not 
  necessarily JavaScript) or even arbitrary data. In order to 
  support this behavior, script tags have special parsing 
  rules. For the most part, the browser accepts whatever is 
  inside the script tag until it finds the script close tag 
  </script>.

Note the sentence fragment "even arbitrary data." This explains the second part of your question as to why nested script tags without HTML comments do not require matching closing tags. Similar compatibility hacks exist for other closing tags (search for Chrome closing tags being optional for a fun ride down a rabbit hole).

As to:

  why a script tag inside a comment inside a script tag needs
  to be closed ...

Well, this again is due to maximizing backward compatibility in order to support broken browsers (thanks IE4, you bastard!). As the article states:

  When JavaScript was first introduced, many browsers did not 
  support it. So they would render the content of the script 
  tag – the JavaScript code itself. The normal way to get 
  around that was to put the script into a comment ...

HTH

dullcrisp 9mo ago

So did these older browsers also check for the presence of a comment before turning on double-escaping mode?

Or did they always have two levels of script tag escaping but that behavior only got preserved when inside an HTML comment?

No other JavaScript behavior is different inside an HTML comment, and I’m still missing the connection between the HTML comment and the embedded </script> not closing the tag besides that they were two things that older browsers might have done.

asddubs 9mo ago

Is there any specific reason to use JSON_UNESCAPED_SLASHES, or is it just because it becomes unnecessary? The article mentions it several times, but never explains why to use it.

chrismorgan 9mo ago

> Imagine if script tags required HTML escaping:

There are two situations in which it does.

① XML syntax, which is absolutely still a thing:

  data:application/xhtml+xml,<html xmlns="http://www.w3.org/1999/xhtml"><script>console.log( 1 &gt; 0 &amp;&amp; 0 &lt; 1 )</script></html>

② Inside an SVG <script> element in HTML syntax:

  data:text/html,<svg><script>console.log( 1 &gt; 0 &amp;&amp; 0 &lt; 1 )</script></svg>

TOGoS 9mo ago

> Not so fast, things are about to get messy

That ship sailed several paragraphs ago, when <script> got special treatment by the HTML parser. Too bad we couldn't all agree to parse <![CDATA[...]]> consistently, or, you know, just &-escape the text like we do /everywhere else/ in HTML.

forty 9mo ago

What's wrong with CDATA? Do you have concrete examples when that would not work?

TOGoS 9mo ago

As per the 'special parsing rules for script tags', browsers don't actually treat it the way you'd expect.

  <script>console.log("<![CDATA[Hello, this string content in a CDATA section!]]>");</script>

Results in this being output to the console:

  <![CDATA[Hello, this string content in a CDATA section!]]>

Browsers don't do what you intend if you wrap the whole script in CDATA, either. They treat the "<![CDATA[" sequence as literally part of the script! Which of course throws a syntax error.

I tend to use them anyway, as sort of a HTML/XHTML polyglot thing, because deep in my heart I still think HTML should be valid XML:

  <script>/* <![CDATA[ */
     // my script here, and you *still* need to be careful not
     // to include close-script or close-cdata sequences
  /* ]]> */</script>

In summary, the 'special parsing rules for script tags' add a great amount of complexity not just to the parsing code, but for anybody who has to emit markup, especially if different parsers disagree on what kind of escaping rules are active within a given section. Yes, the HTML5 spec codified the neurotypical "I would rather make you guess what I mean than just use the proper words to say it clearly" behavior, so at least browsers agree on it, but it's a mess and a pain to deal with because now you have to remember 1000 exceptions to what would have been simple rules.

pwdisswordfishz 9mo ago

> deep in my heart I still think HTML should be valid XML

Never was, never will be. Just write XHTML instead.

wdiamond 9mo ago

"<"+"/script>", it's a matter of parsing, not value
