Sometimes it's obvious what's wrong with malformed data you receive. A classic would be encoding errors.
But as soon as you start supporting broken components and APIs, you will never be able to unsupport them.
A prime example is HTML. Granted, in the beginning it was meant to be written by humans, but that quickly stopped being a major obstacle, and even a human can produce valid HTML with the help of a syntax checker.
I struggled with this very issue but I ultimately ended up attempting to be robust against out-of-spec feeds. A super strict feed parsing library is less useful than one that can successfully parse certain classes of broken feeds.
It is a fine line to walk -- I won't add a great deal of complexity to support overly broken feeds, but if it is relatively simple to support certain types of common mistakes I'll do it.
So it was pretty robust, but yeah, you have to draw the line somewhere.
I think, as long as it doesn't compromise the design of your program (for example, parsing rfc822 dates with localized weekdays) it's fine to be a bit lenient in what you accept.
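Since the weekday in an RFC 822 date is redundant (the day, month, and year already determine it), a lenient parser can simply drop whatever precedes the comma before parsing. A minimal Python sketch of that idea; the German "Mi," input below is a made-up example, and localized month names would still need separate handling:

```python
from email.utils import parsedate_to_datetime

def parse_lenient_rfc822(value):
    """Parse an RFC 822 date, tolerating a localized weekday name.

    The weekday token is redundant, so drop everything up to the
    first comma and parse the remainder normally.
    """
    if "," in value:
        _, _, rest = value.partition(",")
        value = rest.strip()
    return parsedate_to_datetime(value)

# e.g. a feed using a German weekday abbreviation:
dt = parse_lenient_rfc822("Mi, 01 Oct 2014 12:00:00 +0200")
print(dt.isoformat())  # 2014-10-01T12:00:00+02:00
```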
Anything that goes beyond that needs a very good reason.
Apropos escaping: people are most likely to do this if they never wrote PHP websites as kids and never went through `urlencode` and `mysql_escape_string` hell.
I'm on my phone now, but later today I'll test to see if it would have worked for the author. It's good for cleaning up JSON, but I would be wary of putting it (or anything like it) anywhere near production.
Maybe I'm old fashioned - I'm all for flexible APIs and all, but to the point above: if a customer sends rotten stuff, it should just be rejected with a 4xx code.
At minimum, check to make sure it is proper JSON... I know that a lot of stream processors will put it into a queue, return 200 right away, and then process in the background, but I don't think that ensuring it is at least valid JSON and doesn't have a content size of more than X could be too intensive.
In this case, if the data was already accepted and you've got no choice but to deal with it, you've gotta do what you got to do. I've been there, and it ain't fun cleaning up a 900 GB JSON file.
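The cheap up-front checks suggested above (a size cap plus a well-formedness pass) fit in a few lines. A sketch with the standard library; the byte limit and the exact status codes are illustrative assumptions, not anyone's production values:

```python
import json

MAX_BYTES = 1_000_000  # assumed size cap

def accept(payload: bytes):
    """Sanity-check a payload before queueing it for background work.

    Returns an (http_status, message) pair: reject oversized or
    malformed bodies with a 4xx instead of blindly answering 200.
    """
    if len(payload) > MAX_BYTES:
        return 413, "payload too large"
    try:
        json.loads(payload)
    except ValueError:
        return 400, "body is not valid JSON"
    return 202, "accepted for background processing"

print(accept(b'{"ok": true}'))  # (202, 'accepted for background processing')
print(accept(b'{"ok": tru'))    # (400, 'body is not valid JSON')
```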
In some fields, that's not an option. I do NMS engineering. If I need to set up monitoring for something, and the only source of the diagnostics I need is an endpoint that returns malformed JSON, I can't just throw my hands up and say "the data's in a shit format, I won't touch it". I'll have no choice but to get my hands dirty and parse out whatever I can because our systems need to be monitored.
I'm lucky in that the only times I had to deal with malformed JSON at this job, I was able to fix the program that was generating it because it was maintained by my team (the problem was that it was snarfing data from a database and sending it out as JSON but forgetting to escape tab characters, and unescaped tabs aren't allowed in JSON), but my luck's gonna run out some day.
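The tab bug described here is easy to reproduce: JSON built by string concatenation passes literal control characters straight through, while a real encoder escapes them. A small Python illustration (the field name is made up):

```python
import json

row = "col1\tcol2"  # a database value containing a literal tab

# Hand-rolled concatenation, the kind of bug described above:
broken = '{"value": "' + row + '"}'
try:
    json.loads(broken)
except ValueError as e:
    # literal control characters are not allowed inside JSON strings
    print("invalid:", e)

# A real encoder escapes the tab as \t for you:
fixed = json.dumps({"value": row})
print(fixed)
print(json.loads(fixed)["value"] == row)  # True
```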
The law part was supposed to be a few lines of text. Except when they don't know which article to give. In that case they provide the full law text, including scanned PDFs, base64-encoded. All 2 GB of it. Basically you have something with the meaning null, encoded in a huge string.
Now the creation of this file was handed to a third party, who don't bother finding out the relevant law and paste the 2 GB blob into every article they modify, just to be sure. At this point we have 500 000 articles in that file. We get a new one every month.
Not fun at all. But it is modern, at least, in the past it was a cobol flat file.
It would be interesting to ask the sender how .
> If the file is small enough or the data regular enough, you could fix it by hand with some search & replace.
Of course.
> But the file I had was gigabytes in size and most of it looked fine.
I suspect a faulty JSON library; it's important to figure out how the file was generated so that an issue can be opened and the bug fixed.
Why? By whom? Did you complain loudly?
If you have to correct for bit flips when you begin to read or parse data, it's too late.
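One way to catch corruption before it ever reaches the parser is to attach a checksum when the data is written. A minimal sketch using CRC-32; the 4-byte-prefix framing here is an illustration, not any established wire format:

```python
import json
import zlib

def write_record(obj):
    """Serialize obj and prepend a CRC-32 of the body, so corruption
    is detected at read time rather than 'corrected' during parsing."""
    body = json.dumps(obj).encode()
    return zlib.crc32(body).to_bytes(4, "big") + body

def read_record(blob):
    """Verify the checksum before attempting to parse the body."""
    crc, body = int.from_bytes(blob[:4], "big"), blob[4:]
    if zlib.crc32(body) != crc:
        raise ValueError("checksum mismatch: data corrupted in transit or storage")
    return json.loads(body)

rec = write_record({"a": 1})
print(read_record(rec))                      # {'a': 1}
corrupted = rec[:-1] + bytes([rec[-1] ^ 1])  # flip one bit
try:
    read_record(corrupted)
except ValueError as e:
    print(e)
```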
In other words: find multiple candidate edits that result in a valid JSON object and choose the one with the shortest distance.
Then, of course, log the changes.
I do something similar with CSVs. MSSQL is notorious for spitting out junk inside CSV files.
Also, I can guess how it was created: the code is probably in C, and a rare edge case is overwriting memory before it hits the file.
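The single-edit repair idea sketched above (try small fixes, keep one that parses, log the change) might look like this in Python; the set of candidate characters and the "first candidate wins" policy are simplifying assumptions, not production logic:

```python
import json

def repair_json(text):
    """Naive single-edit JSON repair.

    If parsing fails, try deleting the character at the reported error
    position, or inserting a likely missing character there; return the
    first candidate that parses, together with a log of what changed.
    """
    try:
        return json.loads(text), None
    except json.JSONDecodeError as e:
        pos = e.pos
    candidates = [(text[:pos] + text[pos + 1:],
                   f"deleted {text[pos:pos + 1]!r} at {pos}")]
    for ch in '"}],':
        candidates.append((text[:pos] + ch + text[pos:],
                           f"inserted {ch!r} at {pos}"))
    for fixed, change in candidates:
        try:
            return json.loads(fixed), change
        except json.JSONDecodeError:
            continue
    raise ValueError("no single-edit repair found")

obj, change = repair_json('{"a": 1 "b": 2}')  # missing comma
print(obj, "|", change)  # {'a': 1, 'b': 2} | inserted ',' at 8
```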
In an emergency, I might hand-edit it and make it right, but I'd absolutely insist that further files be in the correct format.
It's not just the extra compute, it's the lack of a formal specification. If different services applied this kind of ad hoc "Postel's principle", they might parse the malformed markup differently and end up introducing downstream inconsistencies.
I don't know if there's a name for this class of problem. I'd be interested to know.
Also, not invalid, but surprising / annoying (took a while to debug): an empty Lua table is the same as an empty Lua array: {}. This causes ambiguity.

    -- will print {}
    print(cjson.encode(cjson.decode('[]')))

Maps. [0] Geo-co-ordinate data can consist of tens of thousands of data points. For example, think of a two-dimensional space with a co-ordinate grid at regular intervals representing a 20 km x 20 km city. Then imagine creating an outline of the city road network, each point a LAT/LON co-ordinate. Then imagine placing thematic data such as known traffic hot spots.
Lots of data.
[0] The author has this post in his blog ~ https://peteris.rocks/blog/openstreetmap-city-blocks-as-geoj...
Validate the JSON and if it's wrong, just throw it away. It makes no sense to try to fix or guess the correct form of an input.
So you update the standard to this nicer way of dealing with double quotes, and now people forget to indicate whether they're using the nice new way or the ugly old way, or they mix the two approaches ….