Sometimes it's obvious what's wrong with malformed data you receive. A classic would be encoding errors.
But as soon as you start supporting broken components and APIs, you will never be able to unsupport them.
A prime example is HTML. Granted, in the beginning it was meant to be written by humans, but that quickly stopped being a major obstacle, and even a human can produce valid HTML with the help of a syntax checker.
I struggled with this very issue but I ultimately ended up attempting to be robust against out-of-spec feeds. A super strict feed parsing library is less useful than one that can successfully parse certain classes of broken feeds.
It is a fine line to walk -- I won't add a great deal of complexity to support overly broken feeds, but if it is relatively simple to support certain types of common mistakes I'll do it.
So it was pretty robust, but yeah, you have to draw the line somewhere.
I think, as long as it doesn't compromise the design of your program (for example, parsing rfc822 dates with localized weekdays) it's fine to be a bit lenient in what you accept.
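Since the weekday in an RFC 822 date is redundant (the day, month, and year already determine it), a lenient parser can simply drop whatever precedes the comma before parsing. A minimal Python sketch of that idea; the German "Mi," input below is a made-up example, and localized month names would still need separate handling:

```python
from email.utils import parsedate_to_datetime

def parse_lenient_rfc822(value):
    """Parse an RFC 822 date, tolerating a localized weekday name.

    The weekday token is redundant, so drop everything up to the
    first comma and parse the remainder normally.
    """
    if "," in value:
        _, _, rest = value.partition(",")
        value = rest.strip()
    return parsedate_to_datetime(value)

# e.g. a feed using a German weekday abbreviation:
dt = parse_lenient_rfc822("Mi, 01 Oct 2014 12:00:00 +0200")
print(dt.isoformat())  # 2014-10-01T12:00:00+02:00
```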
Anything that goes beyond that needs a very good reason.
Apropos escaping: people are most likely to do this if they never wrote PHP websites as kids and never went through `urlencode` and `mysql_escape_string` hell.
I'm on my phone now, but later today I'll test to see if it would have worked for the author. It's good for cleaning up JSON, but I would be wary of putting it (or anything like it) anywhere near production.
Maybe I'm old fashioned - I'm all for flexible APIs and all, but to the point above: if a customer sends rotten stuff, it should just be rejected with a 4xx code.
At minimum, check to make sure it is proper JSON... I know that a lot of stream processors will put it into a queue, return 200 right away, and then process in the background, but I don't think that ensuring it is at least valid JSON and doesn't have a content size of more than X could be too intensive.
In this case, if the data was already accepted and you've got no choice but to deal with it, you've gotta do what you got to do. I've been there, and it ain't fun cleaning up a 900 GB JSON file.
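The cheap up-front checks suggested above (a size cap plus a well-formedness pass) fit in a few lines. A sketch with the standard library; the byte limit and the exact status codes are illustrative assumptions, not anyone's production values:

```python
import json

MAX_BYTES = 1_000_000  # assumed size cap

def accept(payload: bytes):
    """Sanity-check a payload before queueing it for background work.

    Returns an (http_status, message) pair: reject oversized or
    malformed bodies with a 4xx instead of blindly answering 200.
    """
    if len(payload) > MAX_BYTES:
        return 413, "payload too large"
    try:
        json.loads(payload)
    except ValueError:
        return 400, "body is not valid JSON"
    return 202, "accepted for background processing"

print(accept(b'{"ok": true}'))  # (202, 'accepted for background processing')
print(accept(b'{"ok": tru'))    # (400, 'body is not valid JSON')
```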
In some fields, that's not an option. I do NMS engineering. If I need to set up monitoring for something, and the only source of the diagnostics I need is an endpoint that returns malformed JSON, I can't just throw my hands up and say "the data's in a shit format, I won't touch it". I'll have no choice but to get my hands dirty and parse out whatever I can because our systems need to be monitored.
I'm lucky in that the only times I had to deal with malformed JSON at this job, I was able to fix the program that was generating it because it was maintained by my team (the problem was that it was snarfing data from a database and sending it out as JSON but forgetting to escape tab characters, and unescaped tabs aren't allowed in JSON), but my luck's gonna run out some day.
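The tab bug described here is easy to reproduce: JSON built by string concatenation passes literal control characters straight through, while a real encoder escapes them. A small Python illustration (the field name is made up):

```python
import json

row = "col1\tcol2"  # a database value containing a literal tab

# Hand-rolled concatenation, the kind of bug described above:
broken = '{"value": "' + row + '"}'
try:
    json.loads(broken)
except ValueError as e:
    # literal control characters are not allowed inside JSON strings
    print("invalid:", e)

# A real encoder escapes the tab as \t for you:
fixed = json.dumps({"value": row})
print(fixed)
print(json.loads(fixed)["value"] == row)  # True
```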
The law part was supposed to be a few lines of text. Except when they don't know which article to give. In that case they provide the full law text, including scanned PDFs, base64-encoded. All 2 GB of it. Basically you have something with the meaning null, encoded in a huge string.
Now the creation of this file was handed to a third party, who don't bother finding out the relevant law and paste the 2 GB blob into every article they modify, just to be sure. At this point we have 500 000 articles in that file. We get a new one every month.
Not fun at all. But it is modern, at least, in the past it was a cobol flat file.
It would be interesting to ask the sender how .
> If the file is small enough or the data regular enough, you could fix it by hand with some search & replace.
Of course.
> But the file I had was gigabytes in size and most of it looked fine.
I suspect a faulty JSON library; it's important to figure out how the file was generated so that an issue can be opened and the bug fixed.
Why? By whom? Did you complain loudly?
If you have to correct for bit flips when you begin to read or parse data, it's too late.
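One way to catch corruption before it ever reaches the parser is to attach a checksum when the data is written. A minimal sketch using CRC-32; the 4-byte-prefix framing here is an illustration, not any established wire format:

```python
import json
import zlib

def write_record(obj):
    """Serialize obj and prepend a CRC-32 of the body, so corruption
    is detected at read time rather than 'corrected' during parsing."""
    body = json.dumps(obj).encode()
    return zlib.crc32(body).to_bytes(4, "big") + body

def read_record(blob):
    """Verify the checksum before attempting to parse the body."""
    crc, body = int.from_bytes(blob[:4], "big"), blob[4:]
    if zlib.crc32(body) != crc:
        raise ValueError("checksum mismatch: data corrupted in transit or storage")
    return json.loads(body)

rec = write_record({"a": 1})
print(read_record(rec))                      # {'a': 1}
corrupted = rec[:-1] + bytes([rec[-1] ^ 1])  # flip one bit
try:
    read_record(corrupted)
except ValueError as e:
    print(e)
```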
In other words: find multiple candidate edits that result in a valid JSON object and choose the one with the shortest distance.
Then, of course, log the changes.
I do something similar with CSVs. MSSQL is notorious for spitting out junk inside CSV files.
Also, I can guess how it was created: the code is probably in C, and a rare edge case is overwriting memory before it hits the file.
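The single-edit repair idea sketched above (try small fixes, keep one that parses, log the change) might look like this in Python; the set of candidate characters and the "first candidate wins" policy are simplifying assumptions, not production logic:

```python
import json

def repair_json(text):
    """Naive single-edit JSON repair.

    If parsing fails, try deleting the character at the reported error
    position, or inserting a likely missing character there; return the
    first candidate that parses, together with a log of what changed.
    """
    try:
        return json.loads(text), None
    except json.JSONDecodeError as e:
        pos = e.pos
    candidates = [(text[:pos] + text[pos + 1:],
                   f"deleted {text[pos:pos + 1]!r} at {pos}")]
    for ch in '"}],':
        candidates.append((text[:pos] + ch + text[pos:],
                           f"inserted {ch!r} at {pos}"))
    for fixed, change in candidates:
        try:
            return json.loads(fixed), change
        except json.JSONDecodeError:
            continue
    raise ValueError("no single-edit repair found")

obj, change = repair_json('{"a": 1 "b": 2}')  # missing comma
print(obj, "|", change)  # {'a': 1, 'b': 2} | inserted ',' at 8
```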
In an emergency, I might hand-edit it and make it right, but I'd absolutely insist that further files be in the correct format.
It's not just the extra compute, it's the lack of a formal specification. If different services applied this kind of ad hoc "Postel's principle", they might parse the malformed markup differently and end up introducing downstream inconsistencies.
I don't know if there's a name for this class of problem. I'd be interested to know.
Also, not invalid, but surprising / annoying (took a while to debug): an empty Lua table is the same as an empty Lua array: {}. This causes ambiguity.

    -- will print {}
    print(cjson.encode(cjson.decode('[]')))

Maps. [0] Geo-co-ordinate data can consist of tens of thousands of data points. For example, think of a two-dimensional space with a co-ordinate grid at regular intervals representing a 20 km x 20 km city. Then imagine creating an outline of the city road network, each point a LAT/LON co-ordinate. Then imagine placing thematic data such as known traffic hot spots.
Lots of data.
[0] The author has this post in his blog ~ https://peteris.rocks/blog/openstreetmap-city-blocks-as-geoj...
Validate the JSON and if it's wrong, just throw it away. It makes no sense to try to fix or guess the correct form of an input.
So you update the standard to this nicer way of dealing with double quotes, and now people forget to indicate whether they're using the nice new way or the ugly old way, or they mix the two approaches ….