I wonder if that kind of front end was done in the age of SAX parsers?
Such a well-written paper.
The Jakarta JSON streaming API sort of gets at this:
https://jakarta.ee/specifications/platform/9/apidocs/jakarta...
The basic interface to a JSON document is something like an iterator, which lets you advance through the document, token by token, and read out values when you encounter them. So if you have an array of objects with x and y fields, you read a start of array, start of object, key "x", first x value, key "y", first y value, end of object, start of object, key "x", second x value, key "y", second y value, end of object, etc. Reading tokens, not anything tree/DOM-like. But there are also methods getObject() and getArray(), which pull a whole structure out of the document from wherever the iterator has got to. So you could read start of array, read object, read object, etc. That lets you process a document incrementally, without having to materialise the whole thing as a tree, but still having a nice tree-like interface at the leaves.
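In Python terms, that event stream might look like the following toy sketch. The event names mirror jakarta.json's JsonParser.Event, but the regex tokenizer and the tokens() generator are my own invention for illustration, not the real API, and they skip whitespace and error handling entirely:

```python
import json
import re

# Toy analogue of the Jakarta JsonParser event model (hypothetical sketch,
# not the real jakarta.json API): a generator yielding (event, value) pairs.
TOKEN = re.compile(
    r'[\[\]{}:,]|"(?:[^"\\]|\\.)*"|-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?|true|false|null'
)

def tokens(text):
    stack = []          # which container we are in: 'obj' or 'arr'
    want_key = False    # inside an object, the next string is a key
    for match in TOKEN.finditer(text):
        tok = match.group()
        if tok == '{':
            stack.append('obj'); want_key = True
            yield ('START_OBJECT', None)
        elif tok == '}':
            stack.pop()
            yield ('END_OBJECT', None)
        elif tok == '[':
            stack.append('arr'); want_key = False
            yield ('START_ARRAY', None)
        elif tok == ']':
            stack.pop()
            yield ('END_ARRAY', None)
        elif tok == ',':
            want_key = bool(stack) and stack[-1] == 'obj'
        elif tok == ':':
            want_key = False
        elif tok.startswith('"'):
            yield (('KEY_NAME' if want_key else 'VALUE_STRING'), json.loads(tok))
        elif tok in ('true', 'false', 'null'):
            yield ('VALUE_' + tok.upper(), None if tok == 'null' else tok == 'true')
        else:
            yield ('VALUE_NUMBER', json.loads(tok))
```

Running it over an array of two {x, y} objects yields exactly the sequence described above: START_ARRAY, START_OBJECT, KEY_NAME "x", the first x value, and so on through END_ARRAY.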
In principle, you could implement getObject() and getArray() in a way which does not eagerly materialise their contents - each node could know a range in a backing buffer, and parse contents on demand. But I don't think implementations actually do this.
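For illustration, the "each node knows a range in a backing buffer" idea could be sketched in Python like this. The names (value_span, LazyNode) and the approach are mine, not any real implementation's; the scanner only handles the bracket-matching needed to find a subtree's extent:

```python
import json

def value_span(text, start):
    """Return the end index (exclusive) of the object or array starting at
    `start`, by scanning for the matching close bracket while skipping over
    string contents."""
    open_ch = text[start]
    close_ch = {'{': '}', '[': ']'}[open_ch]
    depth = 0
    in_string = False
    i = start
    while i < len(text):
        c = text[i]
        if in_string:
            if c == '\\':
                i += 1              # skip the escaped character
            elif c == '"':
                in_string = False
        elif c == '"':
            in_string = True
        elif c == open_ch:
            depth += 1
        elif c == close_ch:
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError('unbalanced document')

class LazyNode:
    """A node that materialises its subtree only on first access."""
    _UNSET = object()

    def __init__(self, text, start):
        self.text = text
        self.start = start
        self.end = value_span(text, start)
        self._value = LazyNode._UNSET

    def value(self):
        if self._value is LazyNode._UNSET:
            # parse just this node's slice of the backing buffer, on demand
            self._value = json.loads(self.text[self.start:self.end])
        return self._value
```

Finding the span still costs a linear scan of the subtree, which hints at why real libraries don't bother: you pay most of the tokenisation cost anyway.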
Wrapping a tree-like interface round incremental parsing that doesn't require eager parsing or retaining arbitrary amounts of data, and doesn't leak implementation details, sounds astoundingly hard, perhaps even impossible. But then I am not Daniel Lemire. And I have not read the paper.
I don't think they promise this, and I suspect this fails to parse some pathological but correct JSON files, e.g. one that starts with 50 GB of [s.
I thought that XPath over SAX was a thing, and that XSLT did SAX-like parsing, but it turns out I'm wrong. Which is logical, considering XPath can refer to previous nodes. That being said, it looks like there is streamable XSLT in XSLT 3.0, but that seems more niche.
If you start having to actually make an effort to fuss with it, then why not consider other formats?
This does have nice backwards compatibility with existing JSON stuff though, and sticking to standards is cool. But msgpack is also pretty nice.
Some would want to move to binary, but it's hard to find an ideal universal binary format.
msgpack doesn't support a binary chunk bigger than 4 GB, which is unfortunate. Also, the JavaScript library doesn't handle the Map vs. plain object distinction.
In JSON you could have a 10GB Base64 blob, such as a video, in a string, no problem (from the format side, with a library YMMV).
For one that supports up to 64-bit lengths, check out CBOR: https://cbor.io/ With libraries maybe it could be the ideal universal binary format (universal in the same sense JSON is - I've heard it called that). https://www.infoworld.com/article/3222851/what-is-json-a-bet...
I don’t care what behemoths people store in the formats they use but at the point you exceed “message size” the universality of any format is given up on. (Unless your format is designed to act as a database, like say a SQLite file.)
> In JSON you could have a 10GB Base64 blob, such as a video, in a string, no problem
Almost every stdlib json parser would choke on that, for good reason. Once you start adding partiality to a format, you get into tradeoffs with no obvious answers. Streaming giant arrays of objects? Scanning for keys in a map? How to validate duplicate keys without reading the whole file? Heck, just validating generally is now a deferred operation. Point is, it opens up a can of worms, where people argue endlessly about which use-cases are important, and meanwhile interop goes down the drain.
By all means, the stateful streaming / scanning space is both interesting and underserved. God knows we need something. Go build one, perhaps json can be used internally even. But cramming it all inside of json (or any message format) and expecting others to play along is a recipe for (another) fragmentation shitshow, imo.
In this scenario, writing a new, or bundling someone else's json library can significantly improve things.
The devices only needed a sub-range of the XML so I used an XML parser to ignore everything until I got the tag needed then read until the end tag arrived.
This avoided a DOM and the huge amount of memory needed to hold that. It was also significantly faster.
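That pattern has a direct stdlib equivalent in Python, for what it's worth. The sketch below (my own function and tag names, assuming the wanted elements are direct children of the root) uses xml.etree.ElementTree.iterparse to skip everything until the target tag ends, copying out its data and clearing finished subtrees so no full DOM accumulates:

```python
import io
import xml.etree.ElementTree as ET

def extract_items(source, target_tag):
    """Stream a large XML document and collect only subtrees named
    `target_tag`, discarding everything else as we go."""
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)            # the first event is the root's start
    found = []
    for event, elem in context:
        if event == 'end' and elem.tag == target_tag:
            # copy out the data we need before discarding the element
            found.append({child.tag: child.text for child in elem})
            root.clear()               # drop finished subtrees to bound memory
    return found

# usage: works with a file path or any file-like object
items = extract_items(io.StringIO(
    '<root><noise>skip me</noise>'
    '<item><x>1</x><y>2</y></item>'
    '<item><x>3</x><y>4</y></item></root>'), 'item')
```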
Even sending data that the mobile app doesn’t need raises flags.
{<a header about foos and bars>}
{<foo 1>}
...
{<foo N>}
{<bar 1>}
...
{<bar N>}
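A stream laid out like that can be consumed one top-level value at a time. Here is a stdlib-only Python sketch (function name is mine) using json.JSONDecoder.raw_decode, which parses a single value and reports where it stopped, so concatenated values never have to be materialised as one document:

```python
import json

def iter_concatenated(text):
    """Yield each top-level JSON value from a stream of concatenated
    values like {header}{foo 1}...{bar N}."""
    decoder = json.JSONDecoder()
    pos = 0
    n = len(text)
    while pos < n:
        while pos < n and text[pos] in ' \t\r\n':
            pos += 1                    # skip whitespace between values
        if pos >= n:
            break
        # raw_decode parses one value and returns (value, end_position)
        value, pos = decoder.raw_decode(text, pos)
        yield value
```

For a file too big for memory, the same idea applies to a sliding buffer, but that bookkeeping is omitted here.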
It is compatible with streaming, database JSON columns, and code editors. Anyway, it's the best JSON parser I've found (in any language); I implemented fastgron (https://github.com/adamritter/fastgron) on top of it because of the on-demand API's performance.
One problem with the library was that it needed extra padding at the end of the JSON, so it didn't support streaming / memory mapping.
Previously, simdjson only had a DOM model, where the entire document was parsed in one shot.
XML also has a number of features to care about, like attributes as well as elements, and potentially schemas too. It's also needlessly verbose. Even though elements open and close in a stack, there isn't a universal "close" tag. That is, if `<Tag1><Tag2></Tag1></Tag2>` is always considered malformed, then why isn't the syntax simply `<Tag1><Tag2></></>`?
XML isn't just a structured data format where close tags always run up against each other and whitespace is insignificant. It's also a descriptive document format which is often hand-authored.
I think the argument is that the close tags being named makes those documents easier for a human author to understand. It certainly is my experience.
Shameless promotion of my beta engine
https://news.ycombinator.com/item?id=39319746 - JSON Parsing: Intel Sapphire Rapids versus AMD Zen 4 - 40 points and 10 comments
Wouldn’t a quote “ also be a structural character? It doesn’t actually represent data, it just delimits the beginning and end of a string.
I get why I’m probably wrong: a string isn’t a structure of chars, because that’s not a type in JSON. The six above are the pieces of the two collection types in JSON.
You specify what you're interested in and then the parser calls your callback whenever it reads the part of a large JSON stream that has your key.
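A rough stdlib-only Python approximation of that pattern (my own sketch, not the libwebsockets API, and unlike a true streaming parser it still reads the whole document) hangs a callback off json's object_pairs_hook, which fires for each object as it finishes parsing:

```python
import json

def parse_with_callback(text, key, callback):
    """Invoke callback(value) for every object containing `key`, as each
    object finishes parsing; returns the fully parsed document."""
    def hook(pairs):
        for k, v in pairs:
            if k == key:
                callback(v)
        return dict(pairs)
    return json.loads(text, object_pairs_hook=hook)

# usage: collect every "id" value as the parser encounters it
hits = []
doc = parse_with_callback('[{"id": 1}, {"id": 2, "x": 0}, {"y": 3}]',
                          'id', hits.append)
```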
https://libwebsockets.org/lws-api-doc-main/html/md_READMEs_R...
This reminds me of oboe.js: https://github.com/jimhigson/oboe.js
What JSON isn’t valid JS?
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
"In fact, since JavaScript does not support bare objects, the simple statement {"k":"v"} will emit an error in JavaScript"
https://medium.com/@ExplosionPills/json-is-not-javascript-5d...
This is kind of a silly "well, technically". It's a valid expression; it's not a valid statement. It is valid JavaScript in the sense most people mean when asking whether something is valid JavaScript.
Hint: you need validation.
In-process validation is required. There are no trusted sources. You're confusing valid JSON with JSON that is valid according to a schema for a specific purpose.
Lazy-loading JSON parsers have no need to exist, at all, ever. This is why they don't exist.
Also you may want to stream-validate it.
A validating parser, that is. The paper clearly indicates that invalid JSON like [1, 1b] will pass, unless your code happens to try to decode the 1b.