That said, I will say one thing: you don't really need an explicit tokenizer for JSON. You can drop the concept of tokens and integrate parsing and tokenization entirely. This is what I usually do, since it makes everything simpler. It's much harder to do for the rest of ECMAScript, where you wind up needing look-ahead (sometimes arbitrarily large look-ahead... consider arrow functions: the parameter list is mostly a subset of the grammar of a parenthesized expression. Comma is an operator, and for default values, equals is an operator. It isn't until the => does or does not appear that you know for sure!)
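To make the "no explicit tokenizer" idea concrete, here is a hypothetical sketch (not the commenter's code) for a JSON subset, arrays of numbers: each parse function reads bytes directly from the buffer, so lexing and parsing happen in one pass with no token stream in between.

```go
package main

import (
	"fmt"
	"strconv"
)

// parser reads directly from buf; there is no separate token type.
type parser struct {
	buf []byte
	pos int
}

func (p *parser) skipWS() {
	for p.pos < len(p.buf) {
		c := p.buf[p.pos]
		if c != ' ' && c != '\t' && c != '\n' && c != '\r' {
			return
		}
		p.pos++
	}
}

// parseNumber consumes a numeric literal straight out of the buffer.
func (p *parser) parseNumber() (float64, error) {
	start := p.pos
	for p.pos < len(p.buf) {
		c := p.buf[p.pos]
		if (c >= '0' && c <= '9') || c == '-' || c == '+' || c == '.' || c == 'e' || c == 'E' {
			p.pos++
		} else {
			break
		}
	}
	return strconv.ParseFloat(string(p.buf[start:p.pos]), 64)
}

// parseArray dispatches on the next byte directly -- tokenization and
// parsing are fused. (Deliberately lenient about trailing commas.)
func (p *parser) parseArray() ([]float64, error) {
	p.pos++ // consume '['
	var out []float64
	for {
		p.skipWS()
		if p.pos < len(p.buf) && p.buf[p.pos] == ']' {
			p.pos++
			return out, nil
		}
		n, err := p.parseNumber()
		if err != nil {
			return nil, err
		}
		out = append(out, n)
		p.skipWS()
		if p.pos < len(p.buf) && p.buf[p.pos] == ',' {
			p.pos++
		}
	}
}

func main() {
	p := &parser{buf: []byte(" [1, 2.5, -3] ")}
	p.skipWS()
	vals, err := p.parseArray()
	fmt.Println(vals, err) // [1 2.5 -3] <nil>
}
```

A full parser adds objects, strings, and literals the same way: one dispatch on the first byte per value.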
Handling large documents is indeed another big one. It sort-of fits in the same category as being able to parse incrementally. That said, Go has a JSON scanner you can sort of use for incremental parsing, but in practice I've found it to be a lot slower, so for large documents it's a problem.
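For reference, the Go facility alluded to here is exposed through encoding/json's streaming Token API, which reads one token at a time from an io.Reader rather than materializing the whole document:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// tokens drains a JSON stream one token at a time using
// json.Decoder.Token, stopping at EOF or the first error.
func tokens(s string) []json.Token {
	dec := json.NewDecoder(strings.NewReader(s))
	var out []json.Token
	for {
		tok, err := dec.Token()
		if err != nil { // io.EOF when the stream ends cleanly
			return out
		}
		out = append(out, tok)
	}
}

func main() {
	for _, tok := range tokens(`{"a": [1, 2], "b": "x"}`) {
		fmt.Printf("%T %v\n", tok, tok)
	}
}
```

The per-token interface avoids buffering the whole document, but, as noted, it tends to be noticeably slower than a one-shot Unmarshal.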
I've done a couple in hobby projects too. One time I did a partial one in Win32-style C89 because I wanted one that didn't depend on libc.
In another instance, it was easier to parse into some application-specific structures, skipping the whole intermediate generic step (for performance reasons).
With JSON it's easier to convince your boss that you can actually write such a parser because the language is relatively simple (if you overlook botched definitions of basically every element...) So, for example, if the application that uses JSON is completely under your control, you may take advantage of stupid decisions made by JSON authors to simplify many things. More concretely, you can decide that there will never be more than X digits in numbers. That you will never use "null". Or that you will always put elements of the same type into "lists". Or that you will never repeat keys in "hash tables".
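As one illustration of exploiting such an application-specific guarantee, here is a hypothetical restricted number parser: if your application promises non-negative integers of at most 18 digits, the parser needs no overflow checks and no floating-point path (18 decimal digits always fit in an int64).

```go
package main

import "fmt"

// parseSmallInt parses a non-negative decimal integer of at most 18
// digits. The digit cap is the application-level guarantee described
// above; it makes overflow impossible for int64.
func parseSmallInt(b []byte) (int64, bool) {
	if len(b) == 0 || len(b) > 18 {
		return 0, false
	}
	var n int64
	for _, c := range b {
		if c < '0' || c > '9' {
			return 0, false
		}
		n = n*10 + int64(c-'0')
	}
	return n, true
}

func main() {
	n, ok := parseSmallInt([]byte("12345"))
	fmt.Println(n, ok) // 12345 true
}
```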
I've had problems with handling streams like the OP on basically every programming language and data-encoding language pair that I've tried. It looks like nobody ever thinks about it (I do use chunking any time I can, but sometimes you can't).
There are probably lots and lots of reasons to write your own parser.
If you're going for pure performance in a production environment you might take a look at Daniel Lemire's work: https://github.com/simdjson/simdjson. Or the MinIO port of it to Go: https://github.com/minio/simdjson-go.
Perhaps some macro-ridden Rust monstrosity that spits out specialised parsers at compile time, dynamically…
Avoid heap allocations in tokenising: have a tokeniser that is a function returning a stack-allocated struct, or an int64 token with packed fields describing the token's start offset, length, and type.
Avoid heap allocations in parsing: support a getString(key String) type interface for clients that want to chop up a buffer.
For deserialising to objects whose fields you know at compile time, generate a switch on key length before comparing string values.
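The length-switch idea looks roughly like this (the struct and field names are hypothetical): keys of a different length never reach a string comparison at all, so most mismatches are rejected by an integer compare.

```go
package main

import "fmt"

// User is a hypothetical deserialization target.
type User struct {
	ID, Name, Email string
}

// setField dispatches on key length first; only same-length keys
// fall through to a full string comparison.
func setField(u *User, key, val string) {
	switch len(key) {
	case 2:
		if key == "id" {
			u.ID = val
		}
	case 4:
		if key == "name" {
			u.Name = val
		}
	case 5:
		if key == "email" {
			u.Email = val
		}
	}
}

func main() {
	var u User
	setField(&u, "name", "Ada")
	setField(&u, "email", "ada@example.com")
	fmt.Printf("%+v\n", u)
}
```

Code generators typically emit exactly this shape from the struct definition rather than writing it by hand.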
My experience in data pipelines that process lots of JSON is that the choice of JSON library can make a 3-10x performance difference, and that all the main parsers want to allocate objects.
If the classes you are serialising or deserialising are known at compile time, then Jackson (Java) does a good job, but you can get a 2x boost with careful coding and profiling.
Whereas if you are parsing arbitrary JSON, all the mainstream parsers want to do lots of allocations that a more intrusive parser you write yourself can avoid; that can yield massive performance wins if you are processing thousands or millions of objects per second.
I gave a talk about this; unfortunately it wasn't recorded. I tried to squeeze as much out of Go as I could, and I went crazy doing that :D
Or does this bite us in other ways too?
I came up with what I think is a kind of neat solution, which is that the tokenizer is generic over some T and takes a function from byteslice to T and uses T in place of the strings. This way, when the caller has some more efficient representation available (like one that allocates less) it can provide one, but I can still unit test the tokenizer with the identity function for convenience.
In a sense this is like fusing the tokenizer with the parser at build time, but the generic allows layering the tokenizer such that it doesn't know about the parser's representation.
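A minimal sketch of that shape (my reconstruction of the idea, not the commenter's code): the scanner is generic over T and takes a []byte -> T conversion, so a parser can plug in a cheaper representation (interned IDs, zero-copy spans) while unit tests just convert to string. The toy scanner below only extracts double-quoted runs and ignores escapes.

```go
package main

import "fmt"

// scanStrings yields every double-quoted run in buf as a T, using the
// caller-supplied conversion. The scanner itself never allocates a
// string, so the representation cost is entirely the caller's choice.
func scanStrings[T any](buf []byte, conv func([]byte) T) []T {
	var out []T
	for i := 0; i < len(buf); i++ {
		if buf[i] == '"' {
			j := i + 1
			for j < len(buf) && buf[j] != '"' {
				j++
			}
			out = append(out, conv(buf[i+1:j]))
			i = j
		}
	}
	return out
}

func main() {
	// For unit tests: T = string via the trivial conversion.
	fmt.Println(scanStrings([]byte(`{"a": "bc"}`),
		func(b []byte) string { return string(b) }))
}
```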
In my mind, JSON is inherently inadequate for streaming because of its hierarchical structure, no length known upfront, and most importantly, repeating keys. You could probably make a subset of JSON more streaming-friendly, but at that point, why bother? I mean, if the solution is to modify JSON, then a better solution would be something that's not JSON at all.
The baseline whitespace count/search operation seems like it would be MUCH faster if you vectorized it with SIMD, but I can understand that being out of scope for the author.
The comment is about having a directive or annotation to make the compiler inline the function for you, which Go does not have. IMO the pre-inline code was cleaner. It's a shame the compiler could not optimize it.
There was once a proposal for this, but it's really against Go's design as a language.
Sure, you could argue Go isn't the right tool for the job, but I don't see why it can't be done with the right optimizations, like this effort.
I've pumped gigs of JSON data, so a streaming parser is appreciated. Plus, streaming shows the author is better at engineering and aware of the various use cases.
Memory is not cheap or free except in theory.
Do you mean XML SAX-like interface? If so, how do you deal with repeated keys in "hash tables"? Do you first translate JSON into intermediate objects (i.e. arrays, hash-tables) and then transform them into application-specific structures, or do you try to skip the intermediate step?
I mean, streaming tokens is kind of worthless on its own. If you are going for a SAX-like interface, you want to be able to go all the way with streaming (i.e. no layer of the code that reads JSON should "accumulate" data, especially not indefinitely, before it can be sent to the layer above).
I mean, you have a 1k, 2k, 4k buffer. Why use more, because it's too much work?
That line feels like a troll. Cunningham’s Law in action.
You can definitely go faster than 2 GB/sec. In a word, SIMD.
Problem A: read a stream of bytes, parse it as JSON
Problem B: read a stream of bytes, count how many bytes match a JSON whitespace character
Problem B should require fewer resources* to solve than problem A. So in that sense problem B is a relaxation of problem A, and a highly efficient implementation of problem B should be able to process bytes much more efficiently than an "optimal" implementation of problem A.
So in this sense, we can probably all agree with the author that counting whitespace bytes is an easier problem than the full parsing problem.
We're agreed that the author's implementation (half a page of go code that fits on a talk slide) to solve problem B isn't the most efficient way to solve problem B.
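For anyone following along without the talk slides, a scalar sketch of problem B (my version, not necessarily the author's half-page) is just a byte loop over the four JSON whitespace characters:

```go
package main

import "fmt"

// countJSONWhitespace counts bytes matching JSON's four whitespace
// characters (space, tab, newline, carriage return) -- "problem B".
// A SIMD version would classify 16-64 bytes per instruction instead.
func countJSONWhitespace(buf []byte) int {
	n := 0
	for _, c := range buf {
		switch c {
		case ' ', '\t', '\n', '\r':
			n++
		}
	}
	return n
}

func main() {
	doc := []byte("{\n  \"a\": 1,\n  \"b\": [2, 3]\n}\n")
	fmt.Println(countJSONWhitespace(doc))
}
```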
I remember reading somewhere the advice that to set a really solid target for benchmarking, you should avoid measuring the performance of implementations and instead try to estimate a theoretical upper bound on performance, based on say a simplified model of how the hardware works and a simplification of the problem -- that hopefully still captures the essence of what the bottleneck is. Then you can compare any implementation to that (unreachable) theoretical upper bound, to get more of an idea of how much performance is still left on the table.
* for reasonably boring choices of target platform, e.g. amd64 + ram, not some hypothetical hardware platform with surprisingly fast dedicated support for JSON parsing and bad support for anything else.
simdjson (https://github.com/simdjson/simdjson) claims to fully parse JSON on the order of 3 GB/sec. Which is faster than OP's Go whitespace parsing! These tests are running on different hardware so it's not apples-to-apples.
The phrase "cannot go faster than this" is just begging for a "well ackshully". Which I hate to do. But the fact that there is an existence proof of Problem A running faster in C++ SIMD than OP's Problem B scalar Go is quite interesting and worth calling out imho. But I admit it doesn't change the rest of the post.
From 1989:
https://raw.githubusercontent.com/spitbol/x32/master/docs/sp...
"Indirection in the Goto field is a more powerful version of the computed Goto which appears in some languages. It allows a program to quickly perform a multi-way control branch based on an item of data."
I don't know how "true" that comment is but I thought I should try to write a parser myself to get a feel :D
So I wrote one, in Python - https://arunmani.in/articles/silly-json-parser/
It was a delightful experience though, writing and testing to break your own code with a variety of inputs. :)
> I don't know how "true" that comment is
Either way it's a good way to get a pair of quadratic loops in your program: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...
Link here, if interested to have a look: https://github.com/rgex/jsoncut
I thought it would be nice and simple, and it was actually even simpler than I expected. It's a fantastic spec if you need to throw one together yourself, without massive performance considerations.
Funny enough, I stumbled upon your article just yesterday through a Google search.
His lib became the de facto JSON lib in .NET dev and, naturally, MS head-hunted him.
Fast JSON is that important these days.
"IMPORTANT: do not run the scripts below this line, they are for CICD only": true,

As for not having trailing commas, it's probably a less intentional bad design choice.
That said, if you want commas and comments, and control the parsers that will be used for your JSON, then use JSONC (JSON with comments). VSCode for example does that for its JSON configuration.
TOML/Yaml always drive me batty with all their obscure special syntax. Whereas it's almost impossible to look at a formatted blob of JSON and not have a very solid understanding of what it represents.
The one thing I might add is multiline strings with `'s, but even that is probably more trouble than it's worth, as you immediately start going down the path of "well let's also have syntax to strip the indentation from those strings, maybe we should add new syntax to support raw strings, ..."
[1] https://web.archive.org/web/20150105080225/https://plus.goog... (thank you Internet Archive for making Google’s social network somewhat accessible and less than useless)
Other flavors of JSON that include support for comments and trailing commas exist, but they are reasonably called by different names. One of these is YAML (mostly a superset of JSON). To some extent the difficulties with YAML (like unquoted ‘no’ being a synonym for false) have vindicated Crockford’s priorities.
What you can do is: write comments using pound sign, and rename your files to YAML. You will also get a benefit of a million ways of writing multiline strings -- very confusing, but sometimes useful.
If you want comments, you can always use jsonc.
The most important thing, though, is the process: measure then optimize.
Is "\u0000" legal JSON?
P.S. ... and many other control characters < \u0020
Note the distinct lack of:

if err != nil {

https://gist.github.com/JacksonKearl/6778c02bf85495d1e39291c...
Some example test cases:
{ input: '[{"a": 0, "b":', output: [{ a: 0 }] },
{ input: '[{"a": 0, "b": 1', output: [{ a: 0, b: 1 }] },
{ input: "[{},", output: [{}] },
{ input: "[{},1", output: [{}, 1] },
{ input: '[{},"', output: [{}, ""] },
{ input: '[{},"abc', output: [{}, "abc"] },
Work could be done to optimize it, for instance adding streaming support. But the cycles consumed either way are minimal for LLM-output-length-constrained JSON.

Fun fact: as best I can tell, GPT-4 is entirely unable to synthesize code to accomplish this task. Perhaps that will change as this implementation is made public; I do not know.
Forgiving parsers/lexers are common in compilers for languages like Rust, C#, or TypeScript; you may want to investigate TypeScript in particular since it's applicable to JSON syntax. Maybe you could repurpose their parser.
I've been looking into something similar for handling partial JSON, where you only have the first n chars of a document. This is common with streamed LLM outputs aimed at reducing latency. If one knows the JSON schema ahead of time, one can start processing the first fields before the remaining data has fully loaded. If you have to wait for the whole thing to load, there is little point in streaming.
Was looking for a library that could do this parsing.