That said, I will say one thing: you don't really need an explicit tokenizer for JSON. You can drop the concept of tokens and integrate parsing and tokenization entirely. This is what I usually do, since it makes everything simpler. It's much harder to do for the rest of ECMAScript, where you wind up needing look-ahead (sometimes arbitrarily large look-ahead... consider arrow functions: the parameter list is mostly a subset of the grammar of a parenthesized expression. Comma is an operator, and for default values, equals is an operator. It isn't until the => does or does not appear that you know for sure!)
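To make the "no explicit tokenizer" idea concrete, here is a hypothetical sketch (not the commenter's code) for a JSON subset, arrays of numbers: each parse function reads bytes directly from the buffer, so lexing and parsing happen in one pass with no token stream in between.

```go
package main

import (
	"fmt"
	"strconv"
)

// parser reads directly from buf; there is no separate token type.
type parser struct {
	buf []byte
	pos int
}

func (p *parser) skipWS() {
	for p.pos < len(p.buf) {
		c := p.buf[p.pos]
		if c != ' ' && c != '\t' && c != '\n' && c != '\r' {
			return
		}
		p.pos++
	}
}

// parseNumber consumes a numeric literal straight out of the buffer.
func (p *parser) parseNumber() (float64, error) {
	start := p.pos
	for p.pos < len(p.buf) {
		c := p.buf[p.pos]
		if (c >= '0' && c <= '9') || c == '-' || c == '+' || c == '.' || c == 'e' || c == 'E' {
			p.pos++
		} else {
			break
		}
	}
	return strconv.ParseFloat(string(p.buf[start:p.pos]), 64)
}

// parseArray dispatches on the next byte directly -- tokenization and
// parsing are fused. (Deliberately lenient about trailing commas.)
func (p *parser) parseArray() ([]float64, error) {
	p.pos++ // consume '['
	var out []float64
	for {
		p.skipWS()
		if p.pos < len(p.buf) && p.buf[p.pos] == ']' {
			p.pos++
			return out, nil
		}
		n, err := p.parseNumber()
		if err != nil {
			return nil, err
		}
		out = append(out, n)
		p.skipWS()
		if p.pos < len(p.buf) && p.buf[p.pos] == ',' {
			p.pos++
		}
	}
}

func main() {
	p := &parser{buf: []byte(" [1, 2.5, -3] ")}
	p.skipWS()
	vals, err := p.parseArray()
	fmt.Println(vals, err) // [1 2.5 -3] <nil>
}
```

A full parser adds objects, strings, and literals the same way: one dispatch on the first byte per value.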
Handling large documents is indeed another big one. It sort-of fits in the same category as being able to parse incrementally. That said, Go has a JSON scanner you can sort of use for incremental parsing, but in practice I've found it to be a lot slower, so for large documents it's a problem.
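For reference, the Go facility alluded to here is exposed through encoding/json's streaming Token API, which reads one token at a time from an io.Reader rather than materializing the whole document:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// tokens drains a JSON stream one token at a time using
// json.Decoder.Token, stopping at EOF or the first error.
func tokens(s string) []json.Token {
	dec := json.NewDecoder(strings.NewReader(s))
	var out []json.Token
	for {
		tok, err := dec.Token()
		if err != nil { // io.EOF when the stream ends cleanly
			return out
		}
		out = append(out, tok)
	}
}

func main() {
	for _, tok := range tokens(`{"a": [1, 2], "b": "x"}`) {
		fmt.Printf("%T %v\n", tok, tok)
	}
}
```

The per-token interface avoids buffering the whole document, but, as noted, it tends to be noticeably slower than a one-shot Unmarshal.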
I've done a couple in hobby projects too. One time I did a partial one in Win32-style C89 because I wanted one that didn't depend on libc.
In another instance, it was easier to parse into some application-specific structures, skipping the whole intermediate generic step (for performance reasons).
With JSON it's easier to convince your boss that you can actually write such a parser because the language is relatively simple (if you overlook botched definitions of basically every element...) So, for example, if the application that uses JSON is completely under your control, you may take advantage of stupid decisions made by JSON authors to simplify many things. More concretely, you can decide that there will never be more than X digits in numbers. That you will never use "null". Or that you will always put elements of the same type into "lists". Or that you will never repeat keys in "hash tables".
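As one illustration of exploiting such an application-specific guarantee, here is a hypothetical restricted number parser: if your application promises non-negative integers of at most 18 digits, the parser needs no overflow checks and no floating-point path (18 decimal digits always fit in an int64).

```go
package main

import "fmt"

// parseSmallInt parses a non-negative decimal integer of at most 18
// digits. The digit cap is the application-level guarantee described
// above; it makes overflow impossible for int64.
func parseSmallInt(b []byte) (int64, bool) {
	if len(b) == 0 || len(b) > 18 {
		return 0, false
	}
	var n int64
	for _, c := range b {
		if c < '0' || c > '9' {
			return 0, false
		}
		n = n*10 + int64(c-'0')
	}
	return n, true
}

func main() {
	n, ok := parseSmallInt([]byte("12345"))
	fmt.Println(n, ok) // 12345 true
}
```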
I've had problems with handling streams like the OP on basically every programming language and data-encoding language pair that I've tried. It looks like nobody ever thinks about it (I do use chunking any time I can, but sometimes you can't).
There are probably lots and lots of reasons to write your own parser.
If you're going for pure performance in a production environment you might take a look at Daniel Lemire's work: https://github.com/simdjson/simdjson. Or the MinIO port of it to Go: https://github.com/minio/simdjson-go.
Perhaps some macro-ridden Rust monstrosity that spits out specialised parsers at compile time, dynamically…
Avoid heap allocations in tokenising: have a tokeniser that is a function returning a stack-allocated struct, or an int64 token with packed fields describing the token's start offset, length, and type.
Avoid heap allocations in parsing: support a getString(key String) type interface for clients that want to chop up a buffer.
For deserialising to objects whose fields you know at compile time, generate a switch on key length before comparing string values.
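The length-switch idea looks roughly like this (the struct and field names are hypothetical): keys of a different length never reach a string comparison at all, so most mismatches are rejected by an integer compare.

```go
package main

import "fmt"

// User is a hypothetical deserialization target.
type User struct {
	ID, Name, Email string
}

// setField dispatches on key length first; only same-length keys
// fall through to a full string comparison.
func setField(u *User, key, val string) {
	switch len(key) {
	case 2:
		if key == "id" {
			u.ID = val
		}
	case 4:
		if key == "name" {
			u.Name = val
		}
	case 5:
		if key == "email" {
			u.Email = val
		}
	}
}

func main() {
	var u User
	setField(&u, "name", "Ada")
	setField(&u, "email", "ada@example.com")
	fmt.Printf("%+v\n", u)
}
```

Code generators typically emit exactly this shape from the struct definition rather than writing it by hand.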
My experience in data pipelines that process lots of JSON is that the choice of JSON library can make a 3-10x performance difference, and that all the main parsers want to allocate objects.
If the classes you are serialising or deserialising are known at compile time, then Jackson (Java) does a good job, but you can get a 2x boost with careful coding and profiling.
Whereas if you are parsing arbitrary JSON, all the mainstream parsers want to do lots of allocations that a more intrusive parser you write yourself can avoid; that can yield massive performance wins if you are processing thousands or millions of objects per second.
I gave a talk about this; unfortunately it wasn't recorded. I tried to squeeze as much out of Go as I could, and I went crazy doing that :D
Or does this bite us in other ways too?
I came up with what I think is a kind of neat solution, which is that the tokenizer is generic over some T and takes a function from byteslice to T and uses T in place of the strings. This way, when the caller has some more efficient representation available (like one that allocates less) it can provide one, but I can still unit test the tokenizer with the identity function for convenience.
In a sense this is like fusing the tokenizer with the parser at build time, but the generic allows layering the tokenizer such that it doesn't know about the parser's representation.
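A minimal sketch of that shape (my reconstruction of the idea, not the commenter's code): the scanner is generic over T and takes a []byte -> T conversion, so a parser can plug in a cheaper representation (interned IDs, zero-copy spans) while unit tests just convert to string. The toy scanner below only extracts double-quoted runs and ignores escapes.

```go
package main

import "fmt"

// scanStrings yields every double-quoted run in buf as a T, using the
// caller-supplied conversion. The scanner itself never allocates a
// string, so the representation cost is entirely the caller's choice.
func scanStrings[T any](buf []byte, conv func([]byte) T) []T {
	var out []T
	for i := 0; i < len(buf); i++ {
		if buf[i] == '"' {
			j := i + 1
			for j < len(buf) && buf[j] != '"' {
				j++
			}
			out = append(out, conv(buf[i+1:j]))
			i = j
		}
	}
	return out
}

func main() {
	// For unit tests: T = string via the trivial conversion.
	fmt.Println(scanStrings([]byte(`{"a": "bc"}`),
		func(b []byte) string { return string(b) }))
}
```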
In my mind, JSON is inherently inadequate for streaming because of its hierarchical structure, no length known upfront, and most importantly, repeating keys. You could probably make a subset of JSON more streaming-friendly, but at that point, why bother? I mean, if the solution is to modify JSON, then a better solution would be something that's not JSON at all.
The baseline whitespace count/search operation seems like it would be MUCH faster if you vectorized it with SIMD, but I can understand that being out of scope for the author.
The comment is about having a directive or annotation to make the compiler inline the function for you, which Go does not have. IMO the pre-inline code was cleaner. It's a shame the compiler could not optimize it.
There was once a proposal for this, but it's really against Go's design as a language.
Sure, you could argue Go isn't the right tool for the job, but I don't see why it can't be done with the right optimizations, like this effort.
I've pumped gigs of JSON data, so a streaming parser is appreciated. Plus, streaming shows the author is better at engineering and aware of the various use cases.
Memory is not cheap or free except in theory.
Do you mean XML SAX-like interface? If so, how do you deal with repeated keys in "hash tables"? Do you first translate JSON into intermediate objects (i.e. arrays, hash-tables) and then transform them into application-specific structures, or do you try to skip the intermediate step?
I mean, streaming tokens is kind of worthless on its own. If you are going for a SAX-like interface, you want to be able to go all the way with streaming (i.e. no layer of the code that reads JSON should "accumulate" data, especially not indefinitely, before it can be sent to the layer above).
I mean, you have a 1k, 2k, 4k buffer. Why use more, because it's too much work?
That line feels like a troll. Cunningham’s Law in action.
You can definitely go faster than 2 GB/sec. In a word, SIMD.
Problem A: read a stream of bytes, parse it as JSON
Problem B: read a stream of bytes, count how many bytes match a JSON whitespace character
Problem B should require fewer resources* to solve than problem A. So in that sense problem B is a relaxation of problem A, and a highly efficient implementation of problem B should be able to process bytes much more efficiently than an "optimal" implementation of problem A.
So in this sense, we can probably all agree with the author that counting whitespace bytes is an easier problem than the full parsing problem.
We're agreed that the author's implementation (half a page of go code that fits on a talk slide) to solve problem B isn't the most efficient way to solve problem B.
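For anyone following along without the talk slides, a scalar sketch of problem B (my version, not necessarily the author's half-page) is just a byte loop over the four JSON whitespace characters:

```go
package main

import "fmt"

// countJSONWhitespace counts bytes matching JSON's four whitespace
// characters (space, tab, newline, carriage return) -- "problem B".
// A SIMD version would classify 16-64 bytes per instruction instead.
func countJSONWhitespace(buf []byte) int {
	n := 0
	for _, c := range buf {
		switch c {
		case ' ', '\t', '\n', '\r':
			n++
		}
	}
	return n
}

func main() {
	doc := []byte("{\n  \"a\": 1,\n  \"b\": [2, 3]\n}\n")
	fmt.Println(countJSONWhitespace(doc))
}
```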
I remember reading somewhere the advice that to set a really solid target for benchmarking, you should avoid measuring the performance of implementations and instead try to estimate a theoretical upper bound on performance, based on say a simplified model of how the hardware works and a simplification of the problem -- that hopefully still captures the essence of what the bottleneck is. Then you can compare any implementation to that (unreachable) theoretical upper bound, to get more of an idea of how much performance is still left on the table.
* for reasonably boring choices of target platform, e.g. amd64 + ram, not some hypothetical hardware platform with surprisingly fast dedicated support for JSON parsing and bad support for anything else.
simdjson (https://github.com/simdjson/simdjson) claims to fully parse JSON on the order of 3 GB/sec. Which is faster than OP's Go whitespace parsing! These tests are running on different hardware so it's not apples-to-apples.
The phrase "cannot go faster than this" is just begging for a "well ackshully". Which I hate to do. But the fact that there is an existence proof of Problem A running faster in C++ SIMD than OP's Problem B scalar Go is quite interesting and worth calling out imho. But I admit it doesn't change the rest of the post.
From 1989:
https://raw.githubusercontent.com/spitbol/x32/master/docs/sp...
"Indirection in the Goto field is a more powerful version of the computed Goto which appears in some languages. It allows a program to quickly perform a multi-way control branch based on an item of data."
I don't know how "true" that comment is but I thought I should try to write a parser myself to get a feel :D
So I wrote one, in Python - https://arunmani.in/articles/silly-json-parser/
It was a delightful experience though, writing and testing to break your own code with a variety of inputs. :)
> I don't know how "true" that comment is
Either way it's a good way to get a pair of quadratic loops in your program: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...
Link here, if interested to have a look: https://github.com/rgex/jsoncut
I thought it would be nice and simple, and it was actually even simpler than I expected. It's a fantastic spec if you need to throw one together yourself, without massive performance considerations.
Funny enough, I stumbled upon your article just yesterday through a Google search.
His lib became the de facto JSON lib in .NET dev and, naturally, MS head-hunted him.
Fast JSON is that important these days.
"IMPORTANT: do not run the scripts below this line, they are for CICD only": true,

As for not having trailing commas, it's probably a less intentional bad design choice.
That said, if you want commas and comments, and control the parsers that will be used for your JSON, then use JSONC (JSON with comments). VSCode for example does that for its JSON configuration.
TOML/Yaml always drive me batty with all their obscure special syntax. Whereas it's almost impossible to look at a formatted blob of JSON and not have a very solid understanding of what it represents.
The one thing I might add is multiline strings with `'s, but even that is probably more trouble than it's worth, as you immediately start going down the path of "well let's also have syntax to strip the indentation from those strings, maybe we should add new syntax to support raw strings, ..."
[1] https://web.archive.org/web/20150105080225/https://plus.goog... (thank you Internet Archive for making Google’s social network somewhat accessible and less than useless)
Other flavors of JSON that include support for comments and trailing commas exist, but they are reasonably called by different names. One of these is YAML (mostly a superset of JSON). To some extent the difficulties with YAML (like unquoted ‘no’ being a synonym for false) have vindicated Crockford’s priorities.
What you can do is: write comments using pound sign, and rename your files to YAML. You will also get a benefit of a million ways of writing multiline strings -- very confusing, but sometimes useful.
If you want comments, you can always use jsonc.
The most important thing, though, is the process: measure then optimize.
Is "\u0000" legal JSON?
P.S. ... and many other control characters < \u0020
Note the distinct lack of:

if err != nil {

https://gist.github.com/JacksonKearl/6778c02bf85495d1e39291c...
Some example test cases:
{ input: '[{"a": 0, "b":', output: [{ a: 0 }] },
{ input: '[{"a": 0, "b": 1', output: [{ a: 0, b: 1 }] },
{ input: "[{},", output: [{}] },
{ input: "[{},1", output: [{}, 1] },
{ input: '[{},"', output: [{}, ""] },
{ input: '[{},"abc', output: [{}, "abc"] },
Work could be done to optimize it, for instance adding streaming support. But the cycles consumed either way are minimal for LLM-output-length-constrained JSON.

Fun fact: as best I can tell, GPT-4 is entirely unable to synthesize code to accomplish this task. Perhaps that will change as this implementation is made public; I do not know.
Forgiving parsers/lexers are common in compilers for languages like Rust, C#, or TypeScript; you may want to investigate TypeScript in particular since it's applicable to JSON syntax. Maybe you could repurpose their parser.
I've been looking into something similar for handling partial JSON, where you only have the first n chars of a document. This is common with streamed LLM outputs aimed at reducing latency. If one knows the JSON schema ahead of time, one can start processing the first fields before the remaining data has fully loaded. If you have to wait for the whole thing to load, there is little point in streaming.
Was looking for a library that could do this parsing.