That said, it also includes an event-based parser (called JSONDecoder), so if you want to handle events in order to decode into your own data structure and skip the intermediate JSON data structure, you might be able to go faster than JSONSerialization that way.
> JSON. The protocol for front-end / back-end communication, as well as between the back-end and plug-ins, is based on simple JSON messages. I considered binary formats, but the actual improvement in performance would be completely in the noise. Using JSON considerably lowers friction for developing plug-ins, as it’s available out of the box for most modern languages, and there are plenty of libraries available for the rest.
1: https://github.com/xi-editor/xi-editor/blob/master/README.md...
19.98 s swift_getGenericMetadata
19.15 s newJSONString
16.17 s objc_msgSend
15.33 s _swift_release_(swift::HeapObject*)
14.45 s tiny_malloc_should_clear
12.81 s _swift_retain_(swift::HeapObject*)
11.28 s searchInConformanceCache(swift::TargetMetadata<swift::InProcess> const*, swift::TargetProtocolDescriptor<swift::InProcess> const*)
10.46 s swift_dynamicCastImpl(swift::OpaqueValue*, swift::OpaqueValue*, swift::TargetMetadata<swift::InProcess> const*, swift::TargetMetadata<swift::InProcess> const*, swift::DynamicCastFlags)
So it looks like a lot of the time is going into memory management or the Swift runtime performing type checking. I have given a bunch of talks[1] on this topic; there's also a chapter in my iOS/macOS performance book[2], which I really recommend if you want to understand this particular topic. I did really fast XML[3][4], CSV[5] and binary plist parsers[6] for Cocoa, and also a fast JSON serialiser[7]. All of these are usually around an order of magnitude faster than their Apple equivalents.
Sadly, I haven't gotten around to doing a JSON parser. One reason for this is that parsing the JSON at character level is actually the smaller problem, performance-wise, same as for XML. Performance tends to be largely determined by what you create as a result. If you create generic Foundation/Swift dictionaries/arrays/etc., you have already lost. The overhead of these generic data structures completely overwhelms the cost of scanning a few bytes.
So you need something more akin to a streaming interface, and if you create objects you must create them directly, without generic temporary objects. This is where XML is easier, because it has an opening tag that you can use to determine what object to create. With JSON, you just get "{", so you have to know what structure level corresponds to what objects.
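As an illustrative sketch of that point (Python, with an invented `User` type and a deliberately minimal tokenizer — not the author's code): because every JSON object opens with "{", the parser must know from context which concrete type it is building, and a schema-aware decoder can go from tokens straight to that type without a generic intermediate dictionary tree.

```python
import re
from dataclasses import dataclass

@dataclass
class User:          # hypothetical target type, known ahead of time
    name: str
    age: int

# Minimal tokenizer: strings, integers, punctuation. Handles only the flat
# objects this sketch needs; a real decoder would cover the full grammar.
_TOKEN = re.compile(r'"((?:[^"\\]|\\.)*)"|(-?\d+)|([{}\[\]:,])')

def parse_user(text: str) -> User:
    fields = {}          # tiny scratch space for this one object, not a document tree
    key = None
    for m in _TOKEN.finditer(text):
        s, num, _punct = m.groups()
        if s is not None:
            if key is None:
                key = s               # first string at this level is the key
            else:
                fields[key] = s       # second string is a value
                key = None
        elif num is not None:
            fields[key] = int(num)
            key = None
    # Build the concrete object directly from the scanned fields.
    return User(name=fields["name"], age=fields["age"])
```

Because the schema is known, nothing generic outlives the scan: the tokens flow directly into the typed object.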
Maybe I should write that parser...
[1] https://www.google.com/search?hl=en&q=marcel%20weiher%20perf...
[2] https://www.amazon.com/gp/product/0321842847/
[3] https://github.com/mpw/Objective-XML
[4] https://blog.metaobject.com/2010/05/xml-performance-revisite...
[5] https://github.com/mpw/MPWFoundation/blob/master/Collections...
[6] https://github.com/mpw/MPWFoundation/blob/master/Collections...
[7] https://github.com/mpw/MPWFoundation/blob/master/Streams.sub...
I settled on a tabular-log format, which is streamed and, most of the time, immediately consumed, with no intermediate object structures.
Then, that "text vs binary" distinction became mostly moot. The binary is slightly more efficient, but grossly less readable, so there's no big gain unless you're operating at grand scale.
The intent was to open things but not publicize them at this stage but Hacker News seems to find stuff. Wouldn't surprise me if plenty of folks follow Daniel Lemire on Github as his stuff is always interesting.
I think the behavior of all the code that touches it is undefined (it breaks the ABI's calling convention), and while this often results in corrupted floating-point values in registers, maybe you won't see much if you are not using the FPU. Still, since the function is inline, the chances that it gets inlined somewhere where it could cause trouble seem high.
You might want to look into that.
Also, I wish this were all written in Rust; there is great portable SIMD support over there. It might make your life easier when trying to target other platforms.
EDIT: as burntsushi mentions below, that's not available in stable Rust, but if you want to squeeze the last ounce of performance out of the Rust compiler, chances are you won't be using that anyway.
If you review our codebase and find that, while we are not inadvertently using a 22-year-old SIMD extension, we still have undefined behavior, please write an issue on GitHub.
I'm admiring Rust from a distance at this stage. I am comfortable enough with writing bare intrinsics and slapping a giant #ifdef around stuff.
It's not stable yet. The only stable SIMD stuff Rust supports is access to the raw x86 vendor intrinsics.
If you or anyone else has some opinions on this, please let me know! I'd really like to learn how people do this type of analysis at scale.
One thing locally is that each file takes up a full block. So even if you only need 500 bytes of data in a file, and a block is 4kb, you've wasted 3.5kb of space and IO. Multiply that by a million and you're wasting gigabytes of space.
In S3, listing 12 million files takes 12 thousand HTTP requests (the max return is 1,000 items). So that would take two minutes if you assume it's 10ms per round trip. Let's say you wanted to read each file, and again each read takes 10ms... you're looking at 1.4 days. Obviously this can be parallelized, but when you look at the raw byte size this is a huge overhead, and this is just to read one day of data.
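A quick back-of-the-envelope check of those figures (Python, assuming the 10ms round trip, 1,000-item listing page, and 4kb block size from the comments above):

```python
files = 12_000_000

# S3 listing: at most 1,000 keys per ListObjects response.
list_requests = files // 1_000                 # 12,000 requests
list_seconds = list_requests * 0.010           # at 10ms each: 120 s, about two minutes

# Reading every file serially, one GET per file at 10ms each.
read_days = files * 0.010 / 86_400             # roughly 1.4 days

# Local block-size waste: 500 useful bytes in a 4,096-byte block, a million times.
wasted_gb = (4_096 - 500) * 1_000_000 / 1e9    # about 3.6 GB
```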
If you concatenate the files together to get a reasonable size and number of files, raw JSON on S3 is really powerful. Point Athena at it, and you just write SQL and it handles the rest, and it's serverless. But it does make single-row lookups more expensive (supplementing with DynamoDB could keep it serverless if single-row lookups are frequent).
Lots of optimizations will get improvements, like the Parquet format tobilg mentioned (binary and columnar), but anything with a decent file size will work.
The best way to not lose messages is to minimize the work done by your log receiver. So we did. It receives the uploaded log file chunk and appends it to a file, and that's it. The "file" is actually in a cloud storage system that's more-or-less like S3. When I explained this to someone, they asked why we didn't put it in a Bigtable-like thing or some other database, because isn't a filesystem kinda cheesy? No, it's not cheesy, it's simple. Simple things don't break.
It's local storage only, limited query capabilities depending on the DB, but should be extremely fast.
If you're doing "table scan" processing of entire datasets, sure, just-a-bunch-of-files would work too.
Databases can be surprisingly fast for things like that, since high performance file i/o is full of tricky/annoying stuff that databases have already optimized for.
I haven't used it, but they gave me a presentation on it, and it was very, very good.
They store data in S3 and use FoundationDB for indexes. You can feed it JSON and it'll index it and let you query it on a massive scale shockingly fast.
Obviously they are not aimed at small hobby projects, but if your project has money or a serious product behind it, depending on your needs it's well worth looking at.
On the S3 cheaper / smaller end you can batch up data daily / weekly etc. So the landing bucket acts as a queue that gets processed, creating daily batch files from the small files aggregated together. You can then take the daily batches to create weekly batches and so on, essentially partitioning. This will reduce the total number of files needed to query. If you use deterministic names based on how you plan to query, this can also reduce the number of files you need to list / parse. When batching / re-partitioning the data you can also use the Apache Parquet format to compress a little better + also import into some of the querying tools out there.
Unfortunately, the fragmentation of SIMD standards and various pitfalls in implementation (the much ballyhoo'ed "running AVX will make your processor clock to half its speed or something" exaggerations, for example) make a lot of people nervous about putting in the time to commit to developing expertise, which is a shame.
Something that can take generic grammar rules and turn them into a high-performance parsing engine.
It wouldn't have to support every possible grammar or option. JSON isn't that complex a language, but even a limited set of grammar options in exchange for a performant parser could benefit a very large set of problems.
We'd like to have some more examples of formats people care about - I'm interested in generalizing this work. So if you want to follow up with more detail, please do.
Kudos on some incredible work! :)
By the way, nativejson-benchmark (from RapidJson) has a nice conformance checker that tries various corner cases. But you probably know it.
We use RapidJSON in the high-performance mode not the funky mode that minimizes FP error (which is some astounding work - I had no idea that strtof was so involved!). Number conversion is not our #1 focus - doing it well is nice, but all implementations have access to the same FP tricks, so you don't really learn much by going wild on this aspect.
At least, you don't unless FP conversion is your focus, in which case you should share your FP conversion code with everyone!
I'm interested in this: some aspects of our very serial 'stage 2' (the parsing step) could be made parallel. This would be very interesting. Unfortunately I personally cannot be made parallel, so working on this needs to go into a big queue with a lot of other work.
I don't think it would be hard at all; it would just be extra effort that wasn't needed to run obvious comparisons.
I can't speak for Daniel's motivation.
I don't think either of us know much about android - not enough to do that. But an ARM port is very interesting.
Since I'm no longer an Intel employee I don't see why I shouldn't skill up and do a Neon port (I got interested in SVE, but since ARM doesn't seem to want to bother releasing cores that run SVE, I'm not going to go too far down that path right now). Neon, on the other hand, is in tons of places. As far as I know all the required permutes, carryless multiplies and various other SIMD bits and pieces are there on Neon. So it's a simple matter of porting.
Though this job is trickier than it may look. The logic to extract the "relevant" bits is often dynamic or tied to user input, but for the scanner/lexer to be ultrafast it has to be tightly compiled. You can try JITting, but libllvm is probably too heavyweight for parsing JSON.
JIT approaches make a lot of sense for lex/yacc and their numerous descendants, as these typically need to put a lot of extra logic into the process of parsing. You don't need to JIT just to look up some strings and/or parse a fairly simple hierarchical structure.
I think the problem is that to extract arbitrary keys, you really need to parse the whole thing, although you don't need to materialize nodes for the whole thing.
But if you have big JSON with a given schema, you may be able to skip things lexically. You basically need to count {} and [], while taking into account " and \ within quoted strings.
That doesn't seem too hard. I think a tiny bit of http://re2c.org/ could do a good job of it.
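A minimal hand-rolled sketch of that lexical-skipping idea (Python, not re2c-generated): track {}/[] depth while honouring quoted strings and backslash escapes, and return where the value starting at `start` ends, without materializing any nodes.

```python
def value_span(text: str, start: int) -> int:
    """Return the index just past the {...} or [...] value beginning at `start`."""
    depth = 0
    in_string = False
    escaped = False
    i = start
    while i < len(text):
        c = text[i]
        if in_string:
            # Inside a quoted string: only escapes and the closing quote matter.
            if escaped:
                escaped = False
            elif c == "\\":
                escaped = True
            elif c == '"':
                in_string = False
        else:
            if c == '"':
                in_string = True
            elif c in "{[":
                depth += 1
            elif c in "}]":
                depth -= 1
                if depth == 0:
                    return i + 1   # end of the value (exclusive)
        i += 1
    raise ValueError("unterminated value")
```

Note that a `}` inside a quoted string (as in `"}"`) is correctly ignored, which is exactly the part a naive brace counter gets wrong.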
https://gitlab.com/philbooth/bfj
The specific function of interest here is `bfj.match`, which takes a readable stream and a selector as arguments:
https://gitlab.com/philbooth/bfj#how-do-i-selectively-parse-...
It still walks the full tree like a regular parser, but just avoids creating any data items unless the selector matches. Though there is an outstanding issue to support JSONPath in the selector, currently it only matches individual keys and values.
The Morning Paper’s writeup[2] from last year provides a good summary
[1]: http://www.vldb.org/pvldb/vol11/p1576-palkar.pdf [2]: https://blog.acolyer.org/2018/08/20/filter-before-you-parse-...
Parsing the entire document lock, stock and barrel is an easier thing to write about and benchmark. The problem with skipping around and pulling out bits of JSON in a benchmarking framework is that attempting to present such data often amounts to "hey, we asked ourselves a question and then we got a really good answer for it!". It's hard to picture what a 'typical' query for some field over a JSON document would look like. Conversely, it's pretty easy to know when you've finished parsing the Whole Thing.
[1]: https://github.com/circe/circe-fs2/blob/master/README.md
Publishing results against this could be useful both for assessing how good this parser is and establishing and documenting any known issues. If correctness is not a goal, this can still be fine but finding out your parser of choice doesn't handle common json emitted by other systems can be annoying.
Regarding the numbers, I've run into a few cases where Jackson being able to parse BigIntegers and BigDecimals was very useful to me. Silently rounding to doubles or floats can be lossy, and failing on some documents just because the value exceeds max long/int can be an issue as well.
I've lost count of the broken JSON parsers that all fall for that.
Edit: to be fair, they handle a couple of other things which many similar libraries ignore. I particularly like the support for full 64-bit integers. And at least they document their limitation on NULL bytes.
Please add an issue on Github.
Edit: I went ahead and added an issue. Seems like something we should fix.
You could store them in some binary format, but the API response format changed over the years with various fields being added and removed, and either your binary format ends up not much better than JSON or you end up reencoding old comments because the API changed.
It’s a great way to store big JSON files where you only want to access a subset of data very quickly and not load the whole file into memory.
Those are other options too, eg, storing the schema separately from the records (then batching records with identical schemas in compact binary files) and defining migration rules between different schemas (eg, if schema A has required field "foo" while schema B has required field "foo" and optional field "bar" then data which follows schema A can be trivially migrated to schema B at read time without needing to reencode on disk).
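A toy sketch of that read-time migration idea (Python; the schema names and the "foo"/"bar" fields come from the example above, everything else is invented): records stay on disk in their original encoding, and missing optional fields are defaulted only when read.

```python
# Hypothetical schema registry: each version lists its required and optional fields.
SCHEMAS = {
    "A": {"required": ["foo"], "optional": []},
    "B": {"required": ["foo"], "optional": ["bar"]},
}

def migrate(record: dict, to_schema: str) -> dict:
    """Upgrade a record to `to_schema` at read time, without re-encoding on disk."""
    out = dict(record)
    for field in SCHEMAS[to_schema]["optional"]:
        out.setdefault(field, None)   # absent optional fields read as null
    missing = [f for f in SCHEMAS[to_schema]["required"] if f not in out]
    if missing:
        raise ValueError(f"cannot migrate: missing required fields {missing}")
    return out
```

So a record written under schema A is served under schema B by the reader alone; nothing stored under the old schema ever needs rewriting.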
My immediate thought is to compare it to rapidjson, which I've used before. The paradigm of mutating iterators seems awkward at first but should be just as powerful as rapidjson's Value. For example, both approaches end up doing a linear scan to find an object member by name.
The fact that rapidjson supports mutation of Values and simdjson does not has huge implications (as mentioned in the simdjson README scope section), I suspect this tradeoff explains most of the performance differences as I know rapidjson also uses simd internally.
Methods like this are used for batch search / summation where only a fraction of the parsed data is actually relevant during any particular run. You'll find similar approaches used in e.g. the row-format parser of a database like MongoDB or Postgres.
I'm not sure AVX2 is as ubiquitous as the README says: "We assume AVX2 support which is available in all recent mainstream x86 processors produced by AMD and Intel."
I guess "mainstream" is somewhat subjective, but some recent Chromebooks have Celeron processors with no AVX2:
https://us-store.acer.com/chromebook-14-cb3-431-c5fm
https://ark.intel.com/products/91831/Intel-Celeron-Processor...
Firebase backups are huge JSON files and we haven’t found a good way to deal with them.
There are some “streaming JSON parsers” that we have wrestled with but they are buggy.
However, they have the ability to build a tape out of the JSON and find the interesting marks. Perhaps it can be adapted to make a fast parser that only parses the relevant stuff but zooms through the large file in blocks.
Terabytes of "big data" get passed around as CSV.
But the goal of my message is not to tease you with unavailable code. It's just to say that it is a lot simpler to write a CSV parser than a JSON parser.
So, do not hesitate to write one yourself! It's easy and a nice way to introduce yourself to SIMD instructions.
Or, perhaps a more common scenario today, it was designed by people who simply had no knowledge of binary protocols or efficiency at all --- not too long ago I had to deal with an API which returned a binary file, but instead of simply sending the bytes directly, it decided to send a JSON object containing one array, whose elements were strings, and each string was... a hex digit. Instead of sending "Hello world" it would send '{"data":["4","8"," ","6","5"," ","6","C"," " ... '
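For illustration, undoing that format is straightforward (Python sketch; the `data` key comes from the example above, the function name is invented): every two hex-digit elements form one byte, with the stray single-space elements discarded.

```python
import json

def decode_hex_digit_json(payload: str) -> bytes:
    """Recover the original bytes from the one-hex-digit-per-element format."""
    digits = json.loads(payload)["data"]
    # Drop the space "separator" elements, then join pairs of hex digits.
    digits = [d for d in digits if d.strip()]
    return bytes(int(hi + lo, 16) for hi, lo in zip(digits[0::2], digits[1::2]))
```

Which, of course, only underlines the point: the wire format inflates each byte to roughly a dozen bytes of JSON.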
He's a professor, but his work is highly applied and immediately usable. He manages to find and demonstrate a lot of code where we assume the big-O performance, but the reality of modern processors and caching (etc.) means very different performance in practice.
Really? I thought the specifications diverged long enough ago (though using those extras could be discouraged in some cases).
Kudos to Douglas Crockford for keeping it simple. I wish more standards committees would take a cue from him. (Looking at ECMAScript [2] and C++.)
There's been a tremendous amount of growth and value around JSON precisely because it's so simple and easy to implement.
People complain about the lack of comments and trailing commas, but I think those would really be expanding on the initial use case of JSON, and the benefit isn't worth the cost of change. JSON does some things super well, other things marginally well, and some not at all, and that's working as intended.
You can always make something separate to cover those use cases, and that seems to have happened with TOML and so forth.
(I recall there was an RFC that cleaned up ambiguities in Crockford's web page, but it just clarified things. No new features were added. So JSON is still as much of a subset of JavaScript as it ever was. On the other hand, JavaScript itself has grown wildly out of control.)
[1] http://json.org/
> Although Douglas Crockford originally asserted that JSON is a strict subset of JavaScript, his specification actually allows valid JSON documents that are invalid JavaScript. Specifically, JSON allows the Unicode line terminators U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR to appear unescaped in quoted strings, while ECMAScript 2018 and older does not.
This is another useful resource, discussed here already - http://seriot.ch/parsing_json.php - which lists relevant standards. But "the" standard is static, so the divergence, if any, is between other standards (different from json.org) and evolving JavaScript.
Yeah, I don't think JSON should include those things. I think the lack of comments makes JSON a poor format for config files, but that just means you should use another format for config files. JSON is good for machine-to-machine communication.
We had a lot of user-supplied data in the strings of our API responses, some of it copied from Word documents and riddled with U+2028 and U+2029 whitespace. Turns out that on iOS, the trigger.io library makes the all-too-popular assumption that any well-formatted JSON can be interpreted as JS, and parses the responses with "eval", thus turning all those Unicode characters _within JSON strings_ into newlines!
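The underlying mismatch is easy to demonstrate (Python sketch): U+2028 is legal unescaped inside a JSON string, while before ES2019 the same character was a line terminator — and therefore a syntax error — inside a JavaScript string literal, which is exactly what breaks eval-based "parsing".

```python
import json

# The payload below contains a literal, unescaped U+2028 inside the string.
# Per the JSON grammar this is fine: only ", \ and control characters
# U+0000..U+001F must be escaped. A conforming JSON parser accepts it.
payload = '{"text": "line one\u2028line two"}'
parsed = json.loads(payload)["text"]
```

Feeding the same bytes to a pre-ES2019 JavaScript `eval` would instead split the string literal across a line terminator and fail (or mangle the value), which is the bug described above.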
The people complaining about dependency management in Python should try doing it in C++; there seems to be half a dozen competing ones. And three times as many build systems.
It's a hammer on rocket fuel.
Though from the readme on that module the dev says "it turns out that you’re better off using the normal Node.js/V8 implementation unless you’re operating on huge JSON.
... the bridging from V8 to C++ is a bit too costly at this stage."
It's possible to do SWAR (SIMD Within A Register) tricks to try to substitute, but on a 32-bit processor (or even a 64-bit processor) I doubt our techniques would look good. In Hyperscan, my regex project, we used SWAR for simple things (character scans) but I doubt that simdjson would work well if you tried to make it into swarjson. :-)
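For illustration, here is the classic SWAR zero-byte trick emulated in Python on 64-bit integers (a sketch of the kind of simple character scan described, not Hyperscan's actual code): to ask whether any of 8 packed bytes equals a target character, XOR the target into every byte lane and test for a zero byte.

```python
LOW  = 0x0101010101010101   # 0x01 in every byte lane
HIGH = 0x8080808080808080   # 0x80 in every byte lane
MASK = (1 << 64) - 1        # keep results in 64 bits, like a real register

def has_byte(chunk: bytes, target: int) -> bool:
    """True if any of the 8 bytes in `chunk` equals `target`."""
    assert len(chunk) == 8
    v = int.from_bytes(chunk, "little")
    v ^= target * LOW                    # lanes equal to `target` become zero
    # Classic zero-byte test: a high bit survives iff some lane is zero.
    return bool(((v - LOW) & ~v & HIGH) & MASK)
```

In C this is a handful of register operations per 8 bytes, which is why SWAR works for simple scans; the permutes and carryless multiplies simdjson leans on have no such cheap register-only substitute.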