That said, it also includes an event-based parser (called JSONDecoder), so if you want to handle events in order to decode into your own data structure and skip the intermediate JSON data structure, you might be able to go faster than JSONSerialization that way.
> JSON. The protocol for front-end / back-end communication, as well as between the back-end and plug-ins, is based on simple JSON messages. I considered binary formats, but the actual improvement in performance would be completely in the noise. Using JSON considerably lowers friction for developing plug-ins, as it’s available out of the box for most modern languages, and there are plenty of libraries available for the rest.
1: https://github.com/xi-editor/xi-editor/blob/master/README.md...
19.98 s swift_getGenericMetadata
19.15 s newJSONString
16.17 s objc_msgSend
15.33 s _swift_release_(swift::HeapObject*)
14.45 s tiny_malloc_should_clear
12.81 s _swift_retain_(swift::HeapObject*)
11.28 s searchInConformanceCache(swift::TargetMetadata<swift::InProcess> const*, swift::TargetProtocolDescriptor<swift::InProcess> const*)
10.46 s swift_dynamicCastImpl(swift::OpaqueValue*, swift::OpaqueValue*, swift::TargetMetadata<swift::InProcess> const*, swift::TargetMetadata<swift::InProcess> const*, swift::DynamicCastFlags)
So it looks like a lot of the time is going into memory management or the Swift runtime performing type checking. I have given a bunch of talks[1] on this topic; there's also a chapter in my iOS/macOS performance book[2], which I really recommend if you want to understand this particular topic. I did really fast XML[3][4], CSV[5] and binary plist parsers[6] for Cocoa, and also a fast JSON serialiser[7]. All of these are usually around an order of magnitude faster than their Apple equivalents.
Sadly, I haven't gotten around to doing a JSON parser. One reason for this is that parsing the JSON at character level is actually the smaller problem, performance-wise, same as for XML. Performance tends to be largely determined by what you create as a result. If you create generic Foundation/Swift dictionaries/arrays/etc., you have already lost. The overhead of these generic data structures completely overwhelms the cost of scanning a few bytes.
So you need something more akin to a streaming interface, and if you create objects you must create them directly, without generic temporary objects. This is where XML is easier, because it has an opening tag that you can use to determine what object to create. With JSON, you just get "{", so you have to know what structure level corresponds to what objects.
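As an illustrative sketch of that point (Python, with an invented `User` type and a deliberately minimal tokenizer — not the author's code): because every JSON object opens with "{", the parser must know from context which concrete type it is building, and a schema-aware decoder can go from tokens straight to that type without a generic intermediate dictionary tree.

```python
import re
from dataclasses import dataclass

@dataclass
class User:          # hypothetical target type, known ahead of time
    name: str
    age: int

# Minimal tokenizer: strings, integers, punctuation. Handles only the flat
# objects this sketch needs; a real decoder would cover the full grammar.
_TOKEN = re.compile(r'"((?:[^"\\]|\\.)*)"|(-?\d+)|([{}\[\]:,])')

def parse_user(text: str) -> User:
    fields = {}          # tiny scratch space for this one object, not a document tree
    key = None
    for m in _TOKEN.finditer(text):
        s, num, _punct = m.groups()
        if s is not None:
            if key is None:
                key = s               # first string at this level is the key
            else:
                fields[key] = s       # second string is a value
                key = None
        elif num is not None:
            fields[key] = int(num)
            key = None
    # Build the concrete object directly from the scanned fields.
    return User(name=fields["name"], age=fields["age"])
```

Because the schema is known, nothing generic outlives the scan: the tokens flow directly into the typed object.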
Maybe I should write that parser...
[1] https://www.google.com/search?hl=en&q=marcel%20weiher%20perf...
[2] https://www.amazon.com/gp/product/0321842847/
[3] https://github.com/mpw/Objective-XML
[4] https://blog.metaobject.com/2010/05/xml-performance-revisite...
[5] https://github.com/mpw/MPWFoundation/blob/master/Collections...
[6] https://github.com/mpw/MPWFoundation/blob/master/Collections...
[7] https://github.com/mpw/MPWFoundation/blob/master/Streams.sub...
I settled on a tabular-log format, which is streamed and, most of the time, immediately consumed, with no intermediate object structures.
Then, that "text vs binary" distinction became mostly moot. The binary is slightly more efficient, but grossly less readable, so there's no big gain unless you're operating at grand scale.
The intent was to open things but not publicize them at this stage but Hacker News seems to find stuff. Wouldn't surprise me if plenty of folks follow Daniel Lemire on Github as his stuff is always interesting.
I think the behavior of all the code that touches it is undefined (it breaks the ABI's calling convention), and while this often results in corrupted floating-point values in registers, maybe you won't see much if you are not using the FPU. Still, since the function is inline, the chances that it gets inlined somewhere where it could cause trouble seem high.
You might want to look into that.
Also, I wish this were all written in Rust; there is great portable SIMD support over there. It might make your life easier when trying to target other platforms.
EDIT: as burntsushi mentions below, that's not available in stable Rust, but if you want to squeeze the last ounce of performance out of the Rust compiler, chances are you won't be using that anyway.
If you review our codebase and find that, while we are not inadvertently using a 22-year-old SIMD extension, we still have undefined behavior, please write an issue on GitHub.
I'm admiring Rust from a distance at this stage. I am comfortable enough with writing bare intrinsics and slapping a giant #ifdef around stuff.
It's not stable yet. The only stable SIMD stuff Rust supports is access to the raw x86 vendor intrinsics.
If you or anyone else has some opinions on this, please let me know! I'd really like to learn how people do this type of analysis at scale.
One thing locally is that each file takes up a full block. So even if you only need 500 bytes of data in a file, and a block is 4kb, you've wasted 3.5kb of space and IO. Multiply that by a million and you're wasting gigabytes of space.
In S3, listing 12 million files takes 12 thousand HTTP requests (the max return is 1,000 items). So that would take two minutes if you assume it's 10ms per round trip. Let's say you wanted to read each file, and again each read takes 10ms... you're looking at 1.4 days. Obviously this can be parallelized, but when you look at the raw byte size this is a huge overhead, and this is just to read one day of data.
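A quick back-of-the-envelope check of those figures (Python, assuming the 10ms round trip, 1,000-item listing page, and 4kb block size from the comments above):

```python
files = 12_000_000

# S3 listing: at most 1,000 keys per ListObjects response.
list_requests = files // 1_000                 # 12,000 requests
list_seconds = list_requests * 0.010           # at 10ms each: 120 s, about two minutes

# Reading every file serially, one GET per file at 10ms each.
read_days = files * 0.010 / 86_400             # roughly 1.4 days

# Local block-size waste: 500 useful bytes in a 4,096-byte block, a million times.
wasted_gb = (4_096 - 500) * 1_000_000 / 1e9    # about 3.6 GB
```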
If you concatenate the files together to get a reasonable size and number of files, raw JSON on S3 is really powerful. Point Athena at it, and you just write SQL and it handles the rest, and it's serverless. But it does make single-row lookups more expensive (supplementing with DynamoDB could keep it serverless if single-row lookups are frequent).
Lots of optimizations will get improvements, like the Parquet format tobilg mentioned (binary and columnar), but anything with a decent file size will work.
The best way to not lose messages is to minimize the work done by your log receiver. So we did. It receives the uploaded log file chunk and appends it to a file, and that's it. The "file" is actually in a cloud storage system that's more-or-less like S3. When I explained this to someone, they asked why we didn't put it in a Bigtable-like thing or some other database, because isn't a filesystem kinda cheesy? No, it's not cheesy, it's simple. Simple things don't break.
It's local storage only, limited query capabilities depending on the DB, but should be extremely fast.
If you're doing "table scan" processing of entire datasets, sure, just-a-bunch-of-files would work too.
Databases can be surprisingly fast for things like that, since high performance file i/o is full of tricky/annoying stuff that databases have already optimized for.
I haven't used it, but they gave me a presentation on it, and it was very, very good.
They store data in S3 and use FoundationDB for indexes. You can feed it JSON and it'll index it and let you query it on a massive scale shockingly fast.
Obviously they are not aimed at small hobby projects, but if your project has money or a serious product behind it, depending on your needs it's well worth looking at.
On the S3 cheaper / smaller end you can batch up data daily / weekly etc. So the landing bucket acts as a queue that gets processed, creating daily batch files from the small files aggregated together. You can then take the daily batches to create weekly batches and so on, essentially partitioning. This will reduce the total number of files needed to query. If you use deterministic names based on how you plan to query, this can also reduce the number of files you need to list / parse. When batching / re-partitioning the data you can also use the Apache Parquet format to compress a little better + also import into some of the querying tools out there.
Unfortunately, the fragmentation of SIMD standards and various pitfalls in implementation (the much ballyhoo'ed "running AVX will make your processor clock to half its speed or something" exaggerations, for example) make a lot of people nervous about putting in the time to commit to developing expertise, which is a shame.
Something that can take generic grammar rules and turn them into a high-performance parsing engine.
It wouldn't have to support every possible grammar or option. JSON isn't that complex a language, but even a limited set of grammar options in exchange for a performant parser could benefit a very large set of problems.
We'd like to have some more examples of formats people care about - I'm interested in generalizing this work. So if you want to follow up with more detail, please do.
Kudos on some incredible work! :)
By the way, nativejson-benchmark (from RapidJson) has a nice conformance checker that tries various corner cases. But you probably know it.
We use RapidJSON in the high-performance mode not the funky mode that minimizes FP error (which is some astounding work - I had no idea that strtof was so involved!). Number conversion is not our #1 focus - doing it well is nice, but all implementations have access to the same FP tricks, so you don't really learn much by going wild on this aspect.
At least, you don't unless FP conversion is your focus, in which case you should share your FP conversion code with everyone!
I'm interested in this: some aspects of our very serial 'stage 2' (the parsing step) could be made parallel. This would be very interesting. Unfortunately I personally cannot be made parallel, so working on this needs to go into a big queue with a lot of other work.
I don't think it would be hard at all; it would just be extra effort that wasn't needed to run obvious comparisons.
I can't speak for Daniel's motivation.
I don't think either of us know much about android - not enough to do that. But an ARM port is very interesting.
Since I'm no longer an Intel employee I don't see why I shouldn't skill up and do a Neon port (I got interested in SVE, but since ARM doesn't seem to want to bother releasing cores that run SVE, I'm not going to go too far down that path right now). Neon, on the other hand, is in tons of places. As far as I know all the required permutes, carryless multiplies and various other SIMD bits and pieces are there on Neon. So it's a simple matter of porting.
Though this job is trickier than it may look. The logic to extract the "relevant" bits is often dynamic or tied to user input, but for the scanner/lexer to be ultrafast it has to be tightly compiled. You can try JITting, but libllvm is probably too heavyweight for parsing JSON.
JIT approaches make a lot of sense for lex/yacc and their numerous descendants, as these typically need to put a lot of extra logic into the process of parsing. You don't need to JIT just to look up some strings and/or parse a fairly simple hierarchical structure.
I think the problem is that to extract arbitrary keys, you really need to parse the whole thing, although you don't need to materialize nodes for the whole thing.
But if you have big JSON with a given schema, you may be able to skip things lexically. You basically need to count {} and [], while taking into account " and \ within quoted strings.
That doesn't seem too hard. I think a tiny bit of http://re2c.org/ could do a good job of it.
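A minimal hand-rolled sketch of that lexical-skipping idea (Python, not re2c-generated): track {}/[] depth while honouring quoted strings and backslash escapes, and return where the value starting at `start` ends, without materializing any nodes.

```python
def value_span(text: str, start: int) -> int:
    """Return the index just past the {...} or [...] value beginning at `start`."""
    depth = 0
    in_string = False
    escaped = False
    i = start
    while i < len(text):
        c = text[i]
        if in_string:
            # Inside a quoted string: only escapes and the closing quote matter.
            if escaped:
                escaped = False
            elif c == "\\":
                escaped = True
            elif c == '"':
                in_string = False
        else:
            if c == '"':
                in_string = True
            elif c in "{[":
                depth += 1
            elif c in "}]":
                depth -= 1
                if depth == 0:
                    return i + 1   # end of the value (exclusive)
        i += 1
    raise ValueError("unterminated value")
```

Note that a `}` inside a quoted string (as in `"}"`) is correctly ignored, which is exactly the part a naive brace counter gets wrong.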
https://gitlab.com/philbooth/bfj
The specific function of interest here is `bfj.match`, which takes a readable stream and a selector as arguments:
https://gitlab.com/philbooth/bfj#how-do-i-selectively-parse-...
It still walks the full tree like a regular parser, but just avoids creating any data items unless the selector matches. Though there is an outstanding issue to support JSONPath in the selector, currently it only matches individual keys and values.
The Morning Paper’s writeup[2] from last year provides a good summary
[1]: http://www.vldb.org/pvldb/vol11/p1576-palkar.pdf [2]: https://blog.acolyer.org/2018/08/20/filter-before-you-parse-...
Parsing the entire document lock, stock and barrel is an easier thing to write about and benchmark. The problem with skipping around and pulling out bits of JSON in a benchmarking framework is that attempting to present such data often amounts to "hey, we asked ourselves a question and then we got a really good answer for it!". It's hard to picture what a 'typical' query for some field over a JSON document would look like. Conversely, it's pretty easy to know when you've finished parsing the Whole Thing.
[1]: https://github.com/circe/circe-fs2/blob/master/README.md
Publishing results against this could be useful both for assessing how good this parser is and establishing and documenting any known issues. If correctness is not a goal, this can still be fine but finding out your parser of choice doesn't handle common json emitted by other systems can be annoying.
Regarding the numbers, I've run into a few cases where Jackson being able to parse BigIntegers and BigDecimals was very useful to me. Silently rounding to doubles or floats can be lossy, and failing on some documents just because the value exceeds max long/int can be an issue as well.
I've lost count of the broken JSON parsers that all fall for that.
Edit: to be fair, they handle a couple of other things which many similar libraries ignore. I particularly like the support for full 64-bit integers. And at least they document their limitation on NULL bytes.
Please add an issue on Github.
Edit: I went ahead and added an issue. Seems like something we should fix.
You could store them in some binary format, but the API response format changed over the years with various fields being added and removed, and either your binary format ends up not much better than JSON or you end up reencoding old comments because the API changed.
It’s a great way to store big JSON files where you only want to access a subset of data very quickly and not load the whole file into memory.
Those are other options too, eg, storing the schema separately from the records (then batching records with identical schemas in compact binary files) and defining migration rules between different schemas (eg, if schema A has required field "foo" while schema B has required field "foo" and optional field "bar" then data which follows schema A can be trivially migrated to schema B at read time without needing to reencode on disk).
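A toy sketch of that read-time migration idea (Python; the schema names and the "foo"/"bar" fields come from the example above, everything else is invented): records stay on disk in their original encoding, and missing optional fields are defaulted only when read.

```python
# Hypothetical schema registry: each version lists its required and optional fields.
SCHEMAS = {
    "A": {"required": ["foo"], "optional": []},
    "B": {"required": ["foo"], "optional": ["bar"]},
}

def migrate(record: dict, to_schema: str) -> dict:
    """Upgrade a record to `to_schema` at read time, without re-encoding on disk."""
    out = dict(record)
    for field in SCHEMAS[to_schema]["optional"]:
        out.setdefault(field, None)   # absent optional fields read as null
    missing = [f for f in SCHEMAS[to_schema]["required"] if f not in out]
    if missing:
        raise ValueError(f"cannot migrate: missing required fields {missing}")
    return out
```

So a record written under schema A is served under schema B by the reader alone; nothing stored under the old schema ever needs rewriting.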
My immediate thought is to compare it to rapidjson, which I've used before. The paradigm of mutating iterators seems awkward at first but should be just as powerful as rapidjson's Value. For example, both approaches end up doing a linear scan to find an object member by name.
The fact that rapidjson supports mutation of Values and simdjson does not has huge implications (as mentioned in the simdjson README scope section), I suspect this tradeoff explains most of the performance differences as I know rapidjson also uses simd internally.
Methods like this are used for batch search / summation where only a fraction of the parsed data is actually relevant during any particular run. You'll find similar approaches used in e.g. the row-format parser of a database like MongoDB or Postgres.
I'm not sure AVX2 is as ubiquitous as the README says: "We assume AVX2 support which is available in all recent mainstream x86 processors produced by AMD and Intel."
I guess "mainstream" is somewhat subjective, but some recent Chromebooks have Celeron processors with no AVX2:
https://us-store.acer.com/chromebook-14-cb3-431-c5fm
https://ark.intel.com/products/91831/Intel-Celeron-Processor...
Firebase backups are huge JSON files and we haven’t found a good way to deal with them.
There are some “streaming JSON parsers” that we have wrestled with but they are buggy.
However, they have the ability to build a tape out of the JSON and find the interesting marks. Perhaps it can be adapted to make a fast parser that only parses the relevant stuff but zooms through the large file in blocks.
Terabytes of "big data" get passed around as CSV.
But the goal of my message is not to tease you with unavailable code. It's just to say that it is a lot simpler to write a CSV parser than a JSON parser.
So, do not hesitate to write one yourself! It's easy and a nice way to introduce yourself to SIMD instructions.
Or, perhaps a more common scenario today, it was designed by people who simply had no knowledge of binary protocols or efficiency at all --- not too long ago I had to deal with an API which returned a binary file, but instead of simply sending the bytes directly, it decided to send a JSON object containing one array, whose elements were strings, and each string was... a hex digit. Instead of sending "Hello world" it would send '{"data":["4","8"," ","6","5"," ","6","C"," " ... '
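For illustration, undoing that format is straightforward (Python sketch; the `data` key comes from the example above, the function name is invented): every two hex-digit elements form one byte, with the stray single-space elements discarded.

```python
import json

def decode_hex_digit_json(payload: str) -> bytes:
    """Recover the original bytes from the one-hex-digit-per-element format."""
    digits = json.loads(payload)["data"]
    # Drop the space "separator" elements, then join pairs of hex digits.
    digits = [d for d in digits if d.strip()]
    return bytes(int(hi + lo, 16) for hi, lo in zip(digits[0::2], digits[1::2]))
```

Which, of course, only underlines the point: the wire format inflates each byte to roughly a dozen bytes of JSON.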
He's a professor, but his work is highly applied and immediately usable. He manages to find and demonstrate a lot of code where we assume the big-O performance, but the reality of modern processors and caching (etc.) means very different performance in practice.
Really? I thought the specifications diverged long enough ago (though using those extras could be discouraged in some cases).
Kudos to Douglas Crockford for keeping it simple. I wish more standards committees would take a cue from him. (Looking at ECMAScript [2] and C++.)
There's been a tremendous amount of growth and value around JSON precisely because it's so simple and easy to implement.
People complain about the lack of comments and trailing commas, but I think those would really be expanding on the initial use case of JSON, and the benefit isn't worth the cost of change. JSON does some things super well, other things marginally well, and some not at all, and that's working as intended.
You can always make something separate to cover those use cases, and that seems to have happened with TOML and so forth.
(I recall there was an RFC that cleaned up ambiguities in Crockford's web page, but it just clarified things. No new features were added. So JSON is still as much of a subset of JavaScript as it ever was. On the other hand, JavaScript itself has grown wildly out of control.)
[1] http://json.org/
> Although Douglas Crockford originally asserted that JSON is a strict subset of JavaScript, his specification actually allows valid JSON documents that are invalid JavaScript. Specifically, JSON allows the Unicode line terminators U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR to appear unescaped in quoted strings, while ECMAScript 2018 and older does not.
This is another useful resource, discussed here already - http://seriot.ch/parsing_json.php - which lists relevant standards. But "the" standard is static, so the divergence, if any, is between other standards (different from json.org) and evolving JavaScript.
Yeah, I don't think JSON should include those things. I think the lack of comments makes JSON a poor format for config files, but that just means you should use another format for config files. JSON is good for machine-to-machine communication.
We had a lot of user-supplied data in the strings of our API responses, some of it copied from Word documents and riddled with U+2028 and U+2029 whitespace. Turns out that on iOS, the trigger.io library makes the all-too-popular assumption that any well-formatted JSON can be interpreted as JS, and parses the responses with "eval", thus turning all those Unicode characters _within JSON strings_ into newlines!
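The underlying mismatch is easy to demonstrate (Python sketch): U+2028 is legal unescaped inside a JSON string, while before ES2019 the same character was a line terminator — and therefore a syntax error — inside a JavaScript string literal, which is exactly what breaks eval-based "parsing".

```python
import json

# The payload below contains a literal, unescaped U+2028 inside the string.
# Per the JSON grammar this is fine: only ", \ and control characters
# U+0000..U+001F must be escaped. A conforming JSON parser accepts it.
payload = '{"text": "line one\u2028line two"}'
parsed = json.loads(payload)["text"]
```

Feeding the same bytes to a pre-ES2019 JavaScript `eval` would instead split the string literal across a line terminator and fail (or mangle the value), which is the bug described above.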
The people complaining about dependency management in Python should try doing it in C++; there seems to be half a dozen competing ones. And three times as many build systems.
It's a hammer on rocket fuel.
Though from the readme on that module the dev says "it turns out that you’re better off using the normal Node.js/V8 implementation unless you’re operating on huge JSON.
... the bridging from V8 to C++ is a bit too costly at this stage."
It's possible to do SWAR (SIMD Within A Register) tricks to try to substitute, but on a 32-bit processor (or even a 64-bit processor) I doubt our techniques would look good. In Hyperscan, my regex project, we used SWAR for simple things (character scans) but I doubt that simdjson would work well if you tried to make it into swarjson. :-)
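For illustration, here is the classic SWAR zero-byte trick emulated in Python on 64-bit integers (a sketch of the kind of simple character scan described, not Hyperscan's actual code): to ask whether any of 8 packed bytes equals a target character, XOR the target into every byte lane and test for a zero byte.

```python
LOW  = 0x0101010101010101   # 0x01 in every byte lane
HIGH = 0x8080808080808080   # 0x80 in every byte lane
MASK = (1 << 64) - 1        # keep results in 64 bits, like a real register

def has_byte(chunk: bytes, target: int) -> bool:
    """True if any of the 8 bytes in `chunk` equals `target`."""
    assert len(chunk) == 8
    v = int.from_bytes(chunk, "little")
    v ^= target * LOW                    # lanes equal to `target` become zero
    # Classic zero-byte test: a high bit survives iff some lane is zero.
    return bool(((v - LOW) & ~v & HIGH) & MASK)
```

In C this is a handful of register operations per 8 bytes, which is why SWAR works for simple scans; the permutes and carryless multiplies simdjson leans on have no such cheap register-only substitute.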