Protocol Buffers (and I think Thrift, and maybe Avro) are sort of like C or C++: you declare your types ahead of time, and then you take some binary payload and "cast" it (parse it, actually) into your predefined type. If those bytes weren't actually serialized as that type, you'll get garbage. On the plus side, the fact that you declared your types statically means that you get lots of useful compile-time checking and everything is really efficient. It's also nice because you can use the schema file (i.e., .proto files) to declare your schema formally and document everything.
JSON and Ion are more like a Python/JavaScript object/dict. Objects are just attribute-value bags. If you say it has field fooBar at runtime, now it does! When you parse, you don't have to know what message type you are expecting, because the key names are all encoded on the wire. On the downside, if you misspell a key name, nothing is going to warn you about it. And things aren't quite as efficient, because the general representation has to be a hash map where every value is dynamically typed. On the plus side, you never have to worry about losing your schema file.
I think this is a case where "strongly typed" isn't the clearest way to think about it. It's "statically typed" vs. "dynamically typed" that is the useful distinction.
{"start": "2007-03-01"}
Is that a timestamp? Maybe! Does it support a time within the day? Perhaps I can write "2007-03-01T13:00:00" in ISO 8601 format if we're lucky. Can I supply a time zone? Who knows for sure? It's weakly typed data. The actual specification of that field's type lives in a layer on top of JSON, if it's specified at all. It might be "specified" only in terms of what the applications that handle it can parse and generate. I could drop that value into Excel and treat it as all sorts of different things.

Ion, by comparison, has a specific data type for timestamps defined in the spec [1]. The timestamp has a canonical representation in both text and binary form. For this reason, I know that "2007-02-23T20:14:33.Z" and "2007-02-23T12:14:33.079-08:00" are valid Ion timestamp text values. In this instance I would describe Ion as strongly typed and JSON as weakly typed. Or, as the Ion documentation puts it, "richly typed".
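To make the ambiguity concrete, here is a minimal Python sketch (reusing the key name from the example above): the JSON parser hands the application a plain string, and each consumer layers its own interpretation on top.

```python
import json
from datetime import date, datetime

doc = json.loads('{"start": "2007-03-01"}')

# JSON hands the application a plain string; the "type" lives elsewhere.
assert isinstance(doc["start"], str)

# Each consumer picks its own interpretation:
as_date = date.fromisoformat(doc["start"])          # a calendar date
as_midnight = datetime.fromisoformat(doc["start"])  # midnight on that day
```

Both interpretations are "correct", which is exactly the problem: nothing in the payload says which one the producer meant.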
To make an analogy, weakly typed is the Excel cell that can store whatever value you put in it, or the PHP integer 1 which is considered equal to "1" (loose equality). Strongly typed is the relational database row with a column described precisely by the table schema. Weakly typed is the CSV file; strongly typed is the Ion document.
However I don't think it's accurate to say that the typing of Ion is any "stronger." Both Ion and JSON are fully dynamically typed, which means that types are attached to every value on the wire. It's just that without an actual timestamp type in JSON, you have to encode timestamp data into a more generic type.
              poorly typed <-------------> richly typed
    dynamic   CSV, INI          JSON       YAML, Ion
    static    Bencode           ASN.1      Protobuf
What I mean by "richly typed" is that you would never read a timestamp off the wire and not know that it's a timestamp. By comparison, with CSV or INI files, you just have strings everywhere. Formats on the richly typed side have separate, explicit types for binary blobs and text, for example.

I officially propose the terms "accidentally typed" or "eventually typed".
Some of the benefits over JSON:
* Real date type
* Real binary type - no need to base64 encode
* Real decimal type - invaluable when working with currency
* Annotations - You can tag an Ion field in a map with an annotation that says, e.g. its compression ("csv", "snappy") or its serialized type ('com.example.Foo').
* Text and binary format
* Symbol tables - this is like automated jsonpack.
* It's self-describing - meaning, unlike Avro, you don't need the schema ahead of time to read or write the data.
Its binary format was introduced in 2002!
Edit: Property lists only support integers up to 128 bits in size and double-precision floating point numbers. On top of those, Ion also supports infinite precision decimals.
(plutil "supports" a json format, but it's not capable of expressing the complete feature set of the XML or binary formats.)
What does JavaScript do with this though, just cast it to a float?
"price": {
"amount": "1500",
"scale": 2,
"symbol": "GBP",
}
Currency has three properties: the amount, scale, and symbol. Amount is a string; it holds a bigint. Yes, it's a string.
The value of Scale can be up to 5 but is usually 2 or 3.
Symbol is the ISO code.
Whenever I see a financial system that uses "amount": 15.00 I know that the system is ill-conceived.
I believe that the proper way to handle money is to use Integer values plus a pre-defined precision.
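A minimal Python sketch of that integer-plus-scale scheme, reusing the field names from the price example above (the helper functions are mine, not from any standard money library):

```python
from decimal import Decimal

def to_decimal(amount: str, scale: int) -> Decimal:
    """Turn an integer amount string plus a scale into an exact decimal value."""
    return Decimal(int(amount)).scaleb(-scale)

def from_decimal(value: Decimal, scale: int) -> str:
    """Store the value back as an integer string at the given scale."""
    return str(int(value.scaleb(scale)))

price = {"amount": "1500", "scale": 2, "symbol": "GBP"}
value = to_decimal(price["amount"], price["scale"])
assert value == Decimal("15.00")            # exactly 15 pounds, no float drift
assert from_decimal(value, price["scale"]) == "1500"
```

Because the wire format only ever carries integers, no reader can accidentally introduce binary floating-point rounding.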
There is no need to have a null which is fragmented into null.timestamp, null.string, and whatever. It will complicate processing: even once you know the type of some element is timestamp, you must still worry about whether it is null and what that means.
There should be just one null value, which is its own type. A given datum is either permitted to be null OR something else like a string. Or it isn't; it is expected to be a string, which is distinct from the null value; no string is a null value.
It's good to have a read notation for a timestamp, but it's not an elementary type; a timestamp is clearly an aggregate and should be understood as corresponding to some structure type. A timestamp should be expressible using that structure, not only as a special token.
This monstrosity is not exhibiting good typing; it is not good static typing, and not good dynamic typing either. Under static typing we can have some "maybe" type instead of null.string: in some representations we definitely have a string. In some other places we have a "maybe string", a derived type which gives us the possibility that a string is there, or isn't. Under dynamic typing, we can superimpose objects of different type in the same places; we don't need a null version of string since we can have "the" one and only null object there.
This looks like it was invented by people who live and breathe Java and do not know any other way of structuring data. Java uses statically typed references to dynamic objects, and each such reference type has a null in its domain so that "object not there" can be represented. But just because you're working on a reference implementation in such a language doesn't mean you cannot transcend the semantics of the implementation language. If you want to propose some broad interoperability standard, you practically must.
In practice, it doesn't. If you want to know if an IonValue is null, ask it with #isNull. If you don't care about the null's type, ignore it. On the other hand, the type is an additional form of metadata which allows overloading the meaning of a value.
nulls can also be annotated, so Ion doesn't really have the concept of a singular shared null sentinel.
More so than JSON, Ion often uses nulls to differentiate presence from value (that is, the absence of a field in a struct has a different meaning than the presence of that field with a null value). Since nulls are objects, they can be tested separately from the lack of a field definition.
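The presence-versus-null distinction can be sketched with a plain Python dict: get() alone conflates the two cases, while a membership check recovers the difference, much as Ion's model does.

```python
# A record where "middle_name" is present-but-null, and "nickname" is absent.
record = {"first_name": "Ada", "middle_name": None}

# dict.get() alone can't tell the two cases apart:
assert record.get("middle_name") is None
assert record.get("nickname") is None

# Checking membership first recovers the distinction:
assert "middle_name" in record      # field present, value is null
assert "nickname" not in record     # field absent altogether
```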
> a timestamp is clearly an aggregate and should be understood as corresponding to some structure type.
Timestamps are structured types with a literal representation that is explicitly modeled in the specification. You're free to ignore it and use a custom schema for representing time, but you've moved any validation into your application at that point and are no better off than JSON.
It recalls the nullability arguments between the ML family and the C/Java family.
kazinator is asking for safer document semantics and a type-safe API.
Source: I was there.
Source: I am one of the primary authors.
https://avro.apache.org/docs/current/
They both have self-describing schemas, support for binary values, JSON-interoperability, basic type systems (Ion seems to support a few more field types), field annotations, support for schema evolution, code generation not necessary, etc.
I think Avro has the additional advantages of being production-tested in many different companies, a fully-JSON schema, support for many languages, RPC baked into the spec, and solid performance numbers found across the web.
I can't really see why I'd prefer Ion. It looks like an excellent piece of software with plenty of tests, no doubt, but I think I could do without "clobs", "sexprs", and "symbols" at this level of representation, and it might actually be better if I do. Am I missing something?
Ion is designed to be self-describing, meaning that no schema is necessary to deserialize and interact with Ion structures. It's consequently possible to interact with Ion in a dynamic and reflective way, for example, in the same way that you can with JSON and XML. It's possible to write a pretty-printer for a binary Ion structure coming off the wire without having any idea of or schema for what's inside. Ion's advantage over those formats is that it's strongly typed (or richly typed, if you prefer). For example, Ion has types for timestamps, arbitrary-precision decimals like for currency, and can embed binary data directly (without base64 encoding), etc.
I wouldn't try to say that one or the other is better across the board. Rather, they have tradeoffs and relative strengths in different circumstances. Ion is in part designed to tackle scenarios like where your data might live a really long time, and needs to be comprehensible decades from now (whether you kept track of the schema or not, or remember which one it was); and needs to be comprehensible in a large distributed environment where not every application might possess the latest schema or where coordinating a single compile-time schema is a challenge (maybe each app only cares about some part of the data), and so on. Ion is well-suited to long-lived, document-type data that's stored at rest and interacted with in a variety of potentially complex ways over time. Data data. In the case of a simple RPC relationship between a single client and service, where the data being exchanged is ephemeral and won't stick around, and it's easy to definitively coordinate a schema across both applications, a typical serialization framework is a fine choice.
"Avro data is always serialized with its schema. Files that store Avro data should always also include the schema for that data in the same file. Avro-based remote procedure call (RPC) systems must also guarantee that remote recipients of data have a copy of the schema used to write that data."
https://avro.apache.org/docs/current/spec.html#Data+Serializ...
The timing of open-sourcing it mystifies me a bit. Maybe Amazon is trying to become more open-source friendly, like Microsoft did?
Perhaps more likely: they're planning on making public some internal APIs that use Ion heavily?
It's a core feature of Ion that the text and binary representations are isomorphic. You can take any Ion binary document and pretty-print it as an Ion text document that is exactly equivalent. You can edit that document and send it into your application, which will be guaranteed to be able to read it. Or you can take your hand-authored text data and transcode it into binary, and know that any Ion application can handle it without any extra effort.
Also, field order matters in Avro but not in JSON. That bit me pretty hard once... fortunately, I found out that Python's JSON library lets you read a JSON file into an OrderedDict instead of a plain dict, so I was able to get around it.
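For reference, the trick mentioned above is the object_pairs_hook parameter of Python's json module, which hands the parser's (key, value) pairs to any mapping constructor you like:

```python
import json
from collections import OrderedDict

text = '{"b": 1, "a": 2, "c": 3}'

# object_pairs_hook builds the mapping from pairs in wire order.
doc = json.loads(text, object_pairs_hook=OrderedDict)
assert list(doc) == ["b", "a", "c"]  # key order preserved
```

(On Python 3.7+ the plain dict returned by default also preserves insertion order, but object_pairs_hook makes the intent explicit and works on older versions.)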
I'm by no means a real Lisp programmer, but even I find S-Expressions more natural to write and process. And the simplicity of it allows for great editing tools too. This may be personal, but I always found JSON clunky.
Several years ago, I wouldn't have imagined this possible and I'm a little bummed that I left before it happened.
Like leef said above, I'm glad to have Ion as an option again.
It's particularly interesting to see the fixes and improvements from the actual open source cleanup effort getting to (many) Internal production services.
Amazon doesn't open source things, as a general rule. It can be done but it is a lot of jumping through hoops and they generally need good reasons to do it (as opposed to a lack of good reasons not to).
So now not only do we have the problem of redundant and mutually incompatible protocols (cue obligatory xkcd), but that we have so many such protocols that name collision is becoming an extra problem.
No need for a new protocol when doing it that way for basic things; if you need something more binary (busy messaging/real-time), there are plenty of alternatives to JSON.
I love the simplicity of JSON, so do others and it is successful so many try to attach on to that success. The success part was that it was so damn simple though, most attachments just complicate and add verbosity, echoes back to XML and SOAP wars which spawned the plain and simple JSON. Adding complexity is easy and anyone can do it, good engineers take complexity and make it simple, that is damn difficult.
But in JSON you'd encode that base64 as a string, and the application must know that the data isn't really a string but a blob in some type of encoding. That probably means wrapping it in another struct to provide that metadata. Ion provides a terse method of doing the same while maintaining data integrity:
'image/gif'::{{ R0lGODlhAQABAIABAP8AAP///yH5BAEAAAEALAAAAAABAAEAAAICRAEAOw== }}
The 'image/gif' annotation is application specific, but all consumers know that the contents of that value are binary. In the binary Ion representation, those 43 bytes are encoded as a 45-byte value (one byte for the type marker and a second for the length in this case; as little as 47 bytes with the annotation and a shared symbol table), making the binary representation very efficient for transferring binary data.

Since Ion is a superset of JSON, it's by definition more complex, but the complexity isn't unapproachable. Most of the engineers I worked with assumed it was JSON until coming across timestamps, annotations, or bare word symbols.
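A quick back-of-the-envelope check of those sizes in Python (using 43 placeholder bytes rather than the actual GIF):

```python
import base64

payload = bytes(43)  # stand-in for the 43-byte GIF above

# JSON must carry the bytes as base64 text, plus the surrounding quotes:
b64 = base64.b64encode(payload)
json_cost = len(b64) + 2           # 60 chars of base64 + two quote characters
assert json_cost == 62

# A length-prefixed binary encoding needs only a small fixed header,
# e.g. the type marker and length bytes cited above:
binary_cost = len(payload) + 2
assert binary_cost == 45
```

And that is before counting the wrapper struct JSON would need to say "this string is actually a GIF".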
JSON's string literals come from JavaScript, and JavaScript only sort of has a Unicode string type. So the \u escape in both languages encodes a UTF-16 code unit, not a code point. That means in JSON, the single code point U+1F4A9 "Pile of Poo" is encoded thusly:
"\ud83d\udca9"
JSON specifically says this, too:

    Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A though F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".

    [… snip …]

    To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
Now, Ion's spec says only:

    U+HHHH    \uHHHH    4-digit hexadecimal Unicode code point

But if we take it to mean code point, then if the value is a surrogate… what should happen? Looking at the code, it looks like the above JSON will parse:
1. Main parsing of \u here:
https://github.com/amznlabs/ion-java/blob/1ca3cbe249848517fc6d91394bb493383d69eb61/src/software/amazon/ion/impl/IonReaderTextRawTokensX.java#L2429-L2434
2. which is called from here, and just appended to a StringBuilder:
https://github.com/amznlabs/ion-java/blob/1ca3cbe249848517fc6d91394bb493383d69eb61/src/software/amazon/ion/impl/IonReaderTextRawTokensX.java#L1975
My Java isn't that great though, so I'm speculating. But I'm not sure what should happen.

This is just one of those things that, the first time I saw it in JSON/JS… a part of my brain melted. This is all a technicality, of course, and most JSON values should work just fine.
Surrogates are code points. The spec does not say what should happen if the surrogate is invalid (for example, if only the first surrogate of a surrogate pair is present), but neither does the JSON spec.
Java internally also represents non-BMP code points using surrogates. So, simply appending the surrogates to the string should yield a valid Java string if the surrogates in the input are valid.
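Python's json module handles the pair the same way, which makes the round trip easy to verify:

```python
import json

# Escaped as a UTF-16 surrogate pair on the wire...
wire = '"\\ud83d\\udca9"'
assert json.loads(wire) == "\U0001F4A9"  # ...but decodes to one code point

# The default encoder emits the same pair going the other way:
assert json.dumps("\U0001F4A9") == '"\\ud83d\\udca9"'
```

A lone surrogate (say, \ud83d with no trailing \udca9) is the genuinely underspecified case; decoders differ on whether to reject it or pass it through.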
And where does Ion fit here?
Edit: There is a benchmark script that tests a few serializers and validators in Ruby in my [employer's] ClassyHash gem: https://github.com/deseretbook/classy_hash/. It would be easy to add more serializers to the benchmark: https://github.com/deseretbook/classy_hash/blob/master/bench...
Data formats like JSON and XML can be somewhat self-describing, but they aren't always completely. Both tend to need to embed more complex data types as either strings with implied formats, or nested structures. (Consider: How would you represent a timestamp in JSON such that an application could unambiguously read it? An arbitrary-precision decimal? A byte array?) I'm not familiar with EDN, but it appears to be in a similar position as JSON in this regard. ProtocolBuffers, Thrift, and Avro require a schema to be defined in advance, and only work with schema-described data as serialization layers. Ion is designed to work with self-describing data that might be fairly complex, and have no compiled-ahead-of-time schema.
Ion makes it easy to pass data around with high fidelity even if intermediate systems through which the data passes understand only part of the data but not all of it. A classic weakness of traditional RPC systems is that, during an upgrade where an existing structure gains an additional field, that structure might pass through an application that doesn't know about the field yet. Thus when the structure gets deserialized and serialized again, the field is missing. The Ion structure by comparison can be passed from the wire to the application and back without that kind of loss. (Some serialization-based frameworks have solutions to this problem too.)
One downside is that its performance tends to be worse than schema-based serialization frameworks like Thrift/ProtoBuf/Avro where the payload is generally known in advance, and code can be generated that will read and deserialize it. Another downside is that it's difficult to isolate Ion-aware code from the more general purpose "business logic" in an application, due to the absence of a serialization layer producing/consuming POJOs; instead it's common to read an Ion structure from the wire and access it directly from application logic.
However, it doesn't support blobs. I'm conflicted about this point. On one hand, small blobs can occasionally be useful to send within a larger payload. On the other hand, small blobs almost always become large blobs, and so I'd rather plan for out-of-band (preferably even content addressable) representations of blobs.
This is indeed a common pitfall, especially since traversing Ion is slow and expensive. I've squeezed up to 30% performance gain by converting Ion data to POJOs up front and just using those.
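That convert-up-front pattern looks roughly like this in Python, with a frozen dataclass standing in for a POJO (the document shape and field names here are hypothetical):

```python
from dataclasses import dataclass

# Hypothetical dynamic document, standing in for a parsed Ion struct.
raw = {"sku": "B000123", "price": "1500", "scale": 2}

@dataclass(frozen=True)
class Item:
    sku: str
    price_minor_units: int
    scale: int

def materialize(doc: dict) -> Item:
    """Pay the traversal/validation cost once, then use plain typed fields."""
    return Item(
        sku=str(doc["sku"]),
        price_minor_units=int(doc["price"]),
        scale=int(doc["scale"]),
    )

item = materialize(raw)
assert item.price_minor_units == 1500
```

After materialize(), the hot path touches only cheap attribute reads instead of repeatedly walking the dynamic structure.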
* It doesn't have "true" types in the sense that Ion does. It's basically just a binary serialization of JSON, with extra stuff.
* Despite being a binary format, it's actually bulkier than JSON in most situations.
* It removes any semblance of canonicity from many representations. A number, for instance, can potentially be represented by any of at least 3 types (double, int32, and int64).
* It has signed 32-bit length limits all over the place. Not that I'd want to be storing 2GB of data in a single JSON document either, but it's not even possible to do so with BSON!
* It requires redundant null bytes in unpredictable places. For instance, all strings must be stored with a trailing null byte, which is included in their length. There's also a trailing null byte at the end of a document for no reason at all.
* It is unabashedly Javascript-specific, containing types like "JavaScript code with scope" which are meaningless to other languages.
* It also contains some MongoDB-specific cruft, such as the "ObjectID" and "timestamp" types (the latter of which, despite its name, cannot actually be used to store time values).
* It contains numerous "deprecated" and "old" features (in version 1.0!) with no guidance as to how implementations should handle them.
See e.g. https://metacpan.org/pod/Cpanel::JSON::XS#SECURITY-CONSIDERA... I need to add ion to this security matrix.
YAML does most of those and more, and can be made quite secure by limiting the allowed types to the absolute, trusted minimum; that limiting is, however, implemented only in the Python backend, not the Perl one. By default YAML is extremely insecure.
There are more new readable and typed JSON variants out there. E.g. jzon-c should be faster than ion, but there are also Hjson and SJSON. See https://github.com/KarlZylinski/jzon-c
I've no clue about the trailing NUL on the record itself, perhaps a safety feature?
What? This means their "arbitrary-precision decimals" are actually isomorphic to (Rational x Natural).
e.g. in Python:
>>> from decimal import Decimal as D
>>> 2 * D("1.0")
Decimal('2.0')
>>> 2 * D("1.000")
Decimal('2.000')
>>> D("1.0") == D("1.000")
True

The Ion value 0.0 has one digit of precision (after the decimal point), while the value 0.00 has two. In the Ion data model, those are two distinct values, and conforming implementations must maintain the distinction.
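Python's Decimal carries the same coefficient-and-exponent pair internally, so the distinction Ion's data model preserves is visible via as_tuple() even though == collapses it:

```python
from decimal import Decimal

a, b = Decimal("0.0"), Decimal("0.00")

assert a == b                          # numerically equal...
assert a.as_tuple() != b.as_tuple()    # ...but distinct representations
assert a.as_tuple().exponent == -1     # one digit after the point
assert b.as_tuple().exponent == -2     # two digits after the point
```

That exponent field is exactly the "Natural" half of the (Rational x Natural) characterization above.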
(Insert joke here about Google engineers just copying around protobufs.)
https://www.reddit.com/r/haskell/comments/4fhuw3/json_for_ad...
ASN.1 also has a million baroque types (VideotexString, anyone?) where most people just need "string", "small int", "big int", etc.
Some more on BER parsing hell here: https://mirage.io/blog/introducing-asn1
...unless you're Fabrice Bellard, who apparently wrote one just because it was one of the minor obstacles on the way to writing a full LTE base station:
1) Open types - typically applications consuming Ion data do not restrict the fields included (that is, they gracefully ignore, and often even pass along, additional fields). Schemas may grow while remaining backwards compatible with existing software.

2) Type annotations allow embedding schema information into a datagram without the need to agree on special fields. Datagrams may have multiple values at the top level, so it's possible to provide multiple representations without introducing a new top-level container.

3) The only data that might need to be shared between a producer and a consumer is a SymbolTable, which may be applicable to several schemas and may be shared inline if necessary. Otherwise, objects in a datagram are always inspectable and discoverable without additional metadata.

It has isomorphic text and binary representations as part of the standard, making debugging or optimized transport a config option.
The type system is significantly richer than JSON and maps well to several languages (internally Amazon uses it with C, C++, Perl, Java, Ruby, etc.).
S-Expressions.
Then how is the client supposed to handle the data? Guessing?
> backwards compatible schemas
> text and binary representations
> type system
> maps well to several languages
Protos have all these.
> S-Expressions
Okay? Is that useful?
This really looks like a NIH specification.
Basically: Ion == JSON + extra features + binary format spec. But Ion ~= YAML + binary format spec. You're going to write a new serializer/deserializer in both cases anyway, but in the second one, at least you get the text part for free in almost any language available.
- IonValues are mutable by default. I saw bugs where cached IonValues were accidentally changed, which is easy to do: IonSequence.extract clears the sequence [1], adding an IonValue to a container mutates the value (!) [2], etc.
- IonValues are not thread-safe [3]. You can call makeReadOnly() to make them immutable, but then you'll be calling clone since doing anything useful (like adding it to a list) will need to mutate the value. While it says IonValues are not even thread-safe for reading, I believe this is not strictly true. There was an internal implementation that would lazily materialize values on read, but it doesn't look like it's included in the open source version.
- IonStruct can have multiple fields with the same name, which means it can't implement Map. I've never seen anyone use this (mis)feature in practice, and I don't know where it would be useful.
- Since IonStruct can't implement Map, you don't get the Java 8 default methods like forEach, getOrDefault, etc.
- IonStruct doesn't implement keySet, values, spliterator, or stream, and thus doesn't play well with the Java 8 Stream API.
- Calling get(fieldName) on an IonStruct returns null if the field isn't present. But the value might also be there and be null, so you end up having to do a null check AND call isNullValue(). I'm not convinced it's a worthwhile distinction, and would have preferred a single way of doing it. You can already call containsKey to check for the presence of a field.
- In practice most code that dealt with Ion was nearly as tedious and verbose as pulling values out of an old-school JSONObject. Every project seemed to have a slightly different IonUtils class for doing mundane things like pulling values out of structs, doing all the null checks, casting, etc. There was some kind of adapter for Jackson that would allow you to deserialize to a POJO, but it didn't seem like it was widely used.
[1] https://github.com/amznlabs/ion-java/blob/master/src/softwar...
[2] https://github.com/amznlabs/ion-java/blob/master/src/softwar...
[3] https://github.com/amznlabs/ion-java/blob/master/src/softwar...
Why not "com.amazon.ion", like thousands of other existing packages?