I've been experimenting with this approach against SQLite for a few years now, and I really like it.
My sqlite-utils package does exactly this. Try running this on the command line:
brew install sqlite-utils
echo '[
{"id": 1, "name": "Cleo"},
{"id": 2, "name": "Azy", "age": 1.5}
]' | sqlite-utils insert /tmp/demo.db creatures - --pk id
sqlite-utils schema /tmp/demo.db
It outputs the generated schema:

CREATE TABLE [creatures] (
[id] INTEGER PRIMARY KEY,
[name] TEXT,
[age] FLOAT
);
When you insert more data you can use the --alter flag to have it automatically create any missing columns. Full documentation here: https://sqlite-utils.datasette.io/en/stable/cli.html#inserti...
It's also available as a Python library: https://sqlite-utils.datasette.io/en/stable/python-api.html
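As a sketch of the kind of schema inference described above (a simplified, hypothetical version using only the standard library, not the library's actual code):

```python
import json
import sqlite3

# Map Python types (as decoded from JSON) to SQLite column types,
# mirroring the kind of inference sqlite-utils performs.
TYPE_MAP = {int: "INTEGER", float: "FLOAT", str: "TEXT"}

def infer_schema(table, rows, pk=None):
    """Build a CREATE TABLE statement from a list of JSON-style dicts."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, TYPE_MAP.get(type(value), "TEXT"))
    defs = [
        f"[{name}] {ctype}" + (" PRIMARY KEY" if name == pk else "")
        for name, ctype in columns.items()
    ]
    return f"CREATE TABLE [{table}] ({', '.join(defs)})"

rows = json.loads('[{"id": 1, "name": "Cleo"}, {"id": 2, "name": "Azy", "age": 1.5}]')
sql = infer_schema("creatures", rows, pk="id")
print(sql)
# CREATE TABLE [creatures] ([id] INTEGER PRIMARY KEY, [name] TEXT, [age] FLOAT)

db = sqlite3.connect(":memory:")
db.execute(sql)
```

Columns are discovered by unioning the keys seen across all rows, which is why the "age" column appears even though the first record lacks it.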
I'm all for layers, a fundamental approach in our field to tame complexity. And the SQL model and SQLite have stood the test of time and are solid foundations.
I'm just wondering: could we be stuck in a local maximum where the presumed answer is always the relational model? Maybe if we built the relational model on top of a different set of lower-level primitives (a type system instead of schemas and tables) we could escape the local maximum we're stuck in. Just a thought.
There are a few somewhat ad hoc perf measurements here regarding the sqlite-utils and sqlite... https://zed.brimdata.io/docs/commands/zq/#73-performance-com...
I'm not a SQLite expert so if I did something wrong, please holler and let me know :)
It's not particularly designed for speed - it should be fast as far as Python code goes (I use some generator tricks to stream data and avoid having to load everything into memory at once) but I wouldn't expect "sqlite-utils insert" to win any performance competitions with tools written in other languages.
Those benchmarks against sqlite itself are definitely interesting. I'm looking forward to playing with the "native ZNG support for Python" mentioned on https://github.com/brimdata/zed/blob/main/docs/libraries/pyt... when that becomes available.
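The generator trick mentioned above can be sketched (a hypothetical illustration, not the actual sqlite-utils internals) as batching a lazy iterator so only one chunk is ever held in memory:

```python
import itertools
import sqlite3

def batches(iterable, size):
    """Yield lists of up to `size` items, pulling lazily from `iterable`."""
    it = iter(iterable)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

def stream_insert(db, rows, batch_size=100):
    # Each executemany call sees one chunk; the full dataset is never materialised.
    for chunk in batches(rows, batch_size):
        db.executemany("INSERT INTO creatures (id, name) VALUES (?, ?)", chunk)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE creatures (id INTEGER PRIMARY KEY, name TEXT)")
# A generator: rows are produced lazily, never all in memory at once.
rows = ((i, f"creature-{i}") for i in range(1000))
stream_insert(db, rows, batch_size=250)
print(db.execute("SELECT count(*) FROM creatures").fetchone()[0])  # 1000
```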
Despite the claims, SQL is NOT "schema-fixed".
You can 100% create new schemas, alter them and modify them.
What actually happens is that if you have a CENTRAL repository of data (aka a "source of truth"), then you bet you want to "freeze" your schemas (because it is like an API, where you need to fulfill contracts).
--
SQL has limitations in composability, and the biggest reason "NoSQL" works is this: a JSON value is composable. A "stringy" SQL statement is not. If SQL were really built around relations and tuples, like (stealing from my project, TablaM):
[Customer id:i32, name:Str; 1, "Jhon"]
then developers would have less reason to go elsewhere.

Adding JSON traversal operators and functions helps a lot when you end up denormalizing bits of the schema. It's not hard.
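For example, SQLite's built-in JSON functions (available in any reasonably recent build) let you traverse into a denormalized JSON column directly from SQL:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, data TEXT)")
db.execute(
    "INSERT INTO customers VALUES (1, ?)",
    ('{"name": "John", "tags": ["vip", "trial"]}',),
)
# json_extract walks into the stored JSON with a path expression,
# no fixed schema for the nested structure required.
name = db.execute(
    "SELECT json_extract(data, '$.name') FROM customers WHERE id = 1"
).fetchone()[0]
first_tag = db.execute(
    "SELECT json_extract(data, '$.tags[0]') FROM customers WHERE id = 1"
).fetchone()[0]
print(name, first_tag)  # John vip
```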
My understanding of EdgeDB is they're mostly trying to make correct data-modeling simpler and more intuitive; to let people model relations in the same way they speak and think about it, rather than having to map to SQL concepts like join tables. I rather like what they're going for, though I haven't used it.
EdgeDB seems to be mostly for business logic and OLTP. They're not trying to deal with arbitrary incoming data that might be outside of the control of the ingestion system. You wouldn't even have an ingestion system with EdgeDB.
SQL does not allow this:
by_id := WHERE id = $1
SELECT * | by_id[0]

https://substrait.io
Instead of those words I'd suggest something like "schema on write" vs. "schema on read", or "persisted structured" vs. "persisted unstructured". "Document" vs. "relational" doesn't quite capture it, since unstructured data can have late-binding relations applied at read time, and structured data doesn't have to be relational.
And of course, modern relational databases can store unstructured data as easily as structured data.
Eventually we get to the meat:
> For example, the JSON value
{"s":"foo","a":[1,"bar"]}
> would traditionally be called “schema-less” and in fact is said to have the vague type “object” in the world of JavaScript or “dict” in the world of Python. However, the super-structured interpretation of this value’s type is instead:

> type record with field s of type string and field a of type array of type union of types integer and string
> We call the former style of typing a “shallow” type system and the latter style of typing a “deep” type system. The hierarchy of a shallow-typed value must be traversed to determine its structure whereas the structure of a deeply-typed value is determined directly from its type.
This is a bit confusing, since JSON data commonly has an implicit schema, or "deep type system" as this post calls it, and if you consume data in any statically-typed language you will materialise the implicit "deep" types in your host language.
So it seems that ZSON is sort of like a TypeScript-ified version of JSON, where the implicit types are made explicit.
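The "deep type" of a JSON value, as the quoted passage describes it, could be derived with a small recursive function (a sketch of the idea only, not Zed's actual implementation or notation):

```python
def deep_type(value):
    """Return a string describing the full structural type of a JSON value."""
    if isinstance(value, bool):   # check before int: bool subclasses int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        element_types = sorted({deep_type(v) for v in value})
        inner = (
            element_types[0]
            if len(element_types) == 1
            else f"union[{','.join(element_types)}]"
        )
        return f"array[{inner}]"
    if isinstance(value, dict):
        fields = ",".join(f"{k}:{deep_type(v)}" for k, v in value.items())
        return f"record{{{fields}}}"
    raise TypeError(f"not a JSON value: {value!r}")

print(deep_type({"s": "foo", "a": [1, "bar"]}))
# record{s:string,a:array[union[int,string]]}
```

Running this on the article's example value materialises exactly the "record with field s of type string..." description, which is the implicit type a statically-typed consumer would otherwise have to declare by hand.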
It seems the point is not to have an external schema that documents must comply with, so I guess at the end of the day it has a similar aim to other "self-describing" message formats like https://amzn.github.io/ion-docs/ ? i.e. each message has its own schema.
So the interesting part is perhaps the new data tools to work with large collections of self-describing messages?
Since the author of the blog post is here, I'll just jump in to agree with this part: there is a lot of unnecessary background text before we get to the meat of it. I don't think people need a history lesson on NoSQL and SQL, and IMO the "authoritarianism" metaphor is a stretch, and that word has pretty negative connotations.
I think there's some value in setting the scene, but I think you will lose readers before they get to the much more interesting content further down. I recommend revising it to be a lot shorter.
That is an incredibly expensive operation to perform. Being able to look at two binary blobs of data and quickly determining whether or not they are the same type of data unlocks a whole host of functionality over large amounts of data that is otherwise prohibitively expensive and slow.
[0] https://zed.brimdata.io/docs/language/overview/ [1] https://docs.confluent.io/platform/current/schema-registry/i...
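The cheap comparison alluded to above can be sketched by collapsing each value to a canonical type signature once at ingest time, after which "same type?" is a constant-time hash comparison instead of a full traversal (a hypothetical illustration):

```python
import hashlib
import json

def type_signature(value):
    """Collapse a JSON value to its structural type, then hash it."""
    def t(v):
        if isinstance(v, dict):
            return {k: t(x) for k, x in v.items()}
        if isinstance(v, list):
            # Canonicalise element types as a sorted set.
            return [sorted({json.dumps(t(x), sort_keys=True) for x in v})]
        return type(v).__name__
    canonical = json.dumps(t(value), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

a = {"id": 1, "name": "Cleo"}
b = {"id": 2, "name": "Azy"}
c = {"id": 3, "name": "Azy", "age": 1.5}
# Same deep type -> same signature; comparing hashes needs no traversal.
print(type_signature(a) == type_signature(b))  # True
print(type_signature(a) == type_signature(c))  # False
```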
Suggest looking into JSON-LD which was intended to solve many of the type and validation use-cases related to type and schema.
RDF has concrete syntaxes, one of them being JSON-LD, and it can be used to model relational databases fairly well with R2RML (https://www.w3.org/TR/r2rml/) which essentially turns relation databases into a concrete syntax for RDF.
schema.org is also based on RDF, and is essentially an ontology (one of many) that can be used for RDF and non RDF data, but mainly because almost all data can be represented as RDF - so non RDF data is just data that does not have a formal mapping to RDF yet.
Ontologies is a concept used frequently in RDF but rarely outside of it, it is quite important for federated or distributed knowledge, or descriptions of entities. It focuses heavily on modelling properties instead of modelling objects, and then whenever a property occurs that property can be understood within the context of an ontology.
An example is the age of a person (https://schema.org/birthDate)
When I get a semantic triple:
<example:JohnSmith> <https://schema.org/birthDate> "2000-01-01"^^<https://schema.org/Date>
This tells me that the entity identified by the IRI <example:JohnSmith> is a person, and their birth date is 2000-01-01. However, I don't expect that I will get all other descriptions of this person at the same time; I won't necessarily get their <https://schema.org/nationality> for example, even though this is a property of a <https://schema.org/Person> defined by schema.org.
I can also combine https://schema.org/ based descriptions with other descriptions, and these descriptions can be merged from multiple sources and then queried together using SPARQL.
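Merging descriptions from multiple sources, as described, can be illustrated with triples as plain tuples (a toy sketch; a real system would use an RDF library and SPARQL):

```python
# Triples as (subject, predicate, object) tuples; merging graphs is set union.
source_a = {
    ("example:JohnSmith", "https://schema.org/birthDate", "2000-01-01"),
}
source_b = {
    ("example:JohnSmith", "https://schema.org/nationality", "example:Ireland"),
}
graph = source_a | source_b

def query(graph, subject=None, predicate=None):
    """Return triples matching the given subject/predicate pattern."""
    return [
        (s, p, o)
        for (s, p, o) in graph
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
    ]

# All known facts about John Smith, regardless of which source they came from:
facts = query(graph, subject="example:JohnSmith")
print(sorted(facts))
```

Because each property is understood via its ontology rather than its container, neither source needs to know the other exists before their descriptions can be queried together.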
https://digitalbazaar.github.io/cbor-ld-spec/
RDF Binary Encoding using Thrift:
Other resources: https://github.com/semantalytics/awesome-semantic-web
(Please forgive me)
And it seems like the newer "zed lake" format is like a large blob managed by a server. Can you also convert data between the file formats and the lake format? What is the lake's main use case?
> EdgeDB is essentially a new data silo whose type system cannot be used to serialize data external to the system.
I think this implies that serializing external data to zson is easier than writing an INSERT into edgedb, but not sure why that would be.
Ok, fine. But I'm not sure how this helps if you have six different systems with six different definitions of a customer, and more importantly, different relationships between customers and other objects like orders or transactions or locations or communications.
I don't see their approach as ground-breaking, but it is definitely worthy of discussion.
If you have this problem, consider giving RDF a look - you can fairly easily use RDF based technologies to map the data in these systems onto a common model, some examples of tools that may be useful here is https://www.w3.org/TR/r2rml/ and https://github.com/ontop/ontop - you can also use JSON-LD to convert most JSON data to RDF. For more info ask in https://gitter.im/linkeddata/chat