I've been experimenting with this approach against SQLite for a few years now, and I really like it.
My sqlite-utils package does exactly this. Try running this on the command line:
brew install sqlite-utils
echo '[
{"id": 1, "name": "Cleo"},
{"id": 2, "name": "Azy", "age": 1.5}
]' | sqlite-utils insert /tmp/demo.db creatures - --pk id
sqlite-utils schema /tmp/demo.db
It outputs the generated schema:

CREATE TABLE [creatures] (
[id] INTEGER PRIMARY KEY,
[name] TEXT,
[age] FLOAT
);
When you insert more data you can use the --alter flag to have it automatically create any missing columns. Full documentation here: https://sqlite-utils.datasette.io/en/stable/cli.html#inserti...
It's also available as a Python library: https://sqlite-utils.datasette.io/en/stable/python-api.html
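As a sketch of the kind of schema inference described above (a simplified, hypothetical version using only the standard library, not the library's actual code):

```python
import json
import sqlite3

# Map Python types (as decoded from JSON) to SQLite column types,
# mirroring the kind of inference sqlite-utils performs.
TYPE_MAP = {int: "INTEGER", float: "FLOAT", str: "TEXT"}

def infer_schema(table, rows, pk=None):
    """Build a CREATE TABLE statement from a list of JSON-style dicts."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, TYPE_MAP.get(type(value), "TEXT"))
    defs = [
        f"[{name}] {ctype}" + (" PRIMARY KEY" if name == pk else "")
        for name, ctype in columns.items()
    ]
    return f"CREATE TABLE [{table}] ({', '.join(defs)})"

rows = json.loads('[{"id": 1, "name": "Cleo"}, {"id": 2, "name": "Azy", "age": 1.5}]')
sql = infer_schema("creatures", rows, pk="id")
print(sql)
# CREATE TABLE [creatures] ([id] INTEGER PRIMARY KEY, [name] TEXT, [age] FLOAT)

db = sqlite3.connect(":memory:")
db.execute(sql)
```

Columns are discovered by unioning the keys seen across all rows, which is why the "age" column appears even though the first record lacks it.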
I'm all for layers, a fundamental approach in our field to tame complexity. And the SQL model and SQLite have stood the test of time and are solid foundations.
I'm just wondering: could we be stuck in a local maximum where the presumed answer is always the relational model? Maybe if we built the relational model on top of a different set of lower-level primitives (a type system instead of schemas and tables) we could escape the local maximum we're stuck in. Just a thought.
There are a few somewhat ad hoc perf measurements here regarding the sqlite-utils and sqlite... https://zed.brimdata.io/docs/commands/zq/#73-performance-com...
I'm not a SQLite expert so if I did something wrong, please holler and let me know :)
It's not particularly designed for speed - it should be fast as far as Python code goes (I use some generator tricks to stream data and avoid having to load everything into memory at once) but I wouldn't expect "sqlite-utils insert" to win any performance competitions with tools written in other languages.
Those benchmarks against sqlite itself are definitely interesting. I'm looking forward to playing with the "native ZNG support for Python" mentioned on https://github.com/brimdata/zed/blob/main/docs/libraries/pyt... when that becomes available.
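The generator trick mentioned above can be sketched (a hypothetical illustration, not the actual sqlite-utils internals) as batching a lazy iterator so only one chunk is ever held in memory:

```python
import itertools
import sqlite3

def batches(iterable, size):
    """Yield lists of up to `size` items, pulling lazily from `iterable`."""
    it = iter(iterable)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

def stream_insert(db, rows, batch_size=100):
    # Each executemany call sees one chunk; the full dataset is never materialised.
    for chunk in batches(rows, batch_size):
        db.executemany("INSERT INTO creatures (id, name) VALUES (?, ?)", chunk)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE creatures (id INTEGER PRIMARY KEY, name TEXT)")
# A generator: rows are produced lazily, never all in memory at once.
rows = ((i, f"creature-{i}") for i in range(1000))
stream_insert(db, rows, batch_size=250)
print(db.execute("SELECT count(*) FROM creatures").fetchone()[0])  # 1000
```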
Despite the claims, SQL is NOT "schema-fixed".
You can 100% create new schemas, alter them and modify them.
What actually happens is that if you have a CENTRAL repository of data (aka a "source of truth"), then you bet you want to "freeze" your schemas (because it is like an API, where you need to fulfill contracts).
--
SQL has limitations in composability, and the biggest reason "NoSQL" works is this: a JSON value is composable. A "stringy" SQL statement is not. If SQL were really built around relations and tuples, like (stealing from my project, TablaM):
[Customer id:i32, name:Str; 1, "Jhon"]
then developers would have less reason to go elsewhere.

Adding JSON traversal operators and functions helps a lot when you end up denormalizing bits of the schema. It's not hard.
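For example, SQLite's built-in JSON functions (available in any reasonably recent build) let you traverse into a denormalized JSON column directly from SQL:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, data TEXT)")
db.execute(
    "INSERT INTO customers VALUES (1, ?)",
    ('{"name": "John", "tags": ["vip", "trial"]}',),
)
# json_extract walks into the stored JSON with a path expression,
# no fixed schema for the nested structure required.
name = db.execute(
    "SELECT json_extract(data, '$.name') FROM customers WHERE id = 1"
).fetchone()[0]
first_tag = db.execute(
    "SELECT json_extract(data, '$.tags[0]') FROM customers WHERE id = 1"
).fetchone()[0]
print(name, first_tag)  # John vip
```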
My understanding of EdgeDB is they're mostly trying to make correct data-modeling simpler and more intuitive; to let people model relations in the same way they speak and think about it, rather than having to map to SQL concepts like join tables. I rather like what they're going for, though I haven't used it.
EdgeDB seems to be mostly for business logic and OLTP. They're not trying to deal with arbitrary incoming data that might be outside of the control of the ingestion system. You wouldn't even have an ingestion system with EdgeDB.
SQL does not allow this:
by_id := WHERE id = $1
SELECT * | by_id[0]

https://substrait.io
Instead of those words I'd suggest something like "schema on write" vs. "schema on read", or "persisted structured" vs. "persisted unstructured". "Document" vs. "relational" doesn't quite capture it, since unstructured data can have late-binding relations applied at read time, and structured data doesn't have to be relational.
And of course, modern relational databases can store unstructured data as easily as structured data.
Eventually we get to the meat:
> For example, the JSON value
{"s":"foo","a":[1,"bar"]}
> would traditionally be called “schema-less” and in fact is said to have the vague type “object” in the world of JavaScript or “dict” in the world of Python. However, the super-structured interpretation of this value’s type is instead:

> type record with field s of type string and field a of type array of type union of types integer and string
> We call the former style of typing a “shallow” type system and the latter style of typing a “deep” type system. The hierarchy of a shallow-typed value must be traversed to determine its structure whereas the structure of a deeply-typed value is determined directly from its type.
This is a bit confusing, since JSON data commonly has an implicit schema, or "deep type system" as this post calls it, and if you consume data in any statically-typed language you will materialise the implicit "deep" types in your host language.
So it seems that ZSON is sort of like a TypeScript-ified version of JSON, where the implicit types are made explicit.
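The "deep type" of a JSON value, as the quoted passage describes it, could be derived with a small recursive function (a sketch of the idea only, not Zed's actual implementation or notation):

```python
def deep_type(value):
    """Return a string describing the full structural type of a JSON value."""
    if isinstance(value, bool):   # check before int: bool subclasses int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        element_types = sorted({deep_type(v) for v in value})
        inner = (
            element_types[0]
            if len(element_types) == 1
            else f"union[{','.join(element_types)}]"
        )
        return f"array[{inner}]"
    if isinstance(value, dict):
        fields = ",".join(f"{k}:{deep_type(v)}" for k, v in value.items())
        return f"record{{{fields}}}"
    raise TypeError(f"not a JSON value: {value!r}")

print(deep_type({"s": "foo", "a": [1, "bar"]}))
# record{s:string,a:array[union[int,string]]}
```

Running this on the article's example value materialises exactly the "record with field s of type string..." description, which is the implicit type a statically-typed consumer would otherwise have to declare by hand.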
It seems the point is not to have an external schema that documents must comply with, so I guess at the end of the day it has a similar aim to other "self-describing" message formats like https://amzn.github.io/ion-docs/ ? i.e. each message has its own schema.
So the interesting part is perhaps the new data tools to work with large collections of self-describing messages?
Since the author of the blog post is here, I'll just jump in to agree with this part: there is a lot of unnecessary background text before we get to the meat of it. I don't think people need a history lesson on NoSQL and SQL, and IMO the "authoritarianism" metaphor is a stretch, and that word has pretty negative connotations.
I think there's some value in setting the scene, but I think you will lose readers before they get to the much more interesting content further down. I recommend revising it to be a lot shorter.
That is an incredibly expensive operation to perform. Being able to look at two binary blobs of data and quickly determining whether or not they are the same type of data unlocks a whole host of functionality over large amounts of data that is otherwise prohibitively expensive and slow.
[0] https://zed.brimdata.io/docs/language/overview/ [1] https://docs.confluent.io/platform/current/schema-registry/i...
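The cheap comparison alluded to above can be sketched by collapsing each value to a canonical type signature once at ingest time, after which "same type?" is a constant-time hash comparison instead of a full traversal (a hypothetical illustration):

```python
import hashlib
import json

def type_signature(value):
    """Collapse a JSON value to its structural type, then hash it."""
    def t(v):
        if isinstance(v, dict):
            return {k: t(x) for k, x in v.items()}
        if isinstance(v, list):
            # Canonicalise element types as a sorted set.
            return [sorted({json.dumps(t(x), sort_keys=True) for x in v})]
        return type(v).__name__
    canonical = json.dumps(t(value), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

a = {"id": 1, "name": "Cleo"}
b = {"id": 2, "name": "Azy"}
c = {"id": 3, "name": "Azy", "age": 1.5}
# Same deep type -> same signature; comparing hashes needs no traversal.
print(type_signature(a) == type_signature(b))  # True
print(type_signature(a) == type_signature(c))  # False
```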
Suggest looking into JSON-LD which was intended to solve many of the type and validation use-cases related to type and schema.
RDF has concrete syntaxes, one of them being JSON-LD, and it can be used to model relational databases fairly well with R2RML (https://www.w3.org/TR/r2rml/) which essentially turns relation databases into a concrete syntax for RDF.
schema.org is also based on RDF, and is essentially an ontology (one of many) that can be used for RDF and non RDF data, but mainly because almost all data can be represented as RDF - so non RDF data is just data that does not have a formal mapping to RDF yet.
Ontologies is a concept used frequently in RDF but rarely outside of it, it is quite important for federated or distributed knowledge, or descriptions of entities. It focuses heavily on modelling properties instead of modelling objects, and then whenever a property occurs that property can be understood within the context of an ontology.
An example is the age of a person (https://schema.org/birthDate)
When I get a semantic triple:
<example:JohnSmith> <https://schema.org/birthDate> "2000-01-01"^^<https://schema.org/Date>
This tells me that the entity identified by the IRI <example:JohnSmith> is a person, and their birth date is 2000-01-01. However, I don't expect that I will get all other descriptions of this person at the same time; I won't necessarily get their <https://schema.org/nationality> for example, even though this is a property of a <https://schema.org/Person> defined by schema.org.
I can also combine https://schema.org/ based descriptions with other descriptions, and these descriptions can be merged from multiple sources and then queried together using SPARQL.
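Merging descriptions from multiple sources, as described, can be illustrated with triples as plain tuples (a toy sketch; a real system would use an RDF library and SPARQL):

```python
# Triples as (subject, predicate, object) tuples; merging graphs is set union.
source_a = {
    ("example:JohnSmith", "https://schema.org/birthDate", "2000-01-01"),
}
source_b = {
    ("example:JohnSmith", "https://schema.org/nationality", "example:Ireland"),
}
graph = source_a | source_b

def query(graph, subject=None, predicate=None):
    """Return triples matching the given subject/predicate pattern."""
    return [
        (s, p, o)
        for (s, p, o) in graph
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
    ]

# All known facts about John Smith, regardless of which source they came from:
facts = query(graph, subject="example:JohnSmith")
print(sorted(facts))
```

Because each property is understood via its ontology rather than its container, neither source needs to know the other exists before their descriptions can be queried together.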
https://digitalbazaar.github.io/cbor-ld-spec/
RDF Binary Encoding using Thrift:
Other resources: https://github.com/semantalytics/awesome-semantic-web
(Please forgive me)
And it seems like the newer "zed lake" format is like a large blob managed by a server. Can you also convert data between the file formats and the lake format? What is the lake's main use case?
> EdgeDB is essentially a new data silo whose type system cannot be used to serialize data external to the system.
I think this implies that serializing external data to zson is easier than writing an INSERT into edgedb, but not sure why that would be.
Ok, fine. But I'm not sure how this helps if you have six different systems with six different definitions of a customer, and more importantly, different relationships between customers and other objects like orders or transactions or locations or communications.
I don't see their approach as ground-breaking, but it is definitely worthy of discussion.
If you have this problem, consider giving RDF a look - you can fairly easily use RDF based technologies to map the data in these systems onto a common model, some examples of tools that may be useful here is https://www.w3.org/TR/r2rml/ and https://github.com/ontop/ontop - you can also use JSON-LD to convert most JSON data to RDF. For more info ask in https://gitter.im/linkeddata/chat