Step 1) Find all "paths" between tables
* Use the Postgres information schema to get all relations
* Use the npm library graph-cycles to check whether the graph has any cycles. If so, some relations go on a blacklist.
* Use the npm library toposort to sort the graph
* Traverse the graph and find all possible paths from and to tables
* Generate SQL queries to look up "affected" other tables. Input: table name + id. Output: list of table names + ids
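The path-finding part of step 1 could be sketched roughly like this in Node.js. The tables and relations below are hypothetical examples; in the real script the edges would come from the information schema, and cycles would already have been blacklisted:

```javascript
// Foreign-key relations, assumed already fetched from information_schema.
// Each edge points from the referencing table to the referenced table.
const relations = [
  { from: 'order_items', to: 'orders' },   // order_items.order_id -> orders.id
  { from: 'orders', to: 'customers' },     // orders.customer_id -> customers.id
  { from: 'order_items', to: 'products' }, // order_items.product_id -> products.id
];

// Find all simple paths (no repeated tables) from a source table to the target.
function findPaths(relations, source, target, path = []) {
  path = [...path, source];
  if (source === target) return [path];
  const paths = [];
  for (const rel of relations) {
    if (rel.from === source && !path.includes(rel.to)) {
      paths.push(...findPaths(relations, rel.to, target, path));
    }
  }
  return paths;
}

console.log(findPaths(relations, 'order_items', 'customers'));
// [['order_items', 'orders', 'customers']]
```

Each path then tells you which join chain a lookup query needs to walk to map a changed row back to the target table.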
Step 2) Track changes in database using a transactional outbox
* Create an outbox table with fields: id, timestamp, event, table_name, table_id, jsonb payload
* After running migrations, have a script that ensures every table has insert/update/delete triggers, each of which inserts a row into the outbox on that event
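That post-migration script could generate trigger DDL per table, something like the sketch below. The exact outbox column names and the trigger/function naming are assumptions here, and it assumes every table has an `id` primary key:

```javascript
// Generate outbox-trigger DDL for one table. Run this for every table
// after migrations so no table is missing its triggers.
function outboxTriggerDDL(tableName) {
  return `
CREATE OR REPLACE FUNCTION ${tableName}_to_outbox() RETURNS trigger AS $$
BEGIN
  INSERT INTO outbox ("timestamp", event, table_name, table_id, payload)
  VALUES (now(), TG_OP, TG_TABLE_NAME,
          CASE WHEN TG_OP = 'DELETE' THEN OLD.id ELSE NEW.id END,
          CASE WHEN TG_OP = 'DELETE' THEN to_jsonb(OLD) ELSE to_jsonb(NEW) END);
  RETURN NULL; -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS ${tableName}_outbox ON ${tableName};
CREATE TRIGGER ${tableName}_outbox
  AFTER INSERT OR UPDATE OR DELETE ON ${tableName}
  FOR EACH ROW EXECUTE FUNCTION ${tableName}_to_outbox();
`.trim();
}

console.log(outboxTriggerDDL('orders'));
```

Because the triggers fire inside the same transaction as the data change, an outbox row is committed atomically with the change it records, which is the whole point of the transactional outbox.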
Step 3) Compute (you want to do reporting for a certain table (the "target table") which has relationships with other tables, and for this table you want a "materialized" view)
* Have a script on a loop that takes all the outbox entries since the last processed entry
* Use the queries from step 1 to find out which ids of the "target table" are affected
* Then materialize/compute only those ids
* Store in Elasticsearch (in our case)
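The core of that loop might look like the sketch below. `lookupAffected` is a hypothetical stub standing in for the SQL queries generated in step 1, and the table names are made up:

```javascript
// Stub for the step-1 lookup: map a changed (table, id) pair to the ids of
// affected "target table" rows. In the real script this runs generated SQL.
function lookupAffected(tableName, id) {
  const map = { orders: id, order_items: Math.floor(id / 10) };
  return tableName in map ? [map[tableName]] : [];
}

// Collect the distinct set of target-table ids that need re-materializing,
// given the outbox entries seen since the last processed entry.
function affectedTargetIds(outboxEntries, lastProcessedId) {
  const ids = new Set();
  for (const entry of outboxEntries) {
    if (entry.id <= lastProcessedId) continue; // already processed
    for (const targetId of lookupAffected(entry.table_name, entry.table_id)) {
      ids.add(targetId);
    }
  }
  return [...ids]; // only these rows get recomputed and sent to Elasticsearch
}

const outbox = [
  { id: 1, table_name: 'orders', table_id: 7 },
  { id: 2, table_name: 'order_items', table_id: 71 },
  { id: 3, table_name: 'sessions', table_id: 9 }, // unrelated table: ignored
];
console.log(affectedTargetIds(outbox, 0)); // [7]
```

Deduplicating through a Set matters: many outbox entries often invalidate the same target row, but each row only needs to be recomputed once per batch.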
This is not a solution if you are Facebook scale. But as a small SaaS company without too many transactions, this works brilliantly. And with more tenants you can just scale it up by sharding the outbox.
Bonus points: have elasticsearch (ingest node) and a postgres replica on the same host as the "compute" script. So if you have a lot of queries calculating the "dependencies" you get better performance.
sorry for my terrible explanation
You could also, for instance, create a script that uses a PostgreSQL logical replication connection (just like Debezium) to stream the changes into Elasticsearch, without a full Kafka Connect setup and all the training and maintenance that comes with it.
What I am describing is computing the data into a materialized format before storing it in Elasticsearch. So it's more efficient for Elasticsearch to work with, and there is no longer any need for joins.
So instead of recomputing the materialized data in its entirety every time, you want to be smarter about it. The root table of the materialized data depends on maybe 5 other tables. So if data in those 5 other tables changes, you need to know whether it has a relationship with a row in the "root table", and then re-materialize only those rows.
Materialize does this by having its own SQL language in which you define your materialized view, which compiles down to clever algorithms and uses an execution engine to get to this result.
What I am doing is just having a lookup graph + queries to see which table/id entries are invalidated. And I re-materialize using normal SQL plus some extra processing in a Node.js script to make the result more optimal for Elasticsearch.
It's not as fancy. But it works and does the job.
Materialize, by contrast, has been explicitly designed to preserve the consistency present in your upstream system. Our PostgreSQL source, for example, ensures that transactions committed to PostgreSQL appear atomically in Materialize, even when those transactions span multiple tables. See our "consistency guarantees" docs for some more information [0]. We have some additional features coming down the pipe, too, like allowing you to guarantee that your queries against Materialize reflect the latest data in your upstream sources [1].
[0]: https://materialize.com/docs/unstable/overview/isolation-lev...
[1]: https://github.com/MaterializeInc/materialize/issues/11531
Then you can have "processes" that query for new data in an input table and update aggregates/derived tables from it simply with `select * ... where ChangeSequenceNumber > @MaxSequenceNumberFromPreviousExecution`...
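That polling pattern, sketched in-memory in JavaScript (the row shape and the running sum are made-up illustrations; in practice the high-water-mark filter is the SQL `WHERE` clause above):

```javascript
// Fold newly seen rows into a running aggregate, keeping a high-water mark.
// Rows at or below maxSeq were already incorporated in a previous poll,
// so re-delivered rows are skipped and the update is idempotent.
function updateAggregate(state, rows) {
  let { total, maxSeq } = state;
  for (const row of rows) {
    if (row.seq <= maxSeq) continue; // already processed in an earlier batch
    total += row.amount;
    maxSeq = row.seq;
  }
  return { total, maxSeq };
}

let state = { total: 0, maxSeq: 0 };
state = updateAggregate(state, [{ seq: 1, amount: 10 }, { seq: 2, amount: 5 }]);
// Second poll overlaps with the first (seq 2 is re-delivered):
state = updateAggregate(state, [{ seq: 2, amount: 5 }, { seq: 3, amount: 7 }]);
console.log(state); // { total: 22, maxSeq: 3 }
```

This assumes rows become visible in sequence order, which is exactly the hard part in an OLTP database with concurrent transactions; that ordering problem is what the changefeed project linked below addresses.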
The idea here implemented for Microsoft SQL for the OLTP case:
https://github.com/vippsas/mssql-changefeed
https://github.com/vippsas/mssql-changefeed/blob/main/MOTIVA...
If you are ingesting events from e.g. Kafka or other similar sources, though, it is easier to assign sequence numbers and this is not needed.
It's really exciting to see a system being built by people who know what they are doing; it will hopefully be a more "correct" solution.
I've been following Materialize, as their blog posts are a great source of inspiration when working on OctoSQL[0] (a CLI SQL dataflow engine), but I was a bit surprised by how few data sources they support (basically Kafka and Postgres, based on their docs). Now that they're switching/pivoting to being a database themselves, this makes much more sense.
I also think the architecture is really cool. Cloud-native is the way to go for modern databases and will make adoption much easier than something you'd have to host on bare metal. One question though, does this mean the open-source version is basically deprecated now and further development is closed-source, or does the open-source project represent the "compute" part of the "next gen Materialize"?
Congrats and good luck with further development!
The core components of Materialize have always been licensed under the BSL [0], with a conversion to the Apache v2.0 License four years from the date of publication. These components (now called the storage, compute, and adapter layers) continue to be developed in the open under the same license, but are no longer packaged or supported for on-premise deployment.
Timely Dataflow and Differential Dataflow [1] are the open source frameworks that have always been at the heart of the compute layer in Materialize.
If you're referring to the downloadable binary when you mention "the open-source version": we decided the opportunity cost of splitting our attention and continuing to release updated binary versions by "rebundling" Materialize is too high. We have a small team and ambitious plans for Materialize (maybe some `WITH RECURSIVE`?), so we don't plan to release updates to the binary.
[0] https://github.com/MaterializeInc/materialize/blob/main/LICE...
[1] https://timelydataflow.github.io/differential-dataflow/intro...
But unfortunately they only seem to be interested in enterprise customers: Not only is it not open source, but there's no open sign up and pricing for compute nodes isn't publicly available!
We support both `INSERT ... RETURNING` [0] and the `jsonb` data type [1]. The only feature in your list that we're actually missing is UPSERT (i.e., `INSERT ... ON CONFLICT`). We have a tracking issue [2] if you're interested in following along.
> But unfortunately they only seem to be interested in enterprise customers: Not only is it not open source, but there's no open sign up and pricing for compute nodes isn't publicly available!
Rest assured: we're working towards open sign up! We're at the very beginning of our early access period.
[0]: https://materialize.com/docs/sql/insert/#details
[1]: https://materialize.com/docs/sql/types/jsonb/
[2]: https://github.com/MaterializeInc/materialize/issues/6668
I really hope materialize eventually inspires something like it in the open source world.
With Materialize the database, it really depends on price/performance whether I could use it (could it be really cheap when idle, like Aurora Serverless?), but two things that would make it easier to use would be the ability to purchase it through AWS Marketplace and to deploy it/use it in my organization’s own AWS account. As an enterprise dev team senior manager I then do not have to go through a vendor approval process or deal with my procurement department, nor do I need to worry about third party data control.
Bad for big data. Great for small and simple data sets. But who is using Kafka with small data?
Also they do not integrate at all with custom data types in Postgres IME. E.g. an enumeration in your table will mean materialize can’t read the table as a source. Lame.
> Also they do not integrate at all with custom data types in Postgres IME. E.g. an enumeration in your table will mean materialize can’t read the table as a source. Lame.
We're aware of this and are working on a fix. There are two tracking issues, if you'd like to follow along:
* #6818 (https://github.com/MaterializeInc/materialize/issues/6818) is specifically about supporting PostgreSQL enum types
* #15073 (https://github.com/MaterializeInc/materialize/issues/15073) is about handling PostgreSQL types that are unknown to Materialize in a more general-purpose way
This means hand-coding triggers to keep the materializations up to date, or else marking them as out of date (because some operations would be slow or hard to hand-code triggers for), but this works remarkably well.
As a bonus, I get an update history table that can be used to generate updates to external systems.
In principle one can get the AST for a VIEW's query from the PG catalog and use that to generate triggers on the tables it queries to keep it up to date. In practice that's only trivial for some kinds of queries, and I've not written such a tool yet.
[0] https://github.com/twosigma/postgresql-contrib/blob/master/m...
Also I love the tagline "Consistency, Scalability, Low Latency: Pick Three"
I checked the docs, but they mention streaming DBs, Timely, and Differential Dataflow, none of which I know either.
That said, do you support persistent tables or time travel (e.g., accessing the contents of the view as it was at time t)? If not, how should a situation where multiple independent clients read the view be handled? If two people load the same dashboard but see differences based on when the table was read, that might cause confusion.