MariaDB Temporal Data Tables

(mariadb.com)

124 points by alecbenzer 5y ago | 43 comments

sjwright 5y ago

I've been begging for exactly this for quite some time. Because of the way I use databases, I've always been bewildered why this wasn't a core part of SQL from the very beginning.

From what I'm reading there's still a lot to be fleshed out to be maximally useful to me, but even in its current state I could imagine using this.

— I'd like to have a field property that limits stored values to a single version and thus is automatically cleared whenever the row is updated. This would be useful for inlining change annotations, and for associating a user_id to specific changes.

— I'd like to be able to arbitrarily select the n-1 value of fields regardless of their time period. E.g.

  select username, previous(username)
  from users

— When viewing a specific version, I'd like to know whether a field's value was supplied in that revision. That's distinct from if the field was changed. I want to know if the value was supplied—even if it was identical to the previous value.

— This might be possible already (it's hard to tell) but I'd like to be able to query/join on any revision. For example I might want to ask the question "show me all products that james has ever modified". That could then get more specific, e.g. "show me all products where james changed the price".
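The `previous(username)` wish above has no direct equivalent that I know of, but a window function over the row history gets close. A minimal sketch, using SQLite through Python's `sqlite3` as a stand-in for MariaDB; the `users_history` table and its `valid_from` column are hypothetical, and on a system-versioned table the history rows would come from a `FOR SYSTEM_TIME ALL` query instead:

```python
import sqlite3

# Hypothetical history table: one row per version of each user.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users_history (user_id INTEGER, username TEXT, valid_from INTEGER)"
)
conn.executemany(
    "INSERT INTO users_history VALUES (?, ?, ?)",
    [(1, "jim", 1), (1, "james", 2), (2, "ann", 1)],
)

# LAG() looks one version back within each user's own history.
# Window functions need SQLite 3.25 or newer.
rows = conn.execute(
    """
    SELECT user_id,
           username,
           LAG(username) OVER (PARTITION BY user_id ORDER BY valid_from)
               AS previous_username
    FROM users_history
    ORDER BY user_id, valid_from
    """
).fetchall()

for row in rows:
    print(row)
```

The first version of each user has no predecessor, so its `previous_username` comes back as NULL (`None` in Python).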

refset 5y ago

> I've always been bewildered why this wasn't a core part of SQL from the very beginning.

It's a long and messy history (no pun intended), but essentially it was rarely practical to consider retaining database history for the first few decades of SQL, due to physical storage costs & limitations. Snodgrass and Jensen proposed initial bitemporal extensions in the 90s and a lot of research was done subsequently, but most vendors didn't make their move until the 2011 standard was formed (Oracle Flashback being the most notable exception). Unfortunately, the rollout of the 2011 temporal standard has been underwhelming across the board, as each vendor ended up implementing something subtly different, which I think has massively hindered adoption. Since then I would guess that "immutability" has been the largest driving force behind the resurgence of interest.

sjwright 5y ago

> it was rarely practical to consider retaining database history for the first few decades of SQL

That does make sense from a historical perspective and I don't doubt that's why. But still I find it unsatisfying because any competent database schema will always retain the history that needs to be retained. If you don't have the storage capacity, you choose to not store so much history. If you don't have native concepts for storing history, you kludge it yourself.

Whether you have native temporal support or have to kludge a DIY solution in the schema, the data you need to store gets stored.

My frustration is that I feel that temporal concepts should have been deeply native to SQL right to its core. History should have been as fundamental to database design as columns and rows. It should be a thing you turn off when you don't want it, not a thing you turn on when you do.

waheoo 5y ago

Joining on self by a different time range from the current is probably doable?

docsapp_io 5y ago

I really hope Postgres can support temporal tables out of the box. Temporal tables can simplify development of features that need audits.

mulander 5y ago

Funny historical and architectural fact about PostgreSQL: it can actually do this for all tables, without special features. Unfortunately, the facility to perform a query like this is no longer exposed, but it shouldn't be impossible to re-add in a more modern way.

Essentially PostgreSQL has copy-on-write semantics, so historical records exist unless a vacuum marks them as no longer needed and subsequent insert/updates overwrite the values.

In the past, when PostgreSQL had the postquel language (before SQL was added), there was special syntax to access data at specific points in time.

This is nicely outlined in "THE IMPLEMENTATION OF POSTGRES" by Michael Stonebraker, Lawrence A. Rowe and Michael Hirohama[1]. Go ahead, open the PDF and search for "time travel", or read the quotes below.

> The second benefit of a no-overwrite storage manager is the possibility of time travel. As noted earlier, a user can ask a historical query and POSTGRES will automatically return information from the record valid at the correct time.

Quoting the paper again:

> For example to find the salary of Sam at time T one would query:

    retrieve (EMP.salary)
    using EMP [T]
    where EMP.name = "Sam"

> POSTGRES will automatically find the version of Sam’s record valid at the correct time and get the appropriate salary.

[1] - https://dsf.berkeley.edu/papers/ERL-M90-34.pdf

jarym 5y ago

Really nice background, thanks for sharing! I knew Postgres did CoW internally and always wondered why the SQL standard for time-travel queries was not implemented.

I am using triggers and audit tables, which works, but my data requirements are relatively small, so I won't face any challenges that way. However, re-using the old rows like this would be a far more efficient approach if it were supported natively.

mildbyte 5y ago

Shameless plug (I'm a co-founder) but this is basically what we've built with Splitgraph[0]: we can add change tracking to tables using PostgreSQL's audit triggers and let the user switch between different versions of the table / query past versions.

[0] https://www.splitgraph.com/product/data-lifecycle/research

jacques_chester 5y ago

Change tracking is not a fully bitemporal scheme, though. A bitemporal table tracks _two_ timelines. One is about when facts in the world were true ("valid time" or "application time"), the other is about the history of particular records in the database ("transaction time" or "system time"). Change tracking can only capture the second.
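To make the two timelines concrete, here is a small sketch in plain Python. The `Version` record, its field names and the integer "day" timestamps are all hypothetical; the point is only that a bitemporal lookup takes two time arguments, one per timeline:

```python
from dataclasses import dataclass
from typing import Optional

FOREVER = float("inf")


@dataclass(frozen=True)
class Version:
    salary: int
    valid_from: float     # valid time: when the fact was true in the world
    valid_to: float
    recorded_from: float  # transaction time: when the database believed it
    recorded_to: float


def as_of(history, valid_at, known_at) -> Optional[int]:
    """Salary that applied at `valid_at`, according to what was recorded at `known_at`."""
    for v in history:
        if (v.valid_from <= valid_at < v.valid_to
                and v.recorded_from <= known_at < v.recorded_to):
            return v.salary
    return None


# Day 1: we record a salary of 100, effective from day 1.
# Day 10: we learn the salary had actually changed to 120 on day 5,
# so the old belief is closed and two corrected rows are recorded.
history = [
    Version(100, 1, FOREVER, 1, 10),
    Version(100, 1, 5, 10, FOREVER),
    Version(120, 5, FOREVER, 10, FOREVER),
]

print(as_of(history, valid_at=6, known_at=7))   # what we believed on day 7
print(as_of(history, valid_at=6, known_at=11))  # what we believe after the correction
```

Change tracking alone only gives the `recorded_*` axis; the retroactive correction on day 10 is what needs the separate `valid_*` axis.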

refset 5y ago

That sounds neat. What does the performance of querying past versions look like? For instance, is lookup time linear with the amount of history or do you maintain special temporal indexes?

gen220 5y ago

I work at a company where (many years ago) we built an extension to Postgres (and some helper libs in SQLAlchemy, Go) for implementing decently-performant bitemporal tables (biggest history tables have hundreds of millions of rows). Pretty much our entire company runs on it today.

We implemented the “minimum viable” features (i.e. automatic expiring, non-destructive updates, generated indexes and generated table declarations), but left some of the “harder” ideas up to the application designer (adding semantic versioning on top of temporal versioning, schema migrations).

It’s worked really well for us. I can’t think of anything we’ve done that’s had a higher ROI than this. I’ll really miss it when I leave!

eyelidlessness 5y ago

This is the kind of thing I always design with the possibility of open sourcing in mind, even if I don’t have buy in or dedicated time to make the open source effort at that moment. Even if you miss it when you’re gone, you’ll have the benefit of hindsight of where the boundaries are between your own business needs and the more general use case, and can take that with you and apply the same lessons (often with improvements) the next time you face a similar problem.

refset 5y ago

System time (aka "transaction time") is also invaluable for debugging if you annotate it with release versions. Unless an application is particularly strapped for storage costs, which is rare in this day and age, it ought to be the default choice to use built-in system time versioning wherever it exists.

alecbenzer (OP) 5y ago

Had no idea until recently that MariaDB supported this out of the box. Does anyone have experience using this? How does it compare to https://github.com/scalegenius/pg_bitemporal ?

amluto 5y ago

> mysqldump does not read historical rows from versioned tables, and so historical data will not be backed up. Also, a restore of the timestamps would not be possible as they cannot be defined by an insert/a user.

Given this caveat, this seems unusable for production systems.

crazygringo 5y ago

Well, conceptually this makes sense for what mysqldump is.

I'm guessing that "backups" would actually have to be live replicas set up from the start, and if the master fails, you convert a replica to master.

In addition, you could perform actual static backups by pausing a replica, backing up the actual table files themselves, then resuming the replica (and it will catch up). In case of total failure, you just dump the table files into a fresh install of MariaDB. (Copying database files is a common technique for migrating data, not just SQL command import/export.)

Is there any reason why these wouldn't work?

PixyMisa 5y ago

Or ZFS snapshots, for example.

From the description it looks like it would be easy to do backups, it's just that mysqldump is not currently aware of temporal tables.

Just use

    SELECT * FROM t FOR SYSTEM_TIME ALL;

And export it in an appropriate format.

mathnode 5y ago

If you are using Mariabackup or a volume snapshot, then you retain the history.

shivekkhurana 5y ago

I’m very happy to see an open source DB that can do something similar to Datomic/Crux but is not tied to Clojure. It doesn’t seem as sophisticated, but I hope this project grows.

For anyone wondering why temporality matters and how this is different from adding a “create_time” to each row, I would highly recommend watching Rich Hickey’s talk “The Value of Values”.

TekMol 5y ago

Is there a diff tool? Like show me all differences between now and 5 minutes ago?

Could be nice to see what magic goes on behind the scenes in some applications.

For example when you do some clicks in the backend of WordPress and wonder what it actually did to the data.
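One way to approximate such a diff is to fetch the table twice, once as of each point in time (for a system-versioned MariaDB table, presumably with `FOR SYSTEM_TIME AS OF` queries), and compare the results by primary key. A rough sketch in Python; `diff_snapshots` and the sample rows are made up for illustration:

```python
def diff_snapshots(before, after):
    """Compare two {primary_key: row_dict} snapshots of the same table."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {}
    for k in before.keys() & after.keys():
        # Record (old, new) for every column whose value differs.
        cols = {
            col: (old, after[k].get(col))
            for col, old in before[k].items()
            if old != after[k].get(col)
        }
        if cols:
            changed[k] = cols
    return added, removed, changed


# Hypothetical snapshots of a posts table, five minutes apart.
five_minutes_ago = {
    1: {"title": "Hello", "status": "draft"},
    2: {"title": "Old", "status": "publish"},
}
now = {
    1: {"title": "Hello", "status": "publish"},
    3: {"title": "New", "status": "draft"},
}

added, removed, changed = diff_snapshots(five_minutes_ago, now)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

This loads both snapshots into memory, so it only suits small tables or narrow time windows.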

crazygringo 5y ago

This is fascinating. I've got two basic questions, however:

1) Is this always going to be performant with indices? It seems like "time" is kind of like another index here, and when designing queries, which indices are used and in which order can be the difference between taking milliseconds and taking an hour. It's not obvious to me whether this will have hidden gotchas or query execution complexities, or if it's designed in a way that's so restricted and integrated into indices themselves that query performance will always remain within the same order of magnitude.

2) What is the advantage of building this into the database, instead of adding your own timestamp columns e.g. 'created_timestamp' and 'expunged_timestamp'? Not only does that seem relatively simple, but it gives you the flexibility of creating indices across multiple columns (including them) for desired performance, the ability to work with tools like mysqldump, and it's just conceptually simpler to understand the database. And if the question is data security, is there a real difference between a "security layer" that is built around the database, versus one built into it? It would be fairly simple to write a command-line tool to change the MariaDB data files directly, no?
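For reference, the do-it-yourself variant from question 2 might look roughly like this. It is a sketch only, using SQLite through Python's `sqlite3` rather than MariaDB; the `products` table, the helper names and the integer timestamps are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE products (
           id INTEGER,
           price INTEGER,
           created_timestamp INTEGER NOT NULL,
           expunged_timestamp INTEGER  -- NULL marks the current row
       )"""
)


def set_price(product_id, price, now):
    # An "update" closes the current version (if any) and inserts a new one.
    with conn:  # both statements commit together or not at all
        conn.execute(
            "UPDATE products SET expunged_timestamp = ? "
            "WHERE id = ? AND expunged_timestamp IS NULL",
            (now, product_id),
        )
        conn.execute(
            "INSERT INTO products VALUES (?, ?, ?, NULL)",
            (product_id, price, now),
        )


def price_as_of(product_id, when):
    # Every read has to carry the time predicate explicitly.
    row = conn.execute(
        """SELECT price FROM products
           WHERE id = ? AND created_timestamp <= ?
             AND (expunged_timestamp IS NULL OR expunged_timestamp > ?)""",
        (product_id, when, when),
    ).fetchone()
    return row[0] if row else None


set_price(1, 100, now=10)
set_price(1, 120, now=20)
print(price_as_of(1, 15), price_as_of(1, 25), price_as_of(1, 5))
```

Even in this small form, an update is two statements that must share a transaction, and every query, join and uniqueness rule has to account for the timestamp pair.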

baq 5y ago

re 2) - it is a complex topic but in short, the queries get really complex really fast for anything other than a simple select. see http://www2.cs.arizona.edu/~rts/tdbbook.pdf.

also, DDL migrations become a nightmare.

alecbenzer (OP) 5y ago

> Not only does that seem relatively simple

I haven't thought about this too deeply, but I think "simple" is overstating it. Being able to turn on versioning for any table by basically just pushing a button seems really powerful.

There's application-layer stuff like paper_trail for Rails that can do this for you, but you're stuck if your language doesn't have a good one.

Building it into the db also means that any out-of-band direct edits to the DB also get tracked.

jacques_chester 5y ago

> What is the advantage of building this into the database, instead of adding your own timestamp columns e.g. 'created_timestamp' and 'expunged_timestamp'?

If it's present in every table, the database can be optimised for it.

polskibus 5y ago

How does this feature compare to MS SQL's Temporal Tables https://docs.microsoft.com/en-us/sql/relational-databases/ta...?

This feature seems well suited to some of the cases where event sourcing is introduced. I wonder if anyone has successfully applied event sourcing using temporal tables to reduce the amount of work that has to be done in application code (Akka, etc.).

deleuze 5y ago

When we looked at temporal tables in SQL Server for event sourcing, I was put off by the fact that you have to read from multiple tables. CDC + some external data source still seems to be the better solution here, imo.

polskibus 5y ago

What do you use for your event sourcing? Do you use Akka/Akka.NET Persistence or some other application framework?

Drdrdrq 5y ago

I understand the benefits of this feature for audits, but how does one deal with GDPR requirements? Is there some way to alter historic data to remove PII, or should the affected columns be excluded?

satyrnein 5y ago

Possibly the idea of "crypto-shredding" could apply, where the PII values are encrypted and you throw away the key if you get a delete request.
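A toy sketch of the crypto-shredding idea, in Python. Everything here is illustrative: the in-memory `keys` store and the helper names are hypothetical, and the SHAKE-256 keystream XOR is a stand-in cipher chosen only to keep the example dependency-free. A real system would use a vetted authenticated cipher and a proper key management service:

```python
import hashlib
import secrets

# Hypothetical per-user key store, kept OUTSIDE the versioned tables.
keys = {}


def _keystream_xor(key, nonce, data):
    # Toy stream cipher: XOR the data with a SHAKE-256 keystream.
    # For illustration only; not a substitute for a real AEAD cipher.
    stream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


def encrypt_pii(user_id, plaintext):
    key = keys.setdefault(user_id, secrets.token_bytes(32))
    nonce = secrets.token_bytes(16)
    return nonce + _keystream_xor(key, nonce, plaintext.encode())


def decrypt_pii(user_id, blob):
    key = keys.get(user_id)
    if key is None:
        return None  # key shredded: the stored bytes can no longer be read
    return _keystream_xor(key, blob[:16], blob[16:]).decode()


def shred(user_id):
    # Handling a delete request means destroying the key, not the rows.
    keys.pop(user_id, None)


blob = encrypt_pii(42, "alice@example.com")  # this is what history would store
before = decrypt_pii(42, blob)
shred(42)
after = decrypt_pii(42, blob)
print(before, after)
```

The history rows stay untouched, which is the attraction for immutable or versioned storage; whether this satisfies a given deletion requirement is a legal question the code cannot answer.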

PixyMisa 5y ago

You can't alter historic data, but you can include or exclude selected columns from versioning. You can also purge all history by date range, but apparently not just the history for a given record.

ec109685 5y ago

There are GDPR exceptions for use cases like audit trails, so if there is a requirement to keep the data, you can.

It’s an excellent point to be aware of.

beckingz 5y ago

MariaDB continues to be great.

Now all they need is materialized views and they'll be close to Postgres.

Xlurker 5y ago

TimescaleDB competitor?

grzm 5y ago

TimescaleDB is for time series data. Temporal data tables are for “versioning” data; for example, being able to query the state of a database as-of a certain time.

https://en.wikipedia.org/wiki/Time_series_database

https://en.wikipedia.org/wiki/Temporal_database
