Somewhat related question and apologies if this is already stated in the documentation (it's rather dense and I haven't had the time to read through it completely)...
Can you use sqlglot to create custom DDL dialects that have custom first-class objects? For instance, if I wanted to build a custom SQL/DDL/DML dialect that had a new kind of object such as a "pipe", "kitchensink", etc., would sqlglot be a good tool to use?
I've tried playing around with Apache Calcite, but it lost me pretty quickly since the examples to customize/extend DDLs were quite lacking in my opinion.
Here's an example of how we use SQLGlot to output raw Python code directly from SQL.
https://github.com/tobymao/sqlglot/blob/main/sqlglot/executo...
Thanks very much, looking forward to spending some time reading through all of this!
Side note: Iaroslav (post author) and Toby (sqlglot creator) are both amazing, and I'm so glad that they're working on open source projects like this.
In terms of versioning of engines, I haven't implemented that yet, but presumably it could be done by adding a dialect subclass and having versioning route to it, so we could do something like parse(sql, dialect="spark", version=...), which would then route to a Spark 2.3 dialect.
Happy to chat more about this and we can see about adding it (or feel free to make a pr). You can DM me on twitter or some other avenue as well if you want to dive in deep.
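The version-routing idea could be sketched like this. This is a toy illustration of the dispatch only, not sqlglot's actual registration machinery; all class and parameter names here are made up:

```python
# Toy sketch of version-routed dialects: subclasses register themselves
# under a (name, version) key, and lookup falls back to the unversioned
# dialect when no exact version match exists.

class Dialect:
    # maps (name, version) -> dialect class; a None version is the default
    _registry = {}

    def __init_subclass__(cls, name=None, version=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if name is not None:
            Dialect._registry[(name, version)] = cls

    @classmethod
    def resolve(cls, name, version=None):
        return cls._registry.get((name, version)) or cls._registry[(name, None)]

class Spark(Dialect, name="spark"):
    pass

class Spark23(Spark, name="spark", version="2.3"):
    pass

def parse(sql, dialect, version=None):
    # stand-in for real parsing: just report which dialect class was chosen
    return Dialect.resolve(dialect, version).__name__
```

With this shape, parse(sql, dialect="spark", version="2.3") routes to the versioned subclass, and unknown versions fall back to the base dialect.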
I implemented the Fast Match / Simple Edit Script algorithm almost 10 years ago for my Master's thesis[1] and my database project[2][3], in order to import revisions of files with a hopefully minimal number of edit operations between the stored revision and a new one (back then it was for XML databases).
The diffing was only one aspect of the visual analytics approach to comparing the revisions (tree structures) visually [4]. Internally, the nodes are addressed through dense, ascending 64-bit ints stored in a special trie index. Furthermore, during the import, changes can optionally be tracked, and a rolling hash can optionally be stored for each node. After the import you can easily query the changes or execute time-travel queries.
Technically, a tree of tries is mapped to an append-only data file using a persistent data structure (in the functional sense): copy-on-write with path copying, plus a novel sliding-snapshot algorithm for the leaf data pages themselves.
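The copy-on-write path-copying idea can be sketched with a toy persistent binary search tree. This is a minimal illustration of the technique, assuming nothing about SirixDB's actual on-disk layout: each update copies only the nodes on the root-to-leaf path and shares every untouched subtree with the previous revision, so old revisions stay readable.

```python
# Toy persistent BST with path copying: insert() returns a NEW root and
# never mutates existing nodes, so every revision remains intact.

class Node:
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def insert(node, key, value):
    if node is None:
        return Node(key, value)
    if key < node.key:
        # copy this node, share the untouched right subtree
        return Node(node.key, node.value, insert(node.left, key, value), node.right)
    if key > node.key:
        return Node(node.key, node.value, node.left, insert(node.right, key, value))
    # same key: copy the node with the new value, share both subtrees
    return Node(key, value, node.left, node.right)

def lookup(node, key):
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None

# every insert yields a new revision; earlier revisions stay queryable
rev1 = insert(None, 2, "a")
rev2 = insert(rev1, 1, "b")
rev3 = insert(rev2, 2, "A")  # "update" key 2 only in revision 3
```

Time travel then amounts to keeping a root pointer per revision; the structural sharing keeps the per-revision cost proportional to the path length, not the tree size.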
I've always had the vision of implementing different visualizations to compare the revisions in a web frontend, but I'm currently spending my time on improving the latency of both writes and reads.
Thus, if someone would like to help, that would be awesome :-)
Kind regards
Johannes
[1] https://github.com/JohannesLichtenberger/master-thesis/blob/...
[2] https://github.com/sirixdb/sirix
[3] https://github.com/sirixdb/sirix/tree/master/bundles/sirix-c...
So if you run the optimizer first and then the diff tool, it could solve this kind of use case.
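The "normalize first, then diff" idea can be sketched with a toy expression tree (made-up tuple encoding, not sqlglot's optimizer): canonicalize commutative operators by sorting their operands, so that semantically equal expressions become structurally equal before the diff runs.

```python
# Toy canonicalization before diffing: sort the operands of commutative
# operators so `b + c` and `c + b` compare equal.

COMMUTATIVE = {"+", "*"}

def canonicalize(expr):
    # expr is either a leaf (str) or a tuple (op, left, right)
    if isinstance(expr, str):
        return expr
    op, left, right = expr
    left, right = canonicalize(left), canonicalize(right)
    if op in COMMUTATIVE and repr(right) < repr(left):
        left, right = right, left
    return (op, left, right)

def equivalent(a, b):
    return canonicalize(a) == canonicalize(b)
```

As the overflow discussion elsewhere in this thread shows, such rewrites are only safe under assumptions (no overflow, no string concatenation overloading of +), so a real optimizer has to be careful about when it applies them.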
EDIT: I can't find it, I searched for 30 mins, I promise this exists though. If anyone else can remember the name of it, please post.
https://pganalyze.com/blog/pg-query-2-0-postgres-query-parse...
Scroll down to the section marked "Fingerprints in...".
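The gist of query fingerprinting, as I understand it from that post, is hashing a normalized form of the query in which literals are stripped, so the same query shape gets the same fingerprint regardless of its constants. A toy regex-based sketch; pg_query does this properly on the parse tree, not with regexes:

```python
import hashlib
import re

def fingerprint(sql):
    # Toy normalization: lowercase, replace string and numeric literals
    # with a placeholder, and collapse whitespace. Only an approximation
    # of what a parser-based fingerprint would do.
    s = sql.strip().lower()
    s = re.sub(r"'[^']*'", "?", s)   # string literals
    s = re.sub(r"\b\d+\b", "?", s)   # numeric literals
    s = re.sub(r"\s+", " ", s)
    return hashlib.sha1(s.encode()).hexdigest()
```

Two queries that differ only in their constants then hash to the same fingerprint, which is what makes this useful for grouping query workloads.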
P.S.: We work on DiffLens. It currently supports TS, JS, CSS, and text diffs, and we're working on a VS Code extension.
> Semantically the query hasn’t changed
Now hang on a minute. Extend that to 3 terms and it can change behavior (this is MSSQL, but I'd guess it's true in any dialect):
declare @hi int = 2147483647;
declare @lo int = -2147483648;
declare @x int = @hi + @lo + @hi; -- ok
declare @y int = @hi + @hi + @lo; -- 'Arithmetic overflow error'
Worse yet with floats. I see what you're saying, and it's good stuff; I'm thinking about this myself, and I appreciate this article and will read it properly, but the edge cases have to be acknowledged.
Edit: this kind of thing is apparently something compiler writers keep rediscovering the hard way.
@hi + @lo + @hi means (@hi + @lo) + @hi
@hi + @hi + @lo means (@hi + @hi) + @lo
Just because you can write this without the parentheses does not mean you can ignore them. To get from one to the other you need not just the commutativity rule relied on by the diffing algorithm but also the associativity rule:
(@hi + @lo) + @hi =assoc=> @hi + (@lo + @hi) =comm=> @hi + (@hi + @lo) =assoc=> (@hi + @hi) + @lo
Here it is the third step (associativity) that introduces the overflow, not the commutativity step, which should always be safe for both integers and floating-point numbers.
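The effect is easy to reproduce with a checked 32-bit add. This is a Python sketch of INT semantics with left-to-right evaluation; the actual overflow check in T-SQL of course lives in the engine:

```python
# Simulate 32-bit INT arithmetic: check for overflow after every addition,
# evaluating left to right as the parenthesization dictates.

INT_MIN, INT_MAX = -2**31, 2**31 - 1

def checked_add(a, b):
    r = a + b
    if not (INT_MIN <= r <= INT_MAX):
        raise OverflowError("Arithmetic overflow error")
    return r

hi, lo = INT_MAX, INT_MIN

# (hi + lo) + hi: fine, the intermediate result is -1
x = checked_add(checked_add(hi, lo), hi)

# (hi + hi) + lo: the intermediate hi + hi already overflows
try:
    y = checked_add(checked_add(hi, hi), lo)
except OverflowError as e:
    y = str(e)
```

The only difference between the two expressions is which intermediate sum gets computed first, which is exactly the associativity rewrite at issue.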
His example included SUM(b + c) and SUM(c + b) as equivalent. As such it probably is[1], but extend it to 3 numbers and you can't freely rearrange, as I think we agree.
[1] assuming b and c are numbers not strings, as + is string concat in some dialects
1. It's in JS and not Python. These days a common data (and data-tooling) stack revolves around Python, and fitting JS into this ecosystem is not straightforward.
2. Limited dialect support. As far as I can see, it only supports a "PostgreSQL flavor" (not sure what exactly is meant by "flavor" here). Support for dialects like Spark, Trino, and Hive SQL was crucial for us.
Definitely a worthy mention, though, thank you!
(I've been working on a similar problem, effectively diffing an XML tree.)
> when a nested query is refactored into a common table expression (CTE), this kind of change doesn’t have any functional impact on either a query or its outcome
This isn't quite true, at least in Postgres. It won't affect the outcome, but it can affect the query plan: CTEs acted as an optimization fence until they were made inlinable.
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...
> Text-based diff tools such as git diff, when applied to a code base, have certain limitations. First, they can only detect insertions and deletions, not movements or updates of individual pieces of code.
git diff can detect movements: see the --color-moved option, which can also be enabled via diff.colorMoved in your .gitconfig.