Somewhat related question and apologies if this is already stated in the documentation (it's rather dense and I haven't had the time to read through it completely)...
Can you use sqlglot to create custom DDL dialects that have custom first-class objects? For instance, if I wanted to build a custom SQL/DDL/DML dialect that had a new kind of object such as a "pipe", "kitchensink", etc., would sqlglot be a good tool to use?
I've tried playing around with Apache Calcite, but it lost me pretty quickly since the examples to customize/extend DDLs were quite lacking in my opinion.
Here's an example of how we use SQLGlot to output raw Python code directly from SQL.
https://github.com/tobymao/sqlglot/blob/main/sqlglot/executo...
Thanks very much, looking forward to spending some time reading through all of this!
Side note: Iaroslav (post author) and Toby (sqlglot creator) are both amazing, and I'm so glad that they're working on open source projects like this.
In terms of versioning of engines, I haven't implemented that yet, but presumably it could be done by adding a dialect subclass and having versioning route to it, so we could do something like parse(sql, dialect="spark", version=...), which would then route to a Spark 2.3 dialect.
Happy to chat more about this and we can see about adding it (or feel free to make a pr). You can DM me on twitter or some other avenue as well if you want to dive in deep.
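The version-routing idea could be sketched like this. This is a toy illustration of the dispatch only, not sqlglot's actual registration machinery; all class and parameter names here are made up:

```python
# Toy sketch of version-routed dialects: subclasses register themselves
# under a (name, version) key, and lookup falls back to the unversioned
# dialect when no exact version match exists.

class Dialect:
    # maps (name, version) -> dialect class; a None version is the default
    _registry = {}

    def __init_subclass__(cls, name=None, version=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if name is not None:
            Dialect._registry[(name, version)] = cls

    @classmethod
    def resolve(cls, name, version=None):
        return cls._registry.get((name, version)) or cls._registry[(name, None)]

class Spark(Dialect, name="spark"):
    pass

class Spark23(Spark, name="spark", version="2.3"):
    pass

def parse(sql, dialect, version=None):
    # stand-in for real parsing: just report which dialect class was chosen
    return Dialect.resolve(dialect, version).__name__
```

With this shape, parse(sql, dialect="spark", version="2.3") routes to the versioned subclass, and unknown versions fall back to the base dialect.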
I implemented the Fast Match / Simple Edit Script algorithm almost 10 years ago for my Master's thesis[1] and my database project[2][3], in order to import revisions of files with a hopefully minimal number of edit operations between the stored revision and a new one (back then it was for XML databases).
The diffing was only one aspect of the visual analytics approach to comparing the revisions (tree structures) visually [4]. Internally, the nodes are addressed through dense, ascending 64-bit ints stored in a special trie index. Furthermore, during the import, changes can optionally be tracked, and a rolling hash can optionally be stored for each node. After the import you can easily query the changes or execute time-travel queries.
Technically, a tree of tries is mapped to an append-only data file using a persistent data structure (in the functional sense): copy-on-write with path copying, plus a novel sliding-snapshot algorithm for the leaf data pages themselves.
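The copy-on-write path-copying idea can be sketched with a toy persistent binary search tree. This is a minimal illustration of the technique, assuming nothing about SirixDB's actual on-disk layout: each update copies only the nodes on the root-to-leaf path and shares every untouched subtree with the previous revision, so old revisions stay readable.

```python
# Toy persistent BST with path copying: insert() returns a NEW root and
# never mutates existing nodes, so every revision remains intact.

class Node:
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def insert(node, key, value):
    if node is None:
        return Node(key, value)
    if key < node.key:
        # copy this node, share the untouched right subtree
        return Node(node.key, node.value, insert(node.left, key, value), node.right)
    if key > node.key:
        return Node(node.key, node.value, node.left, insert(node.right, key, value))
    # same key: copy the node with the new value, share both subtrees
    return Node(key, value, node.left, node.right)

def lookup(node, key):
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None

# every insert yields a new revision; earlier revisions stay queryable
rev1 = insert(None, 2, "a")
rev2 = insert(rev1, 1, "b")
rev3 = insert(rev2, 2, "A")  # "update" key 2 only in revision 3
```

Time travel then amounts to keeping a root pointer per revision; the structural sharing keeps the per-revision cost proportional to the path length, not the tree size.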
I've always had the vision of implementing different visualizations to compare the revisions in a web frontend, but I'm currently spending my time on improving the latency of both writes and reads.
Thus, if someone would like to help, that would be awesome :-)
Kind regards
Johannes
[1] https://github.com/JohannesLichtenberger/master-thesis/blob/...
[2] https://github.com/sirixdb/sirix
[3] https://github.com/sirixdb/sirix/tree/master/bundles/sirix-c...
So if you run the optimizer first and then the diff tool, it could solve this kind of use case.
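The "normalize first, then diff" idea can be sketched with a toy expression tree (made-up tuple encoding, not sqlglot's optimizer): canonicalize commutative operators by sorting their operands, so that semantically equal expressions become structurally equal before the diff runs.

```python
# Toy canonicalization before diffing: sort the operands of commutative
# operators so `b + c` and `c + b` compare equal.

COMMUTATIVE = {"+", "*"}

def canonicalize(expr):
    # expr is either a leaf (str) or a tuple (op, left, right)
    if isinstance(expr, str):
        return expr
    op, left, right = expr
    left, right = canonicalize(left), canonicalize(right)
    if op in COMMUTATIVE and repr(right) < repr(left):
        left, right = right, left
    return (op, left, right)

def equivalent(a, b):
    return canonicalize(a) == canonicalize(b)
```

As the overflow discussion elsewhere in this thread shows, such rewrites are only safe under assumptions (no overflow, no string concatenation overloading of +), so a real optimizer has to be careful about when it applies them.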
EDIT: I can't find it, I searched for 30 mins, I promise this exists though. If anyone else can remember the name of it, please post.
https://pganalyze.com/blog/pg-query-2-0-postgres-query-parse...
Scroll down to the section marked "Fingerprints in...".
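The gist of query fingerprinting, as I understand it from that post, is hashing a normalized form of the query in which literals are stripped, so the same query shape gets the same fingerprint regardless of its constants. A toy regex-based sketch; pg_query does this properly on the parse tree, not with regexes:

```python
import hashlib
import re

def fingerprint(sql):
    # Toy normalization: lowercase, replace string and numeric literals
    # with a placeholder, and collapse whitespace. Only an approximation
    # of what a parser-based fingerprint would do.
    s = sql.strip().lower()
    s = re.sub(r"'[^']*'", "?", s)   # string literals
    s = re.sub(r"\b\d+\b", "?", s)   # numeric literals
    s = re.sub(r"\s+", " ", s)
    return hashlib.sha1(s.encode()).hexdigest()
```

Two queries that differ only in their constants then hash to the same fingerprint, which is what makes this useful for grouping query workloads.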
P.S.: We work on DiffLens. It currently supports TS, JS, CSS, and text diffs, and we're working on a VS Code extension.
> Semantically the query hasn’t changed
Now hang on a minute. Extend that to 3 terms and it can change behavior (this is MSSQL, but I'd guess it's true in any dialect):
declare @hi int = 2147483647;
declare @lo int = -2147483648;
declare @x int = @hi + @lo + @hi; -- ok
declare @y int = @hi + @hi + @lo; -- 'Arithmetic overflow error'
Worse yet with floats. I see what you're saying, and it's good stuff; I'm thinking about this myself, and I appreciate this article and will read it properly, but the edge cases have to be acknowledged.
Edit: this kind of thing is apparently something compiler writers keep rediscovering the hard way.
@hi + @lo + @hi means (@hi + @lo) + @hi
@hi + @hi + @lo means (@hi + @hi) + @lo
Just because you can write this without the parentheses does not mean you can ignore them. To get from one to the other you need not just the commutativity rule relied on by the diffing algorithm but also the associativity rule:
(@hi + @lo) + @hi =assoc=> @hi + (@lo + @hi) =comm=> @hi + (@hi + @lo) =assoc=> (@hi + @hi) + @lo
Here it is the third step (associativity) that introduces the overflow, not the commutativity step, which should always be safe for both integers and floating-point numbers.
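The effect is easy to reproduce with a checked 32-bit add. This is a Python sketch of INT semantics with left-to-right evaluation; the actual overflow check in T-SQL of course lives in the engine:

```python
# Simulate 32-bit INT arithmetic: check for overflow after every addition,
# evaluating left to right as the parenthesization dictates.

INT_MIN, INT_MAX = -2**31, 2**31 - 1

def checked_add(a, b):
    r = a + b
    if not (INT_MIN <= r <= INT_MAX):
        raise OverflowError("Arithmetic overflow error")
    return r

hi, lo = INT_MAX, INT_MIN

# (hi + lo) + hi: fine, the intermediate result is -1
x = checked_add(checked_add(hi, lo), hi)

# (hi + hi) + lo: the intermediate hi + hi already overflows
try:
    y = checked_add(checked_add(hi, hi), lo)
except OverflowError as e:
    y = str(e)
```

The only difference between the two expressions is which intermediate sum gets computed first, which is exactly the associativity rewrite at issue.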
His example included SUM(b + c) and SUM(c + b) as equivalent. As such it probably is[1], but extend it to 3 numbers and you can't freely rearrange, as I think we agree.
[1] assuming b and c are numbers not strings, as + is string concat in some dialects
1. It's in JS and not Python. These days a common data (and data-tooling) stack revolves around Python, and fitting JS into this ecosystem is not straightforward.
2. Limited dialect support. As far as I can see, it only supports a "PostgreSQL flavor" (not sure what exactly is meant by "flavor" here). Support for dialects like Spark, Trino, and Hive SQL was crucial for us.
Definitely a worthy mention, though, thank you!
(I've been working on a similar problem, effectively diffing an XML tree.)
> when a nested query is refactored into a common table expression (CTE), this kind of change doesn’t have any functional impact on either a query or its outcome
This isn't quite true, at least in Postgres. It won't affect the outcome, but it can affect the query plan: CTEs acted as an optimization fence until they were made inlinable.
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...
> Text-based diff tools such as git diff, when applied to a code base, have certain limitations. First, they can only detect insertions and deletions, not movements or updates of individual pieces of code.
git diff can detect movements: see the --color-moved option, which can also be enabled via diff.colorMoved in your .gitconfig.