Worth reading the thread, there are some good insights. It looks like he will be waiting on Postgres to take the initiative on implementing this before it makes it into a release.
> My goal is to keep SQLite relevant and viable through the year 2050. That's a long time from now. If I knew that standard SQL was not going to change any between now and then, I'd go ahead and make non-standard extensions that allowed for FROM-clause-first queries, as that seems like a useful extension. The problem is that standard SQL will not remain static. Probably some future version of "standard SQL" will support some kind of FROM-clause-first query format. I need to ensure that whatever SQLite supports will be compatible with the standard, whenever it drops. And the only way to do that is to support nothing until after the standard appears.
https://clickhouse.com/docs/en/guides/developer/alternative-...
Being able to do SELECT FROM WHERE in any order and allowing multiple WHEREs and AGGREGATE etc, combined with supporting trailing commas, makes copy pasting templating and reusing and code-generating SQL so much easier.
FROM table <-- at this point there is an implicit SELECT *
SELECT whatever
WHERE some_filter
WHERE another_filter <-- this is like AND
AGGREGATE something
WHERE a_filter_that_is_after_grouping <-- is like HAVING
ORDER BY ALL <-- group-by-all is great in engines that support it; want it for ordering too
...Really hope this takes off and gets more widespread adoption because I really want to stop doing:
SELECT *
FROM all_the_joins
into SELECT {my statements here}
FROM all_the_joinsWhile LINQ is mostly restricted to .NET, PRQL is not. https://prql-lang.org/
It's a welcome change in the industry.
I made this prediction a couple years back: https://x.com/tehlike/status/1517533067497201666
1. Similar to what you mention, while I think PRQL is pretty easy to learn if you know SQL, it still "feels" like a brand new language. This piped SQL syntax immediately felt awesome to me - it mapped how my brain likes to think about queries (essentially putting data through a chain of sieves and transforms), but all my knowledge of SQL felt like it just transferred over as-is.
2. I feel like I'm old enough now to know that the most critical thing for adoption of new technologies that are incremental improvements over existing technologies is to make the upgrade path as easy as possible. I shouldn't have to overhaul everything at once, but I just want to be able to take in small pieces a chunk at a time. While not 100% the same thing, if you look at the famously abysmal uptake of things like IPv6 and the pain it takes to use ES module-only distributions from NPM, the biggest pain point was these technologies made you do "all or nothing" migrations - they didn't have an easy, simple way to get from point A to point B. The thing I like about this piped SQL syntax is that in a large, existing code base I could easily just start adding this in new queries, but I wouldn't really feel the need to overhaul everything at once. With PRQL I'd feel a lot less enthusiastic about using that in existing projects where I'd have a mix of SQL and PRQL.
If you haven’t tried it, it’s great!
Kusto does seem interesting too, and i think some of the stuff i want to build will find a use for it!
I find piped queries both easier to write, and read.
I've never understood why copying text from digitally native PDFs (created directly from digital source files, rather than by OCR-ing scanned images) is so often such a poor experience. Even PDFs produced from LaTex often contain undesirable ligatures in the copied text like fi and fl. Text copied from some Springer journals sometimes lacks space between words or introduces unwanted space between letters in a word ... Is it due to something inherent in PDF technology?
Exactly. PDF doesn't have instructions to say "render this paragraph of text in this box", it has instructions to say "render each of these glyphs at each of these x,y coordinates".
It was never designed to have text extracted from it. So trying to turn it back into text involves a lot of heuristics and guesswork, like where enough separation between characters should be considered a space.
A lot also depends on what software produced the PDF, which can make it easier or harder to extract the text.
There are many good reasons why PDF has a “render glyph” instruction instead of a “render string”. In particular your printer and your PDF viewer should not need to have the same text shaping and layout algorithms in order for the PDF to render the same. Oops, your printer runs a different version of Harfbuzz!
The sibling comment is right that a lot depends on the software that produced the PDF. It’s important to be accurate about where the blame lies. I don’t blame the x86 ISA or the C++ standards committee when an Electron app uses too much memory.
Well formatted epub and HTML generally are usually intended to update to end user needs and better fit available layout space.
However, I don't know if XPS handles the copying of text better.
https://news.ycombinator.com/item?id=41321876 (first) https://news.ycombinator.com/item?id=41338877 (plenty of discussions)
I tried this new syntax and this seems a reasonable proposal for complex analytical queries. This new syntax probably does not change most simple transactional queries though. The syntax matches the execution semantic more closely, which means you less likely need to formulate query in a weird form to make query planner work as expected; usually users only need to move some pipe operators to more appropriate places.
I think they intentionally kept themselves away from massive redesign of the languages, which has a good chance of becoming multi decades of frustrating death march. I know a number of such cases from C++ standard proposals and probably the team wanted to avoid it.
To be honest, this feels exactly like the kind of mistake that IPv6 made. It wasn't just "let's extend the IPv4 address space and provide an upgrade path that's as incremental as possible", it was "IPv4 has all these problems, lets solve the address space issue with a completely new address space, and while we're at it lets fix 20 other things!" Meanwhile, over a quarter century later, IPv4 shows no signs of going away any time soon.
I'd much rather have an incremental improvement that solves 90% of my pain points than to reach for some "Let's throw all the old stuff away for this new nirvana!" And I say this as someone that really likes PRQL.
Just for once I want to see complete examples of the syntax on an actual advanced query of any kind right away. Sure, toss out one simple case, but then show me how it looks when I have to join 4-5 reference tables to a fact table and then filter based on those things.
Once you do that, it becomes clear why SELECT first won out originally: legibility and troubleshooting.
As long as DBs continue to support standard SQL they can add whatever additional syntax support they want but based on history this'll wind up being a whole new generation of emacs vs vi style holy war.
This isn't really fair. BeefWellington gave a reason why SQL is how it is (and how it has been for ~50 years). It's reasonable to ask for a compelling reason to change the clause order. Simon's post says it "has always been confusing", but doesn't really explain why except by linking to a blog post that says that the SQL engine (sort of but not really) executes the clauses in a different order.
I think the onus of proof that SQL clauses are in the wrong order is on the people who claim they're in the wrong order.
At my previous gig I worked for a decade with an application that meant creating and maintaining large hairy sql that was created to offload application logic to the database (_very_ original) And we used to talk about this "wrong order" often but I never once actually missed it. It was at the most a bit annoying when you jumped in a server to troubleshoot and you knew the two columns you were interested in and you could have saved two seconds. But when working with maintaining those massive queries it always felt good to have the projection up top because that is the end result and what the query is all about. I would not have liked if the method signature in eg Java was just the parameters and the return type was after the final brace. This analogy falls apart of course since params are all over the place but swapping things around wouldn't help.
So just go 'SELECT *...' and go back and expand later, I want my sql syntax "simple". /old developer
SQL has lots of warts, e.g.: the fact that you can write SQL that joins tables without including those tables in a JOIN, which leads to confusion. It's fragmented too -- the other example I posted shows two different syntaxes for TOP N / LIMIT N because different vendors went different ways. The fact that some RDBMSes provide locking hint mechanics and some don't (at least not reliably). The fact that there's no standard set of "library" functions defined anywhere, so porting between databases requires a lot of validation work. It makes portability hard, and some of those features are missing from standards.
You'll note I also mentioned that if they want to add it that's fine but it's gonna wind up being a point of contention in a lot of places. That's because I've seen the same thing happen with the "Big Data" vs "what we have works" crowd.
Having select up front avoids problems in a couple key ways:
1. App devs who are working on their application can immediately see what fields they should expect in their resultset. For CRUD, it's probably usually just whatever fields they selected or `*` because everyone's in the habit of asking for every field they'll never use.
2. Troubleshooting problems is far easier because they almost always stem from a field in the projection. Seeing the projected field list (and thus, table aliases that field comes from) are literally the first pieces of information you need (what field is it and where does that field come from) to start troubleshooting. This is why SELECT ... FROM makes the most sense -- it's literally the two most crucial pieces of information right up front.
3. Query planners already optimize and essentially compile the entire thing anyways, so legibility trumps other options IME.
Another point I'd make to you and everyone else bringing up autocomplete: If you need it, nothing is stopping you from writing your FROM clause first and then moving a line up to write your SELECT. Kinda like how you might stub out a function definition and later add arguments. This doesn't affect the final form for legibility.
nothing "becomes clear" just by you claiming so, better elaborate
And several more examples with pipe syntax here: https://github.com/google/zetasql/blob/master/zetasql/exampl...
Select first was as much an accident of "it sounded better as an English sentence" to the early SQL designers. Plus also they were working with early era parsers with very limited look ahead and putting the primary "verb" up front was important at the time.
But English is very flexible, especially in "command syntax" and From first is surprisingly common: "From the middle cupboard, grab a plate". SQL trying to sound like English here only shows how inflexible it still is in comparison to actual English.
I've been using C#'s LINQ since it was added to the language in 2007 and the from/where/join/group by/select order feels great, is very legible especially because it gives you great autocomplete support, and troubleshooting is easier than people think.
If you engage the syntax with your System 2 thinking (prefrontal cortex, slow, the part of thinking we're naturally lazy to engage) rather than System 1 (automated, instinctual, optimized brain path to things we're used to) you'll most likely find that it is simpler, makes more logical sense so that you're filtering down things naturally like a sieve and composes far better than SQL as complexity grows.
After you've internalized that, imagine the kind of developer tooling we can build on top of that logical structure.
You might not have intended it this way, but your choice of phrasing is very condescending.
It quite interesting to dive into history of SQL alternatives in 70x/80x.
Also, tools can trivially tell DQL from DML by the first word they encounter, barring data-modifying functions (o great heavens, no!).
Bikeshedding par excellence.
This particular post quickly turned into a very thinly veiled excuse for me to complain about PDFs, then demonstrate a Gemini Pro trick.
In this case I converted to HTML - I've since tried converting a paper to Markdown and sharing in a Gist, which I think worked even better: https://gist.github.com/simonw/46a33d66e069efe5c10b63625fdab... - notes here https://simonwillison.net/2024/Aug/27/distro/
If you replace `gist.github.com/<user>/<id>` -> `https://gist.io/@<user>/<id>`, you get a gist with nice typography.
https://gist.io/@simonw/46a33d66e069efe5c10b63625fdabb4e is the same gist you linked, but nicer to read
There was a talk at the time, but I can't find the video: http://jaoo.dk/aarhus2007/presentation/Using+LINQ+to+SQL+to+....
Basically, it was a way to cleanly plug SQL queries into C# code.
It used this sort of ordering (where the constraints come after the thing being constrained); it needed to do so for IntelliSense to work.
And FROM-first syntax absolutely makes more sense, regardless of autocomplete. You should put the "what I need to select" after the "what I'm selecting from", in general.
"Short lived"? Its still alive, AFAIK, and the more popular newer thing for the same use case, Linq to Enntities, has the same salient features but (because it is tied to Entity Framework and not SQL Server specific) is more broadly usable.
If they've replaced it with something else in the last decade and a half that does not mean that they didn't get rid of it, or that it wasn't short lived.
https://learn.microsoft.com/en-us/dotnet/framework/data/adon...
Of course there's EF Core too.
Also, there are fun things that support Linq syntax for non-ORM uses, too, such as System.Reactive.Linq and LanguageExt: https://github.com/louthy/language-ext/wiki/How-to-deal-with...
Google has now proposed a syntax inspired by these approaches. However, I am afraid how well it would be adopted. As someone new to SQL, nearly every DB seem to provide its own SQL dialect which becomes cumbersome very quickly.
Whereas PRQL feels something like Apache Arrow which can map to other dialects.
I did my PhD more than 20 years ago and it was annoying then to be working with all these postscript and pdf documents. It's still annoying. These days people publish content in PDF form on websites and mostly not in printed media. People might print these or not. Twenty years ago, I definitely did. But it's weird how we stick with this. And PDFs are of course very unstructured and hard to make sense of programmatically as well.
I bet a lot of modern day scientists don't actually print the articles they read anymore and instead read them on screen or maybe on some ipad or e-reader. Print has become an edge case. Reading a pdf on a small e-reader is not ideal. Anything with columns is kind of awkward to deal with. There's a reason why most websites don't use columns: it kind of sucks as a UX. The optimal form to deliver text is in a responsive form that can adapt to any screen size where you can change the font size as well. A lot of scientific paper layouts are optimized to conserve a resource that is no longer relevant: paper real estate. Tiny fonts, multiple columns, etc.
Anyway, I like Simon's solution and how it kind of works. It's kind of funny how some of these LLMs can be so lazy. The thing with the references being omitted is hilarious. I see the same with chat gpt where it goes out of its way to never do exactly as you asked and instead just give you bits and pieces of what you ask for until you beg it to just please FFing do as you're told?! I guess they are trying to save some tokens or GPU time.
The one-column format is fine on a large monitor, but on a small phone I prefer narrower columns, because a wide column would either make the text too small or it would require horizontal panning while reading.
So I consider the two-column format as better for phones, not worse.
Also it would make autocomplete in intelligent IDEs much more helpful when typing a query out from nothing.
"users" |> where([u], u.age > 18) |> select([u], u.name)
I pointed out that you can do this with shell:
Pipelines Support Vectorized, Point-Free, and Imperative Style https://www.oilshell.org/blog/2017/01/15.html
e.g.
hist() {
sort | uniq -c | sort -n -r
}
$ { echo a; echo bb; echo a; } | hist
1 bb
2 a
$ foo | hist
...
Something like that should be possible in SQL!There's an example at the bottom of this file:
https://github.com/google/zetasql/blob/master/zetasql/exampl...
What about scalar valued functions? :) So I can reuse an expression in a WHERE and so forth
(and I appreciate that HAVING can be generalized/removed)
GROUP AND ORDER BY component_id DESC;
Is this kind of syntax combining grouping and ordering really necessary in addition the pipe operator? My advice would be to add the pipe operator and not get fancy adding other syntax to SQL as well. flights |>
filter(
carrier == "UA",
dest %in% c("IAH", "HOU"),
sched_dep_time > 0900,
sched_arr_time < 2000
) |>
group_by(flight) |>
summarize(
delay = mean(arr_delay, na.rm = TRUE),
cancelled = sum(is.na(arr_delay)),
n = n()
) |>
filter(n > 10)
If you haven't used R, it has some serious data manipulation legs built into it. d |> filter(id==max(id),.by=orderId)
I think PRQL were thinking a bit about ways to avoid a group_by operation and I think what they have is a kind of ‘scoped’ or ‘higher order’ group_by operation which takes your grouping keys and a pipeline and outputs a pipeline step that applies the inner pipeline to each group.https://leanpub.com/combinators/read#leanpub-auto-the-thrush
Being a concept which transcends programming languages, a search for "thrush combinator" will yield examples in several languages.
[1] https://github.com/raganwald-deprecated/homoiconic/blob/mast...
For example, given the expression:
f (g (h (x)))
The same can be expressed in languages which support the "|>" infix operator as: h (x) |> g |> f
There are other, equivalent, constructs such as the Cats Arrow[0] type class available in Scala, the same Arrow[1] concept available in Haskell, and the `andThen` method commonly available in many modern programming languages.Right now, it's everyone faffing around with different mental models and ugly single pass compilers (my understanding is that parsing-->query planning is not nearly as well-separated in most DBs as parsing-->optomize-->codegen in most compilers).
Do you mean something other than ISO/IEC 9075:2023 (the 9th edition of the SQL standard)?
Syntax/DSL designers: if your language uses a separator for anything, please kindly allow trailing versions of that separator anywhere possible.
INSERT INTO table FORMAT JSONEachRow {"key": 123}
It works with all other formats as well.Plus, it is designed in a way so you can make an INSERT query and stream the data, e.g.:
clickhouse-client --query "INSERT INTO table FORMAT Protobuf" < data.protobuf
curl 'https://example.com/?query=INSERT...' --data-binary @- < data.bsonhttps://github.com/kiranandcode/petrol
An example query being:
```
let insert_person ~name:n ~age:a db = Query.insert ~table:example_table ~values:Expr.[ name := s n; age := i a ] |> Request.make_zero |> Petrol.exec db
```
FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
AND o_comment NOT LIKE '%unusual%packages%'
|> AGGREGATE COUNT(o_orderkey) c_count
GROUP BY c_custkey
|> AGGREGATE COUNT(*) AS custdist
GROUP BY c_count
|> ORDER BY custdist DESC, c_count DESC;
You could do something similar with Ryelang's spreadsheet datatype: customers: load\csv %customers.csv
orders: load\csv %orders.csv
orders .where-not-contains 'o_comment "unusual packages"
|left-join customers 'o_custkey 'c_custkey
|group-by 'c_custkey { 'c_custkey count }
|group-by 'c_custkey_count { 'c_custkey_count count }
|order-by 'c_custkey_count_count 'descending
Looking at this, maybe we should add an option to name the new aggregate column (now they get named automatically) in group-by function because c_custkey_count_count is not that elegant for example.I like the syntax for reading what the statement expects to output first, even though I agree that I don’t write them select first. I feel like this might be optimizing the wrong thing.
Although the example is nice, it does not show 20 tables joined first, which will really muddle it.
1. Calling it "RETURN" makes the fact of its later order of execution (relative to FROM etc) less surprising.
2. "RETURN RAND()" just reads more naturally than "SELECT RAND()". After all, we're not really "selecting" anything here, are we?
3. Would also eliminate any confusion with the selection operation in relational algebra.
{:select [:name :age]
:from {:people :p}
:where [:> :age 10]}
Since maps are unordered, this is equivalent to {:from {:people :p}
:select [:name :age]
:where [:> :age 10]}
and also {:where [:> :age 10]
:select [:name :age]
:from {:people :p}}
These can all be rendered to 'SELECT... FROM' or 'FROM .. SELECT'.Queries as data structures are very versatile, since you can use the language constructs to compose them.
Queries as strings (FROM-first or not) are still strings which are hard to compose without breaking the syntax.
This feels like too much. GROUP BY and ORDER BY are separate clauses, and creating a way to group (heh) them in one clause complicates cognitive load, especially when there is an effort to reduce the overall effort to parse the query in your mind (and to provide a way for an intellisense-like system a way to make better suggestions).
GROUP AND ORDER BY x DESC;
vs GROUP BY x;
ORDER BY x DESC;
This long form is 1 word longer, but, it easier to parse in your mind, and doesn't introduce unneeded diffs when changing either the GROUP or the ORDER BY column reference. |>
What IS that thing? A unix pipe that got confused with a redirect? A weird smiley of a bird wearing sunglasses?It'll take some getting used to, for me...
Many Programming Ligature fonts even often draw it that way. For instance it is shown under F# in the Fira Code README: https://github.com/tonsky/FiraCode
I think this is bad rationale. Having the columns in order is much more important than having neat syntax for full-table aggregation.
If the DB engine is executing the statement out of order, why not allow the statement to be written in any order and let itself figure it out?
It's inspired by Kusto and available as an open-source CLI. I've made it compatible with SQLite in one of my tools, and it's refreshing to use.
An example:
StormEvents
| where State startswith "W"
| summarize Count=count() by StateUSING THE NEW SQL PROCEDURE IN SAS PROGRAMS (1989) https://support.sas.com/resources/papers/proceedings-archive... The Sql procedure uses SQL to create, modify, and retrieve data from SAS data sets and views derived from those data sets. You can also use the SOL procedure to join data sets and views with those from other database management systems through the SAS/ACCESS software interfaces.
[1] https://elixirschool.com/en/lessons/basics/pipe_operator
Pipe Syntax in SQL - https://news.ycombinator.com/item?id=41338877 - Aug 2024 (219 comments)
Is it though?
Are we trying to solve the human SQL parser and generator problem or there is some underlying implementation detail that benefits from pipes?
These queries are always hand-rolled because you pay the analysts to optimize them.
SQL is fine.
SQL has been the state of the art for db queries for 40 years.
And it will continue to be when we all retire.
LINQ: exists
Splunk query language: exists
KQL: exists
MongoDB query language: exists
PRQL: exists
PRQL is the only real entrant in your list when it comes to adding a pipelining syntax to a language for relational queries in a way that others can freely build on.
The paper clearly describes the goal: add a pipe syntax into existing systems with minor changes and be compatible with existing SQL queries.
BTW: LINQ is an AST transformer not a language per se tied to a particular platform. None of existing DBs allows to use it directly.
// but let's change it to *int ptr;
// because the pointer symbol is more logical to write first
Please can we solve a real problem instead?
https://cs.brown.edu/~sk/Publications/Papers/Published/rk-st...