It turns out that the dialect doesn't support LATERAL joins with a LIMIT in them. The below query only works if you remove the LIMIT clause.
https://i.stack.imgur.com/rdB1s.png
This makes expressing things like "Fetch all artists where ..., for each artist fetch their first 3 albums where ..., and for each album fetch the top 10 tracks where ..." really difficult.
Does Trino support this out of curiosity?
If that's not peak Amazon, I don't know what is.
It's one of the backends available in Splink, our FOSS record linkage software, and it's revolutionary how it lets users execute large-scale probabilistic record linkage ridiculously cheaply. Not long ago you needed very expensive proprietary software plus a big on-prem cluster, costing hundreds of thousands, to achieve this.
A lot of the magic for me is on the infrastructure side: how they can read/write large datasets from s3 so quickly, so the value isn't just in the SQL engine.
trino> USE memory.default;
USE
trino:default> create table artist (artistid int);
CREATE TABLE
trino:default> create table album (albumid int, artistid int);
CREATE TABLE
trino:default> insert into artist values 1, 2;
INSERT: 2 rows
Query 20220804_182827_00005_n4rat, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0.52 [0 rows, 0B] [0 rows/s, 0B/s]
trino:default> insert into album values (11, 1), (12, 1), (21, 2);
INSERT: 3 rows
Query 20220804_182857_00006_n4rat, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0.18 [0 rows, 0B] [0 rows/s, 0B/s]
trino:default> select * from (select * from artist limit 2) a cross join lateral (select * from album where album.artistid = a.artistid limit 2);
artistid | albumid | artistid
----------+---------+----------
1 | 12 | 1
1 | 11 | 1
2 | 21 | 2
(3 rows)
Query 20220804_182930_00007_n4rat, FINISHED, 1 node
Splits: 41 total, 41 done (100.00%)
0.35 [8 rows, 232B] [22 rows/s, 661B/s]
Note: the original Presto continues to run in production at Meta (formerly Facebook) and Uber, and recently the ByteDance TikTok data platform talked about running 1M queries a day with tens of thousands of cores. Some reasons to stay with Presto:
- Reliability and scalability per the above
- Cutting-edge innovations only in later versions of Presto: multi-level caching (project RaptorX) to boost query performance by 10x+ and table scan improvements (project Aria), to name a few
- Only PrestoDB is hosted by the Linux Foundation, giving community users confidence that future releases will remain open.
https://stackoverflow.com/a/73129836/13485494
But man is it a huge PITA (especially when doing programmatic code generation of the SQL) compared to LATERAL joins
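For the curious, here's a sketch of that window-function workaround, the usual "top N per group" rewrite, run through Python's sqlite3 as a stand-in engine (SQLite ≥ 3.25 for window functions; the artist/album tables mirror the Trino session above, but nothing here is dialect-specific to the screenshot):

```python
import sqlite3

# ROW_NUMBER() workaround for "top N per group", the rewrite that
# replaces a LATERAL join with a LIMIT inside it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artist (artistid INT);
    CREATE TABLE album  (albumid INT, artistid INT);
    INSERT INTO artist VALUES (1), (2);
    INSERT INTO album  VALUES (11, 1), (12, 1), (21, 2);
""")

# Instead of: ... CROSS JOIN LATERAL (SELECT ... LIMIT 2)
# number the albums per artist, then keep row_num <= 2.
rows = conn.execute("""
    SELECT artistid, albumid FROM (
        SELECT al.artistid, al.albumid,
               ROW_NUMBER() OVER (PARTITION BY al.artistid
                                  ORDER BY al.albumid) AS row_num
        FROM artist ar JOIN album al ON al.artistid = ar.artistid
    ) t
    WHERE row_num <= 2
    ORDER BY artistid, albumid
""").fetchall()
print(rows)  # [(1, 11), (1, 12), (2, 21)]
```

Same result as the LATERAL version, but the per-group LIMIT becomes a filter on the window's row number, which is why generating it programmatically is so much more awkward.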
Someone familiar with the CockroachDB query planner showed me that this kind of window function is exactly what Cockroach turns LATERAL joins into, for instance:
demo@127.0.0.1:26257/movr> explain select * from abc, lateral (select * from xyz where x = a limit 2);
• filter
│ estimated row count: 1
│ filter: row_num <= 2
│
└── • window
│ estimated row count: 2
│
└── • hash join
│ estimated row count: 2
│ equality: (x) = (a)
│
├── • scan
│ estimated row count: 6 (100% of the table; stats collected 2 minutes ago)
│ table: xyz@xyz_pkey
│ spans: FULL SCAN
│
└── • scan
estimated row count: 1 (100% of the table; stats collected 3 minutes ago)
table: abc@abc_pkey
spans: FULL SCAN
- https://github.com/bitsondatadev
- https://www.linkedin.com/in/bitsondatadev/
I recommend the Trino Slack for people not already in it: https://trino.io/slack.html
If you want to get started with Trino, here's a repo I created to do so: https://github.com/bitsondatadev/trino-getting-started
https://trino.io/blog/2022/08/02/leaving-facebook-meta-best-...
BTW, purely out of curiosity, I compared Trino with Presto from an OSS point of view (https://ossinsight.io/analyze/prestodb/presto?vs=trinodb%2Ft...): both communities are still popular, but Trino seems more active than Presto now. I also wonder if the two communities may reunite someday to really boost their impact (compared to the Spark community).
- https://engineering.salesforce.com/how-to-etl-at-petabyte-sc... - https://shopify.engineering/faster-trino-query-execution-inf... - https://trino.io/episodes/33.html - https://www.youtube.com/watch?v=-5mlZGjt6H4
All of them use the Lyft presto-gateway project ("Presto" in the name, but really Trino) to run different clusters for various workloads. They go into detail on how this is achieved.
https://github.com/lyft/presto-gateway
Regarding the Trino/Presto split: I recommend reading this blog to better understand why the two communities aren't merging. TL;DR: Presto is a Facebook-driven project that is mainly concerned with running on Facebook's infrastructure. Trino is a community-driven project that works on running well across all clouds and the common infrastructure used by the Trino community, which is why you see higher velocity there.
https://trino.io/blog/2022/08/02/leaving-facebook-meta-best-... https://trino.io/blog/2020/12/27/announcing-trino.html
Soon we anticipate that Trino will become the common name in the community space but we'll always love the origins of the Trino project being Presto.
For all intents and purposes, both projects are active and lively. It seems that Trino is more focused on federation and building out connectors. Presto is more focused on being the engine for the data lake/lakehouse. Both projects are doing well and solving different problems. There's been a lot of innovative features in the Presto project over the last year that are only in Presto, like Presto-on-Spark, disaggregated coordinator, Project Aria, etc. In fact we just hosted a fantastic user conference a few weeks ago that showcased a lot of that innovation and how companies are using Presto at massive scale today (if interested, check out the sessions: https://www.youtube.com/watch?v=Gi8i7eHqwyw&list=PLJVeO1NMmy...)
Long story short, Presto is alive and well, is not solely backed by 1 company (quite the opposite of Trino/Starburst), and has a lot of tech innovation on the roadmap. We're excited about the future of Presto.
https://trino.io/blog/2022/06/30/trino-summit-call-for-speak...
To enable users to connect to their databases... we have a form that collects database credentials from the user and saves them securely. When the user runs a SQL query, we establish a database connection right away (from our server), execute it, return the results, and keep the connection alive for about 15 minutes.
But with serverless architecture, first query could go to instance 1, so instance 1 will establish a db connection, then the second query could go to instance 2, so instance 2 will establish another one. You could end up with a lot of unnecessary connections.
If you use AWS RDS (for yourself), beside lambda for example, AWS have RDS Proxy to solve this problem.
So I was thinking about using Trino like the RDS Proxy, but for more databases, and for our customers database, not ours. Is that doable with Trino?
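Not a Trino answer, but for context, the usual serverless-side mitigation (and roughly what RDS Proxy automates) is caching connections at module scope so warm invocations of the same instance reuse them. A minimal Python sketch, where sqlite3 and the `cred_id` key are stand-ins for a real driver and real credential storage:

```python
import sqlite3

# Module-level cache: survives between warm invocations of the same
# serverless instance, so each instance opens at most one connection
# per credential set instead of one per request.
_connections = {}

def get_connection(cred_id: str, dsn: str):
    """Reuse an open connection on warm invocations instead of
    opening a new one per request."""
    conn = _connections.get(cred_id)
    if conn is None:
        conn = sqlite3.connect(dsn)
        _connections[cred_id] = conn
    return conn

def run_query(cred_id: str, dsn: str, sql: str):
    return get_connection(cred_id, dsn).execute(sql).fetchall()

# Two "requests" with the same credentials share one connection.
first = get_connection("cust-1", ":memory:")
second = get_connection("cust-1", ":memory:")
print(first is second)  # True
```

This only deduplicates connections within one instance; across many instances you still get one connection each, which is exactly the fan-out problem a proxy in front of the database solves.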
It would be super good if you guys added BigQuery write support. It's really annoying to have to run a Hive cluster in Google Cloud to act as a proxy for this.
But, yes, we do plan to add that eventually after ironing out all the kinks. See https://github.com/trinodb/trino/pull/13094
The other thing I would say is that Trino and Presto are not one-trick ponies or just hive replacements. There's also the ability to query across multiple systems that is, to me, the feature that future proofs a lot of architectures. It inherently frees you up to fiddle with your data in different systems but keep the access to that system in one location.
Apart from implementation details, probably not much different. It is similar to mysql vs postgresql. You are probably okay with either.
I agree but it depends a bit on what purpose you are using them for. If you mainly use the tool to JOIN some data in bulk and then write output somewhere else (i.e. ETL) - either will serve you fine.
If you write complex queries with multiple filters and want to JOIN across multiple datasets - sure Spark can do that as well but it's not as efficient in pushing down computation to the source.
e.g. a query like SELECT c.custkey, sum(totalprice) FROM orders o INNER JOIN customer c ON o.custkey = c.custkey WHERE o.orderstatus = 'O' GROUP BY c.custkey;, when run on Spark, will pull both tables into memory, perform the join plus the filter for orderstatus = 'O', and then compute the sum.
Trino, on the other hand, will push the entire query down into the remote database in this case (in other queries it'll push down some parts of the query), so the source database doesn't need to return gigabytes of data over the network every time the query runs (and the query finishes faster as well).
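The difference can be illustrated with a toy sketch (pure Python, nothing Spark- or Trino-specific: the `Source` class and its `shipped` counter are made up for the illustration) that counts how many rows the remote side has to send with and without pushdown:

```python
# Toy model of a remote data source that counts how many rows it
# ships over the wire to the query engine.
class Source:
    def __init__(self, rows):
        self.rows = rows
        self.shipped = 0

    def scan(self, predicate=None):
        """With a predicate, filter at the source (pushdown);
        without one, ship every row to the engine."""
        out = [r for r in self.rows if predicate is None or predicate(r)]
        self.shipped += len(out)
        return out

orders = Source([("O", 10), ("F", 20), ("O", 30), ("F", 40)])

# No pushdown: pull all 4 rows, filter engine-side.
pulled = [r for r in orders.scan() if r[0] == "O"]

# Pushdown: the source ships only the 2 matching rows.
pushed = orders.scan(lambda r: r[0] == "O")

print(pulled == pushed, orders.shipped)  # True 6  (4 rows + 2 rows)
```

Same answer either way; the pushdown path just moves far less data, which is the whole argument above in miniature.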
Trino tries to push down operations to the remote system when they can be done more efficiently there, e.g. filtering on a column that has an index in the remote RDBMS is faster than pulling all the data and then filtering in Trino. Spark doesn't have strong pushdown and has to pull most of the raw data and then apply processing on top of it.
That's one of the main differences. Spark is a distributed job execution framework first while Trino is a distributed federated query engine first and it shows in their strengths and weaknesses.
If you want to run arbitrary user defined transformations on data then Spark definitely has much more to offer than Trino.
My suggestion is to try both under your own workloads and see the difference. Trino is also used by products like Athena (AWS) and Galaxy (Starburst) so if you want to play around and see how Trino performs without spending too much time on setting up clusters on your own, you can try these great products.
Having said that, I'd like to add that building a performant distributed query engine is just hard. Trino has been in development for ten years and is used by major companies in very demanding environments; those environments are where the technology has been refined into what it is today, and they are proof of its performance and stability.
(edited to add an important disclaimer that I work at Starburst)
Realtime is generally more expensive to run since you process every individual row as it arrives; batch is for when you can tolerate minutes of latency and want to handle a lot of data in chunks.
Trino is also a query engine rather than a database and it connects to many different systems: https://trino.io/docs/current/connector.html
It also happens to connect to Clickhouse and it's very common that people will use Trino to query clickhouse realtime data and join it with data in big query, an object store data lake, or Snowflake: https://trino.io/docs/current/connector/clickhouse.html
can't believe this shit is free as in freedom