1. Required: a query that joins an event stream with a historical table in Snowflake.
2. Required: executes in near-real time (< 5s), even if the query involves 300M rows.
3. Highly desired: a way of doing dbt-like DAGs, where I can execute a DAG of actions (including external API calls) based on the results of the query.
4. Highly desired: lets me write queries in standard SQL.
5. Desired: true real time (big queries executing with subsecond latency).
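For requirement 1, here's roughly the shape of query I have in mind. This is a sketch in Flink SQL's temporal-join syntax (table and column names are made up), since Flink is one of the candidates; note that the historical side would typically need to be mirrored out of Snowflake into the streaming engine (e.g. via CDC or a lookup connector) rather than queried in place:

```sql
-- Enrich each incoming event with the historical row
-- that was current as of the event's timestamp.
-- "events" and "customer_history" are hypothetical tables.
SELECT
  e.user_id,
  e.event_time,
  h.lifetime_value
FROM events AS e
JOIN customer_history FOR SYSTEM_TIME AS OF e.event_time AS h
  ON e.user_id = h.user_id;
```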
What are the best options out there? It seems like Apache Flink enables this, but there also seem to be a number of other projects out there that may enable some or all of what I'm describing, including:
- kSQL
- Arroyo
- Proton
- Kafka Streams
- Snowflake's Snowpipe Streaming
- Benthos
- RisingWave
- Spark Streaming
- Apache Beam
- Timely Dataflow and derivatives (Materialize, Bytewax, etc.)
Any recommendations on the best tool for the job? Are there interesting alternatives that I haven't named?