DataFusion Comet: Apache Spark Accelerator (opens in new tab)

(github.com)

107 pointsandygrove1y ago25 comments

25 comments

But why. Unless you need to use low-level map/reduce, just ditch Spark and use https://github.com/apache/datafusion-ballista directly. It supports Python too.

simicd1y ago

In short: Compatible with existing Spark jobs but executing them much faster. Benchmarks in the README file and docs [1] show improvements up to 3x while not even all operations are implemented yet (i.e. if an operation is not available in Comet it falls back to Spark), so there is room for further improvements. Across all TPC-H queries the total speedup is currently 1.5x, the docs state that based on datafusion's standalone performance 2x-4x is a realistic goal [1]

Haven't seen any memory consumption benchmarks but suspect that it's lower than Spark for same jobs since datafusion is designsd from the ground up to be columnar-first.

For companies spending 100s of thousands if not millions on compute this would mean substantial savings with little effort.

[1] https://datafusion.apache.org/comet/contributor-guide/benchm...

MrPowers1y ago

The OP is the original creator of Ballista, so he's well aware of the project.

Ballista is much less mature than Spark and needs a lot of work. It's awesome they're making Spark faster with Comet.

andygroveOP1y ago

Yes, Ballista failed to gain traction. I think that one of the challenges was that it only supported a small subset of Spark, and there was too much work involved to try and get to parity with Spark.

The Comet approach is much more pragmatic because we just add support for more operators and expressions over time and fall back to Spark for anything that is not supported yet.

threeseed1y ago

One of the challenges is that most Spark users don't care if you 2x performance.

We are in the enterprise with large cloud budgets and can simply change instance types. If you're 20x then that is a different story but then (a) you need to have feature parity and (b) need support from cloud vendors which Spark has.

OutOfHere1y ago

For the longest time, searching for Ballista linked to its old archived repo that didn't even have a link to the new repo. There was no search result for the new repo. This misled people into thinking that Ballista is a dead project but it wasn't. It wasted so much opportunity.

I don't think it's a fair criticism of Ballista to say that it failed in any way. It just looks to need substantial effort to bring it on par with Spark. The performance benefits are meaningful. Ballista can then not only take the crown from Spark, but also revalidate Rust as a language.

1 more reply

spenczar51y ago

There seems to be a history of data technologies requiring a serious corporate sponsor. Arrow gets so much dev and marketing effort from Voltron, Spark from Databricks, etc. Did Ballista have anything’s similar? I loved the project but it never seemed to move very fast on integrating with other tools and platforms.

orthoxerox1y ago

Because it's a drop-in replacement that lets you (theoretically) spend O(1) development effort on speeding up your Spark jobs instead of O(N).

I say theoretically, because I have no idea how Comet works with the memory limits on Spark executors. If you have to rebalance the memory between regular memory and memory overhead or provision some off-heap memory for Comet, then the migration won't be so simple.

nicornk1y ago

and this is unfortunately the case for Velox.

_zoltan_1y ago

care to elaborate?

necubi1y ago

Many companies have 100k+ of lines of Spark code. It's not trivial to rewrite all of that in another query framework.

OutOfHere1y ago

Following that logic, we should have stuck with C/C++ for everything. /s

MrPowers1y ago

Lots of Spark workloads are executed with the C++ Photon engine on the Databricks platform, so we ironically have partially moved back to C++. Disclosure: I work for Databricks.

1 more reply

threeseed1y ago

There is an entire ecosystem of libraries for Spark built up over years.

I want to be able to connect to interact with the full services from GCS, Azure, AWS, OpenAI etc none of which DataFusion supports.

As well as use libraries such as SynapseML, SparkNLP etc.

And do all of this with full support from my cloud provider.

pitah11y ago

I've been keeping an eye on these kinds of Spark accelerator libraries for a while now.

How does it compare to Blaze[1] and Gluten[2]?

I'm interested in running some benchmarks soon against all three for my project to see how they all go.

[1] https://github.com/kwai/blaze

[2] https://github.com/apache/incubator-gluten

ed_elliott_asc1y ago

Apparently blaze is also datafusion

scirob1y ago

Imagine if data bricks switched and just started to contribute to this.

I live in a dream world :)

nevi-me1y ago

They have their own implementation that is closed source (last time I checked), Photon [1], which is written with a C++ engine.

Databricks' terms prevent(ed?) publishing benchmarks, it would be interesting to see how Comet performs relative to it over time.

Photon comes at a higher cost, so one big advantage of Comet is being able to deploy it on a standard Databricks cluster that doesn't have Photon, at a lower running cost.

[1] https://www.databricks.com/product/photon

slt20211y ago

nice word play for two competing spark execution engines: Photon and Comet. C++ vs Rust, closed-source vs open-source

j / k navigate · click thread line to collapse

25 comments

OutOfHere1y ago

But why. Unless you need to use low-level map/reduce, just ditch Spark and use https://github.com/apache/datafusion-ballista directly. It supports Python too.

simicd1y ago

Haven't seen any memory consumption benchmarks but suspect that it's lower than Spark for same jobs since datafusion is designsd from the ground up to be columnar-first.

For companies spending 100s of thousands if not millions on compute this would mean substantial savings with little effort.

[1] https://datafusion.apache.org/comet/contributor-guide/benchm...

MrPowers1y ago

The OP is the original creator of Ballista, so he's well aware of the project.

Ballista is much less mature than Spark and needs a lot of work. It's awesome they're making Spark faster with Comet.

andygroveOP1y ago

Yes, Ballista failed to gain traction. I think that one of the challenges was that it only supported a small subset of Spark, and there was too much work involved to try and get to parity with Spark.

The Comet approach is much more pragmatic because we just add support for more operators and expressions over time and fall back to Spark for anything that is not supported yet.

threeseed1y ago

One of the challenges is that most Spark users don't care if you 2x performance.

OutOfHere1y ago

1 more reply

spenczar51y ago

orthoxerox1y ago

Because it's a drop-in replacement that lets you (theoretically) spend O(1) development effort on speeding up your Spark jobs instead of O(N).

nicornk1y ago

and this is unfortunately the case for Velox.

_zoltan_1y ago

care to elaborate?

necubi1y ago

Many companies have 100k+ of lines of Spark code. It's not trivial to rewrite all of that in another query framework.

OutOfHere1y ago

Following that logic, we should have stuck with C/C++ for everything. /s

MrPowers1y ago

Lots of Spark workloads are executed with the C++ Photon engine on the Databricks platform, so we ironically have partially moved back to C++. Disclosure: I work for Databricks.

1 more reply

threeseed1y ago

There is an entire ecosystem of libraries for Spark built up over years.

I want to be able to connect to interact with the full services from GCS, Azure, AWS, OpenAI etc none of which DataFusion supports.

As well as use libraries such as SynapseML, SparkNLP etc.

And do all of this with full support from my cloud provider.

pitah11y ago

I've been keeping an eye on these kinds of Spark accelerator libraries for a while now.

How does it compare to Blaze[1] and Gluten[2]?

I'm interested in running some benchmarks soon against all three for my project to see how they all go.

[1] https://github.com/kwai/blaze

[2] https://github.com/apache/incubator-gluten

ed_elliott_asc1y ago

Apparently blaze is also datafusion

scirob1y ago

Imagine if data bricks switched and just started to contribute to this.

I live in a dream world :)

nevi-me1y ago

They have their own implementation that is closed source (last time I checked), Photon [1], which is written with a C++ engine.

Databricks' terms prevent(ed?) publishing benchmarks, it would be interesting to see how Comet performs relative to it over time.

Photon comes at a higher cost, so one big advantage of Comet is being able to deploy it on a standard Databricks cluster that doesn't have Photon, at a lower running cost.

[1] https://www.databricks.com/product/photon

slt20211y ago

nice word play for two competing spark execution engines: Photon and Comet. C++ vs Rust, closed-source vs open-source

j / k navigate · click thread line to collapse