
disgruntledphd2 5y ago

Huh, what replaces Spark in those lists?

For my money, it's the best distributed ML system out there, so I'd be interested to know what new hotness I'm missing.

somurzakov 5y ago

Distributed ML != distributed DWH.

Distributed ML is tough to train because you get very little control over the training loop. I personally prefer single-server training even on large datasets, or switching to online learning algorithms that train, run inference, and retrain at the same time.
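To make the online learning option concrete, here is a minimal sketch of a "predict first, then update" loop on a single machine. It is an illustration only: the class `OnlineLogisticRegression`, the synthetic stream, and the learning rate are assumptions made up for this example, not something from a particular library.

```python
import math
import random


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


class OnlineLogisticRegression:
    """Hypothetical model that is updated one example at a time with SGD."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def learn_one(self, x, y):
        # Gradient of the log loss for a single example.
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


def synthetic_stream(n, seed=0):
    # Made-up data: the label is 1 when the two features sum to more than 0.
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, (1 if x[0] + x[1] > 0 else 0)


def test_then_train(n=5000, tail=1000):
    """Run inference on each example, then train on it; score the last `tail`."""
    model = OnlineLogisticRegression(n_features=2)
    hits = 0
    for i, (x, y) in enumerate(synthetic_stream(n)):
        pred = 1 if model.predict_proba(x) >= 0.5 else 0  # inference
        if i >= n - tail:
            hits += int(pred == y)
        model.learn_one(x, y)  # retrain on the same example
    return hits / tail
```

The model never sees the whole dataset at once, so memory use stays constant no matter how long the stream is.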

As for Snowflake, I haven't heard of people using it to train ML, but Snowflake is a killer managed distributed DWH that you don't have to tinker with and tune.

mrslave 5y ago

> Snowflake is a killer managed distributed DWH that you don't have to tinker with and tune

How do Snowflake (and Redshift, mentioned above) compare with CitusDB? I really like the PostgreSQL experience offered by Citus. I've been bit by too many commercial databases where the sales brochure promises the product does X, Y, and Z, only to discover later that you can't do any of them together because reasons.

disgruntledphd2 (OP) 5y ago

So do I, theoretically at least.

But Spark is super cool and actually has algorithms which complete in a reasonable time frame on hardware I can get access to.

Like, I understand that the SQL portion is pretty commoditised (though even there, Spark SQL's Python and R APIs are super nice), but I'm not aware of any other frameworks for doing distributed training of ML models.

Have all the hipsters moved to GPUs or something? /s

> Snowflake is a killer managed distributed DWH that you don't have to tinker with and tune

It's so very expensive though, and their pricing model is frustratingly annoying (why the hell do I need tickets?).

That being said, tuning Spark/Presto or any of the non-managed alternatives is no fun either, so I wonder if it's the right tradeoff.

One thing I really, really like about Spark is the ability to write Python/R/Scala code to solve the problems that cannot be usefully expressed in SQL.

All the replies to my original comment seem to forget that, or maybe Snowflake has such functionality and I'm unaware of it.

marcinzm 5y ago

>I'm not aware of any other frameworks for doing distributed training of ML models.

TensorFlow, PyTorch (not sure if Ray is needed) and MXNet all support distributed training across CPUs/GPUs in a single machine or multiple machines. So does XGBoost if you don't want deep learning. You can then run them with Kubeflow or on whatever platform your SaaS provider has (GCP AI Platform, AWS SageMaker, etc.).
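For anyone wondering what "distributed training" boils down to in the common data-parallel case, here is a rough single-process simulation: each worker computes a gradient on its own shard, the gradients are averaged, and every worker applies the same update. The function names and the toy data are made up for illustration. Real frameworks do the averaging over the network and none of their APIs appear here.

```python
def local_gradient(w, shard):
    """Mean-squared-error gradient for the model y ~ w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_sgd(shards, lr=0.05, steps=200):
    """Simulate synchronous data-parallel SGD over equally sized shards."""
    w = 0.0
    for _ in range(steps):
        # On a real cluster each shard's gradient is computed on its own worker.
        grads = [local_gradient(w, shard) for shard in shards]
        # Averaging step (an all-reduce in real systems), then a shared update.
        w -= lr * sum(grads) / len(grads)
    return w


# Toy data generated from y = 3 * x, split across two hypothetical workers.
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0)]
shards = [data[:3], data[3:]]
w_fit = data_parallel_sgd(shards)
```

With equally sized shards, the averaged gradient equals the full-batch gradient, so the result matches single-machine training on the whole dataset.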

edit:

>All the replies to my original comment seem to forget that, or maybe Snowflake has such functionality and I'm unaware of it.

Snowflake has support for custom JavaScript UDFs and a lot of built-in features (you can do absurd things with window functions). I also found it much faster than Spark.


kenhwang 5y ago

I'd put Spark in both lists. Old is Spark SQL, new is the programming language interface.

sails 5y ago

Snowflake I suppose for the average ML use case. Not for your high-performance ML, but for your average data scientist, maybe?

Edit: I may be wrong[1], would be curious to know what users who've used Spark AND Snowflake would add to the conversation.

[1] https://www.snowflake.com/blog/snowflake-and-spark-part-1-wh...

marcinzm 5y ago

Snowflake hits its limits with complex transformations, I feel, and not just because it uses SQL. Its "type system" is simpler than Spark's, which makes certain operations annoying. There's a lack of UDFs for working with complex types (lists, structs, etc.). Having to write UDFs in JavaScript is also not the greatest experience.

dominotw 5y ago

> There's a lack of UDFs for working with complex types (lists, structs, etc.). Having to write UDFs in JavaScript is also not the greatest experience.

We load our data into SF as JSON and do plenty of list/struct manipulation using their built-in functions[1]. I guess you might have to write a UDF if you are doing something super weird, but the built-in functions should get you pretty far 90% of the time.

[1] https://docs.snowflake.com/en/sql-reference/functions-semist...

dominotw 5y ago

> best distributed ML system out there

I was comparing it for the "traditional" data engineering stack that used Spark for data munging, transformations, etc.

I don't have much insight into ML systems or how Spark fits there. Not all data teams are building 'ML systems', though. The parent comment wasn't referring to any 'ML systems'; I'm not sure why that would be automatically inferred when someone mentions a data stack.

disgruntledphd2 (OP) 5y ago

Yeah, I suppose. I kinda think that distributed SQL is a mostly commoditised space, and wondered what replaced Spark for distributed training.

For context, I'm a DS who's spent far too much time not being able to run useful models because of hardware limitations, and a Spark cluster is incredibly good for that.

Additionally, I'd argue in favour of Spark even for ETL, as the ability to write (and test!) complicated SQL queries in R, Python and Scala was super, super transformative.
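A small sketch of what "write (and test!) SQL from a host language" can look like. To keep it runnable without a cluster it uses Python's built-in `sqlite3` as a stand-in for Spark SQL. The `orders` table, the column names, and the helper functions are all invented for this example.

```python
import sqlite3


def daily_revenue_query(min_amount):
    """Compose a query from smaller pieces and return it with its parameters."""
    filtered = "SELECT day, amount FROM orders WHERE amount >= ?"
    sql = (
        "SELECT day, SUM(amount) AS revenue "
        f"FROM ({filtered}) GROUP BY day ORDER BY day"
    )
    return sql, (min_amount,)


def run_query(conn, sql, params=()):
    return conn.execute(sql, params).fetchall()


def make_test_db():
    """Build a tiny in-memory fixture so the query can be unit tested."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (day TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [
            ("2021-01-01", 10.0),
            ("2021-01-01", 5.0),
            ("2021-01-02", 2.0),
            ("2021-01-02", 20.0),
        ],
    )
    return conn


conn = make_test_db()
rows = run_query(conn, *daily_revenue_query(5))
# rows == [("2021-01-01", 15.0), ("2021-01-02", 20.0)]
```

Because the query is built by an ordinary function, it can be checked against a fixture like any other code before it goes near production data.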

We don't really use Spark at my current place, and every time I write Snowflake (which is great, to be fair), I'm reminded of the inherent limitations of SQL and how wonderful Spark SQL was.

I'm weird though, to be fair.

victor106 5y ago

I agree with this.

Along with ML, it is also a very high-performance extract-and-transform engine.

Would love to hear what other tech is being used to replace Spark.
