The chispa README also provides a lot of useful info on how to properly write PySpark code: https://github.com/MrPowers/chispa
Scala is easier than Python for Spark because it allows functions with multiple argument lists and isn't whitespace sensitive. Both are great & Spark is a lot of fun.
Some specific notes:
> Doing a select at the beginning of a PySpark transform, or before returning, is considered good practice
Manual selects ensure column pruning is performed (column pruning only pays off with columnar file formats like Parquet). Spark's optimizer usually does this automatically, though, so always selecting manually may not be practical. Explicitly pruning columns is required for Pandas and Dask.
> Be careful with joins! If you perform a left join, and the right side has multiple matches for a key, that row will be duplicated as many times as there are matches
When performing joins, the first thing to think about is whether a broadcast join is possible, since joins on clusters are hard. Then it's good to think about using a data store that allows for predicate pushdown.
It’s notable that in Spark 3.x, Koalas (which adopts the Pandas API) ships as part of Spark. Yet this style guide uses the Spark DataFrame API. So the guide might be a little stale anyway.
In my experience, it’s helpful to write queries in plain portable (or mostly portable) SQL, because once a Spark job becomes useful it often gets translated or refactored into something else. Definitely depends on the team / context, but plain SQL is often more widely accessible. For fast-moving data science stuff, it’s important to think about extensibility in terms of not just code (style & syntax) but people (who is going to remix this?).
I've written some popular Scala Spark (https://github.com/MrPowers/spark-daria) and PySpark (https://github.com/MrPowers/quinn) libraries that have been adopted by a variety of teams. Not sure how to make a reusable library with pure SQL, but it sounds like it's possible. Send me a code snippet or link if you have anything I can take a look at to learn more about your approach.
Another thing is that Python is so cool for data processing, and when working with plain SQL I miss being able to drop down to .rdd.map(my_python_processing_function).

These examples are all using the SQL-like features of Spark. Not a map() or flatMap() in sight.
So... why not just write SQL?
df.createOrReplaceTempView('some_name')
new_df = spark.sql("""select ... from some_name ...""")
All of this F.col(...) and .alias(...) and .withColumn(...) nonsense is a million times harder to read than proper SQL. I just don't understand what any of this is intended to accomplish.

At first glance, the F.col(), .withColumn() syntax isn't as intuitive as pure SQL, but it has a lot of advantages when you get used to it. You can make abstractions, use programming language features like loops, and use IDEs.
I find the PySpark syntax to be uglier than Scala. Lots of teams are terrified to use Scala and that's the reason PySpark is so popular.
This is the definition of bad design.
Looking at the list of contributors on Github, I think I remember that the main author was actually James Thompson (UK), and not anyone on the contributor list. JTUK was called that because the company had another employee, based in the US, who had the same name. James Thompson (US) is now at Facebook and is a pretty cool designer. His astronomer.io media project from 2011 comes up on HN periodically.
Of the people listed on Github, Andrew Ash (now at Google) is the original evangelist for Spark on Palantir Foundry, and Francisco is the PM for SkyWise, Palantir's lucrative but ill-fated effort to save the Airbus A380.
Spark and PySpark are just a PITA to the max.
IMO, Spark is better for some tasks and Dask is better for others.