It's Apache Spark, a framework for processing large amounts of data in parallel on a cluster of computers.
It supports batch processing, streaming, machine learning, and graph jobs. You usually write your code in Scala, Java, Python, or R. Spark itself is written in Scala and runs on the JVM, so the other language APIs ultimately drive that engine. In Python, for example, you use PySpark, and your DataFrame operations are translated into the equivalent JVM execution plan, which is what actually runs.
I mainly work in Python, so I'll talk about some features there. PySpark exposes your data as Spark DataFrames. Operations on them don't run immediately; each one extends a DAG describing the computation. It's not until you trigger an action, such as saving the data or requesting to see it, that Spark optimizes the DAG and actually executes it.
If you need something Spark doesn't support, you can fall back to regular Python, but that code isn't translated into Spark operations: it runs on the driver, a single node, and loses the parallelism. So for anything large you have to rewrite your logic in terms of Spark's API, or wrap it in a UDF, which does distribute it but pays a serialization cost for shipping rows to Python workers.
You can process data in memory, spill to disk, or read from and write to databases, either as sources or as targets.
A typical use case: load the raw data as it comes in, transform it into your intermediate states, then write out different tables depending on what downstream consumers need.
---
It's a framework with an engine that manages code running on clusters, a language for interacting with the data, abstractions and optimizations over your code, ways to store the data, checkpoints for optimization, and more.