In doing so, I'm implicitly using Arrow - e.g. with Duckdb, AWS Athena and so on. The list of tools using Arrow is long! https://arrow.apache.org/powered_by/
Another interesting development since I wrote this is DuckDB.
DuckDB offers a compute engine with great performance against parquet files and other formats. Probably similar performance to Arrow. It's interesting they opted to write their own compute engine rather than use Arrow's - but I believe this is partly because Arrow was immature when they were starting out. I mention it because, as far as I know, there's not yet an easy SQL interface to Arrow from Python.
Nonetheless, DuckDB still uses Arrow for some of its other features: https://duckdb.org/2021/12/03/duck-arrow.html
Arrow also has a SQL query engine: https://arrow.apache.org/blog/2019/02/04/datafusion-donation...
I might be wrong about this - but in my experience, it feels like there's more consensus around the Arrow format, as opposed to the compute side.
Going forward, I see parquet continuing on its path to becoming a de facto standard for storing and sharing bulk data. I'm particularly excited about new tools that allow you to process it in the browser. I've written more about this just yesterday: https://www.robinlinacre.com/parquet_api/, discussion: https://news.ycombinator.com/item?id=34310695.
As far as I can tell the main trade-off seems to be around deserialization overhead vs on disk file size. If anyone has more information or experience with the topic I'd love to hear!
[0] https://arrow.apache.org/faq/#what-about-the-feather-file-fo... [1] https://arrow.apache.org/faq/#what-is-the-difference-between...
EDIT:
More information: https://news.ycombinator.com/item?id=34324649
- Save to feather format. Feather format is essentially the same thing as the Arrow in-memory format. This is uncompressed and so if you have super fast IO, it'll read back to memory faster, or at least, with minimal CPU usage.
- Save to compressed parquet format. Because you're often IO bound, not CPU bound, this may read back to memory faster, at the expense of the CPU usage of decompressing.
On a modern machine with a fast SSD, I'm not sure which would be faster. If you're saving to remote blob storage e.g. S3, parquet will almost certainly be faster.
I have used Arrow and even made my humble contribution to the Go binding, but I don't like pretending it is so much better than other solutions. It is not a silver bullet; probably its best feature is the zero-copy goal for converting data between different frameworks' objects. Depending on the use, a columnar layout can be better for the data, but not always.
Parquet seems to be on a path to become the de facto standard for storing and sharing bulk data - for good reason! (discussion: https://news.ycombinator.com/item?id=34310695)
I ask because something a lot of people miss here is how much performance you can get from the T part of ETL. Denormalizing everything into big simple inflated tables makes things orders of magnitude faster. It matters quite a bit what your comparison is against.
ps: a tiny video to explain storage layout optimizations https://yewtu.be/watch?v=dPb2ZXnt2_U
https://roundup.getdbt.com/p/ep-37-what-does-apache-arrow-un...
I have been burned so many times by amateur-hour software engineering failures from the Apache world that it's very hard for me to ever willingly adopt anything from that brand again. Just put it in gzipped JSON or TSV, and if there's a performance penalty, it's better to pay a bit more for cloud compute than to hate your job because of some nonsense dependency issue caused by an org.apache library failing to follow proper versioning guidelines.
From my experience, Arrow's C++ implementation is pretty solid too, though I don't like it (a matter of taste). I just don't like their approach of forcing std::shared_ptr over Array, Table, Schema and basically everything; why not use an intrusive ref count if the object can only ever be held via shared_ptr anyway? There are also a lot of const std::shared_ptr<Array>& arguments on functions where it's not obvious when they take ownership. And the immutable Array + ArrayBuilder design (versus COW / switching between a mutable uniquely-owned state and an immutable shared state, as in ClickHouse and friends) means that if you have to fill the data out of order, you are forced to buffer it on your side.
Do note that a compute engine (e.g. Velox) may still need to implement its own (Arrow-compatible) array types, as there aren't many fancy encodings in Arrow the format.
"Why not just persist the data to disk in Arrow format, and thus have a single, cross-language data format that is the same on-disk and in-memory? One of the biggest reasons is that Parquet generally produces smaller data files, which is more desirable if you are IO-bound. This will especially be the case if you are loading data from cloud storage such as AWS S3.
Julien LeDem explains this further in a blog post discussing the two formats:
>> The trade-offs for columnar data are different for in-memory. For data on disk, usually IO dominates latency, which can be addressed with aggressive compression, at the cost of CPU. In memory, access is much faster and we want to optimise for CPU throughput by paying attention to cache locality, pipelining, and SIMD instructions. https://www.kdnuggets.com/2017/02/apache-arrow-parquet-colum..."
If someone has a code example to this effect, I'd be grateful.
I was once on the receiving end of a salesy pitch from a cloud advocate claiming that BigQuery (et al.) can "process a billion rows a second".
I tried to create an SQLite example with a billion rows to show that this isn't impressive, but I gave up after hitting some obstacles generating the data.
It would be nice to have an example like this to show developers and engineers who have become accustomed to the extreme levels of CPU abuse today that modern laptops really are supercomputers.
It should be obvious that a laptop can rival a data centre at 90% of ordinary tasks; that it isn't obvious has, in my view, a lot to do with the state of OS/browser/app design and performance. Supercomputers, alas, dedicated to drawing pixels by way of a dozen layers of indirection.
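For what it's worth, a scaled-down sketch of that experiment using only the standard library: generating the rows inside SQLite with a recursive CTE sidesteps the per-row Python overhead that makes naive INSERT loops painful. The row count here is illustrative; scale `n` up towards a billion on a real machine.

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")

# Generate and scan n rows entirely inside SQLite's engine,
# avoiding any per-row work in Python.
n = 1_000_000  # illustrative; raise this to stress a real machine
start = time.perf_counter()
(total,) = con.execute(
    """
    WITH RECURSIVE seq(x) AS (
        SELECT 1 UNION ALL SELECT x + 1 FROM seq WHERE x < ?
    )
    SELECT count(*) FROM seq
    """,
    (n,),
).fetchone()
elapsed = time.perf_counter() - start
print(f"scanned {total:,} rows in {elapsed:.2f}s "
      f"({total / elapsed / 1e6:.0f}M rows/s)")
```

Even this single-threaded, unindexed scan on a laptop gives a useful baseline against "a billion rows a second" claims.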
The speed comes from the raw speed of Arrow, but also from a 'trick': if you apply a filter, it is pushed down to the raw parquet files, so thanks to the hive-style organisation some files don't need to be read at all.
Another trick is that parquet files store some summary statistics in their metadata. This means, for example, that if you want to find the max of a column, only the metadata needs to be read, rather than the data itself.
I'm a Python user myself, but the code would be comparable on the Python side.
Disclaimer: I'm a committer on the Arrow project and contributor to DataFusion.
E.g. https://benchmark.clickhouse.com has some query times for a 100 million row dataset.
I hope it receives more love.