In doing so, I'm implicitly using Arrow - e.g. with Duckdb, AWS Athena and so on. The list of tools using Arrow is long! https://arrow.apache.org/powered_by/
Another interesting development since I wrote this is DuckDB.
DuckDB offers a compute engine with great performance against parquet files and other formats. Probably similar performance to Arrow. It's interesting they opted to write their own compute engine rather than use Arrow's - but I believe this is partly because Arrow was immature when they were starting out. I mention it because, as far as I know, there's not yet an easy SQL interface to Arrow from Python.
Nonetheless, DuckDB still uses Arrow for some of its other features: https://duckdb.org/2021/12/03/duck-arrow.html
Arrow also has a SQL query engine: https://arrow.apache.org/blog/2019/02/04/datafusion-donation...
I might be wrong about this - but in my experience, it feels like there's more consensus around the Arrow format, as opposed to the compute side.
Going forward, I see parquet continuing on its path to becoming a de facto standard for storing and sharing bulk data. I'm particularly excited about new tools that allow you to process it in the browser. I've written more about this just yesterday: https://www.robinlinacre.com/parquet_api/, discussion: https://news.ycombinator.com/item?id=34310695.
As far as I can tell the main trade-off seems to be around deserialization overhead vs on disk file size. If anyone has more information or experience with the topic I'd love to hear!
[0] https://arrow.apache.org/faq/#what-about-the-feather-file-fo... [1] https://arrow.apache.org/faq/#what-is-the-difference-between...
EDIT:
More information: https://news.ycombinator.com/item?id=34324649
- Save to feather format. Feather format is essentially the same thing as the Arrow in-memory format. This is uncompressed and so if you have super fast IO, it'll read back to memory faster, or at least, with minimal CPU usage.
- Save to compressed parquet format. Because you're often IO bound, not CPU bound, this may read back to memory faster, at the expense of the CPU usage of decompressing.
On a modern machine with a fast SSD, I'm not sure which would be faster. If you're saving to remote blob storage e.g. S3, parquet will almost certainly be faster.
I have used Arrow and even made my humble contribution to the Go binding, but I don't like pretending it is so much better than other solutions. It is not a silver bullet; probably its best feature is the zero-copy goal for converting data between different frameworks' objects. Depending on the use, a columnar layout can be better for the data, but not always.
Parquet seems to be on a path to become the de facto standard for storing and sharing bulk data - for good reason! (discussion: https://news.ycombinator.com/item?id=34310695)
I ask because something a lot of people miss here is how much performance you can get from the T part of ETL. Denormalizing everything into big simple inflated tables makes things orders of magnitude faster. It matters quite a bit what your comparison is against.
ps: a tiny video to explain storage layout optimizations https://yewtu.be/watch?v=dPb2ZXnt2_U
https://roundup.getdbt.com/p/ep-37-what-does-apache-arrow-un...
I have been burned so many times by amateur-hour software engineering failures from the Apache world that it's very hard for me to ever willingly adopt anything from that brand again. Just put it in gzipped JSON or TSV, and if there's a performance penalty, it's better to pay a bit more for cloud compute than to hate your job because of some nonsense dependency issue caused by an org.apache library failing to follow proper versioning guidelines.
From my experience, Arrow's C++ implementation is pretty solid too, though I don't like it (a matter of taste). I just don't like their approach of forcing std::shared_ptr over Array, Table, Schema and basically everything; why not use an intrusive ref count if the object can only ever be held via shared_ptr anyway? There are also a lot of const std::shared_ptr<Array>& arguments on functions where it's not obvious when they take ownership. And the immutable Array + ArrayBuilder design (versus COW / switching between a mutable uniquely-owned state and an immutable shared state, as in ClickHouse and friends) means that if you have to fill the data out of order, you are forced to buffer it on your side.
Do note that a compute engine (e.g. Velox) may still need to implement its own (Arrow-compatible) array types, as there aren't many fancy encodings in Arrow the format.
"Why not just persist the data to disk in Arrow format, and thus have a single, cross-language data format that is the same on-disk and in-memory? One of the biggest reasons is that Parquet generally produces smaller data files, which is more desirable if you are IO-bound. This will especially be the case if you are loading data from cloud storage such as AWS S3.
Julien LeDem explains this further in a blog post discussing the two formats:
>> The trade-offs for columnar data are different for in-memory. For data on disk, usually IO dominates latency, which can be addressed with aggressive compression, at the cost of CPU. In memory, access is much faster and we want to optimise for CPU throughput by paying attention to cache locality, pipelining, and SIMD instructions. https://www.kdnuggets.com/2017/02/apache-arrow-parquet-colum..."
If someone has a code example to this effect, I'd be grateful.
I was once on the receiving end of a salesy pitch from a cloud advocate claiming that BigQuery (et al.) can "process a billion rows a second".
I tried to create an SQLite example with a billion rows to show that this isn't impressive, but I gave up after hitting some obstacles generating the data.
It would be nice to have an example like this to show developers and engineers who have become accustomed to the extreme levels of CPU abuse today that modern laptops really are supercomputers.
It should be obvious that a laptop can rival a data centre at 90% of ordinary tasks; that it isn't obvious has, in my view, a lot to do with the state of OS/browser/app design and performance. Supercomputers, alas, dedicated to drawing pixels by way of a dozen layers of indirection.
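For what it's worth, a scaled-down sketch of that experiment using only the standard library: generating the rows inside SQLite with a recursive CTE sidesteps the per-row Python overhead that makes naive INSERT loops painful. The row count here is illustrative; scale `n` up towards a billion on a real machine.

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")

# Generate and scan n rows entirely inside SQLite's engine,
# avoiding any per-row work in Python.
n = 1_000_000  # illustrative; raise this to stress a real machine
start = time.perf_counter()
(total,) = con.execute(
    """
    WITH RECURSIVE seq(x) AS (
        SELECT 1 UNION ALL SELECT x + 1 FROM seq WHERE x < ?
    )
    SELECT count(*) FROM seq
    """,
    (n,),
).fetchone()
elapsed = time.perf_counter() - start
print(f"scanned {total:,} rows in {elapsed:.2f}s "
      f"({total / elapsed / 1e6:.0f}M rows/s)")
```

Even this single-threaded, unindexed scan on a laptop gives a useful baseline against "a billion rows a second" claims.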
The speed comes from the raw speed of Arrow, but also from a 'trick': if you apply a filter, it is pushed down to the raw parquet files, so thanks to the hive-style organisation some files don't need to be read at all.
Another trick is that parquet files store some summary statistics in their metadata. This means, for example, that if you want to find the max of a column, only the metadata needs to be read, rather than the data itself.
I'm a Python user myself, but the code would be comparable on the Python side.
Disclaimer: I'm a committer on the Arrow project and contributor to DataFusion.
E.g. https://benchmark.clickhouse.com has some query times for a 100 million row dataset.
I hope it receives more love.