In my opinion, the point about formats still stands, and the purpose of Arrow is being lost.
I don't know enough about Arrow, but surely there is a better storage format for it than Parquet, although storage isn't the primary consideration for Arrow. The purpose of Arrow is to avoid converting from one format to another: data can be transferred efficiently from RAM, across the wire, to RAM again without any significant transformation.
Surely there is a storage representation for Arrow that delivers similar characteristics to its intended use? E.g.:
disk (arrow format?) -> RAM (arrow)
instead of disk (parquet) -> RAM (parquet) -> CPU (transform) -> RAM (arrow)
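To make the first path concrete, here is a minimal sketch of the idea in plain Python, with no Arrow involved. It assumes a single float64 column and writes it to disk in exactly the layout it has in memory, so "loading" is just memory-mapping the file and viewing the bytes as numbers. The helper names (write_column, open_column) and the file name are made up for illustration. As far as I know, Arrow's own IPC file format (Feather V2) is meant to fill this role for real Arrow data, but I haven't checked how the benchmark could use it.

```python
import mmap
import os
import tempfile
from array import array


def write_column(path, values):
    """Write float64 values to disk in the same layout they have in memory."""
    col = array("d", values)
    with open(path, "wb") as f:
        col.tofile(f)  # raw native-endian bytes, no encoding or compression
    return len(col)


def open_column(path):
    """Map the file and view it as float64 without parsing or copying."""
    with open(path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # cast() reinterprets the mapped bytes in place; nothing is decoded
    return mapped, memoryview(mapped).cast("d")


# Hypothetical usage: disk (same layout) -> RAM (same layout)
path = os.path.join(tempfile.mkdtemp(), "column.bin")
write_column(path, [1.5, 2.5, 4.0])
mapped, view = open_column(path)
total = sum(view)
view.release()  # the view must be released before the mapping is closed
mapped.close()
```

With Parquet in the middle, there would be a decode/decompress step between the file and the in-memory column; that step is the penalty I'm talking about.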
There is a penalty (time and memory) for reading Parquet (or any intermediary format) and then transforming it to Arrow. What's the point of using Arrow if this is what you're going to do? Just use a Parquet library instead of Arrow. (It's unclear to me what is actually performing the query in the Arrow step: is it the R dataframe query, or was the query pushed down to the Arrow data engine?)
After all, isn't this exactly how SQLite is being tested in this case? The original data file is loaded into SQLite and stored in SQLite's native file format, which provides all of the serialization/deserialization advantages SQLite offers out of the box. Not to mention the indexing that's defined as part of this preparation.
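For comparison, this is roughly what I understand the SQLite preparation to amount to, sketched with Python's built-in sqlite3 module. The table, column, and index names are invented, and the three rows are dummy data, not the benchmark's. The point is only that the data ends up in SQLite's native format with an index before any query is timed.

```python
import sqlite3

# ":memory:" keeps the sketch self-contained; a file path would persist
# the database in SQLite's native file format instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER PRIMARY KEY, city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips (city, fare) VALUES (?, ?)",
    [("A", 10.0), ("B", 12.5), ("A", 7.5)],  # dummy rows
)
# The index is built during preparation, before any timed query runs.
conn.execute("CREATE INDEX idx_trips_city ON trips (city)")

query = "SELECT sum(fare) FROM trips WHERE city = ?"
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("A",)).fetchall()
plan_text = " ".join(row[3] for row in plan)  # column 3 holds the plan detail
total_fare = conn.execute(query, ("A",)).fetchone()[0]
```

Here plan_text should mention idx_trips_city, i.e. the query benefits from work done at load time. A fair Arrow comparison would give Arrow the same head start with its own on-disk layout.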