https://h2oai.github.io/db-benchmark/ is obligatory to show as well. I think you should probably note that this isn't just DataFrames.jl but DataFrames.jl in conjunction with Queryverse tools. The reason is that the Queryverse tools are known to be very nice to use but not as performant as other parts of the Julia ecosystem (performance is a result of both the language and how it's used). For example, the Parquet-to-DataFrame conversion you're using has known performance issues: https://github.com/queryverse/ParquetFiles.jl/issues/32
In general, very nice benchmark contribution and thanks for helping showcase the performance landscape.
That `groupby.apply` is a lot slower than `groupby.agg` does not surprise me at all: `groupby.apply` can do a lot of things that `groupby.agg` can't, at the cost of being potentially much slower. In general, `groupby.apply` should only be used when `groupby.agg` can't do the job.
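To illustrate the distinction, here's a minimal sketch (toy data, not the benchmark's dataset): `agg` is restricted to reductions and can dispatch to fast compiled paths, while `apply` runs arbitrary Python per group and so pays per-group interpreter overhead even when it computes the same thing.

```python
import pandas as pd

# Toy frame: one grouping key, one value column.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

# agg: limited to reductions, but can use fast cythonized code paths.
sums_agg = df.groupby("key")["val"].agg("sum")

# apply: runs an arbitrary Python callable per group, so it can do far
# more (e.g. return a whole transformed frame), at Python-level cost.
sums_apply = df.groupby("key")["val"].apply(lambda s: s.sum())

# Same answer, very different cost profile on large data.
assert sums_agg.equals(sums_apply)
```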
However, are you saying that pandas's `groupby.agg` is faster than R's data.table, Julia, and Clojure? That surprises me a lot.
One possible explanation I can think of is that pandas's Parquet support is quite good compared to data.table's and Julia's. I've been asked to split the read/write part and the groupby-agg part for a more complete picture. I'll be sure to work on that in the coming weeks.
Another hypothesis, from u/joinr, about why pandas performs better on the smaller dataset:
"I wonder if there's some default column size allocation that happens up front for the 2^6 case that helps prevent growth in pandas, and maybe the heuristic falls down a little as the dataset gets larger, leading to more resizing."
Maybe one-factor groupbys are faster in pandas, while two-factor groupbys (as in https://h2oai.github.io/db-benchmark/) are slower?