
0 points · topper-123 · 5y ago · 0 comments

I can't imagine that `.to_parquet` takes any time at all, relative to `groupby.agg`. But yeah, it would be nice to get separate benchmarks for the two parts of your benchmark.

Maybe one-factor groupbys are faster in pandas, while two-factor groupbys (as in https://h2oai.github.io/db-benchmark/) are slower?
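To make the split concrete, here is a minimal sketch of how the two parts could be timed separately, with one-factor and two-factor groupbys side by side. The data is synthetic and the column names (`id1`, `id2`, `v1`), the row count, and the `timed` helper are all made up for illustration; this is not the benchmark from the post, and the numbers it prints say nothing about the original dataset.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def timed(fn, repeats=3):
    """Run fn `repeats` times; return (best wall-clock seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return best, result


# Synthetic stand-in data (hypothetical columns, not the original dataset).
rng = np.random.default_rng(0)
n = 200_000
df = pd.DataFrame({
    "id1": rng.integers(0, 100, n),
    "id2": rng.integers(0, 100, n),
    "v1": rng.random(n),
})

# Part 1: the aggregation alone, one-factor vs. two-factor.
t_one, one = timed(lambda: df.groupby("id1").agg({"v1": "sum"}))
t_two, two = timed(lambda: df.groupby(["id1", "id2"]).agg({"v1": "sum"}))


# Part 2: the write alone, on the already-aggregated frame.
def write_parquet(frame):
    with tempfile.TemporaryDirectory() as tmp:
        frame.to_parquet(os.path.join(tmp, "out.parquet"))


try:
    t_write, _ = timed(lambda: write_parquet(two))
except ImportError:
    # pandas raises ImportError when no Parquet engine (pyarrow/fastparquet) is installed.
    t_write = None

print(f"one-factor groupby.agg: {t_one:.4f}s")
print(f"two-factor groupby.agg: {t_two:.4f}s")
print("to_parquet:", "skipped (no engine)" if t_write is None else f"{t_write:.4f}s")
```

Taking the best of a few runs is just one way to damp noise; for anything serious you would want more repeats and a realistic data size.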


akhong · 5y ago

Yes, I agree with you in Pandas' case. For other libraries, however, a good chunk of the run time comes from reading the Parquet files and concatenating the partial datasets. Pandas and Spark are particularly good at reading a directory of 12 Parquet files with no noticeable performance penalty.
