A quick benchmark [0] shows that saving to GeoPackage, FlatGeobuf, or GeoParquet is roughly 10x faster than saving to CSV. The CSV is also much larger than any of the other formats.
[0]: https://gist.github.com/kylebarron/f632bbf95dbb81c571e4e64cd...
> import geopandas as gpd
> import pandas as pd
> from shapely.geometry import Point
> d = pd.read_csv('data/tracks/2024_01_01.csv')
> d.shape
(3690166, 4)
> list(d)
['user_id', 'timestamp', 'lat', 'lon']
> %%timeit -n 1
> d.to_csv('/tmp/test.csv')
14.9 s ± 1.18 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> d2 = gpd.GeoDataFrame(d.drop(['lon', 'lat'], axis=1), geometry=gpd.GeoSeries([Point(*i) for i in d[['lon', 'lat']].values]), crs=4326)
> d2.shape, list(d2)
((3690166, 3), ['user_id', 'timestamp', 'geometry'])
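As an aside, the `Point` list comprehension above is itself a Python-level loop; geopandas ships a vectorized constructor, `points_from_xy`, that builds the same GeoSeries much faster (a sketch with toy data, not the benchmark file):

```python
import pandas as pd
import geopandas as gpd

d = pd.DataFrame({"user_id": [1, 2], "lon": [30.0, 30.1], "lat": [59.9, 59.8]})

d2 = gpd.GeoDataFrame(
    d.drop(columns=["lon", "lat"]),
    geometry=gpd.points_from_xy(d["lon"], d["lat"]),  # vectorized, no Python loop
    crs=4326,
)
```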
> %%timeit -n 1
> d2.to_file('/tmp/test.gpkg')
4min 32s ± 7.5 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %%timeit -n 1
> d.to_csv('/tmp/test.csv.gz')
37.4 s ± 291 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> ls -lah /tmp/test*
-rw-rw-r-- 1 culebron culebron 228M Mar 26 21:10 /tmp/test.csv
-rw-rw-r-- 1 culebron culebron 63M Mar 26 22:03 /tmp/test.csv.gz
-rw-r--r-- 1 culebron culebron 423M Mar 26 21:58 /tmp/test.gpkg
CSV saved in 15 s, GPKG in 272 s -- an 18x slowdown. I guess your dataset is country borders, isn't it? Something that 1) has few records, so the r-tree stays small, and 2) contains linestrings/polygons that can be compactly encoded, similar to Google's polyline algorithm.
But a lot of geospatial data is just sets of points. For instance: housing stock for an entire country (a couple million points), an address database (IIRC 20M+ points), or GPS logs of many users, pulled from a logging database ordered by time and not assembled into tracks -- several million points per day.
For such datasets, use CSV; don't abuse indexed formats. (Unless you store the data long-term and actually use the spatial index for search, multiple times.)
You need to use pyogrio [1], the vectorized counterpart of fiona [0], instead. Make sure you pass `engine="pyogrio"` when calling `to_file` [2]. Fiona loops over features in Python, while pyogrio is fully compiled, so pyogrio is usually about 10-15x faster than fiona. The upcoming pyogrio 0.8 is expected to be another ~2-4x faster still [3].
[0]: https://github.com/Toblerity/Fiona
[1]: https://github.com/geopandas/pyogrio
[2]: https://geopandas.org/en/stable/docs/reference/api/geopandas...
> %%timeit -n 1
> d.to_csv('/tmp/test.csv')
10.8 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %%timeit -n 1
> d2.to_file('/tmp/test.gpkg', engine='pyogrio')
1min 15s ± 5.96 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %%timeit -n 1
> d.to_csv('/tmp/test.csv.gz')
35.3 s ± 1.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %%timeit -n 1
> d2.to_file('/tmp/test.fgb', driver='FlatGeobuf', engine='pyogrio')
19.9 s ± 512 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> ls -lah /tmp/test*
-rw-rw-r-- 1 culebron culebron 228M Mar 27 11:02 /tmp/test.csv
-rw-rw-r-- 1 culebron culebron 63M Mar 27 11:27 /tmp/test.csv.gz
-rw-rw-r-- 1 culebron culebron 545M Mar 27 11:52 /tmp/test.fgb
-rw-r--r-- 1 culebron culebron 423M Mar 27 11:14 /tmp/test.gpkg