A quick benchmark [0] shows that saving to GeoPackage, FlatGeobuf, or GeoParquet is roughly 10x faster than saving to CSV. The CSV is also much larger than any of the other formats.
[0]: https://gist.github.com/kylebarron/f632bbf95dbb81c571e4e64cd...
> import geopandas as gpd
> import pandas as pd
> from shapely.geometry import Point
> d = pd.read_csv('data/tracks/2024_01_01.csv')
> d.shape
(3690166, 4)
> list(d)
['user_id', 'timestamp', 'lat', 'lon']
> %%timeit -n 1
> d.to_csv('/tmp/test.csv')
14.9 s ± 1.18 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> d2 = gpd.GeoDataFrame(d.drop(['lon', 'lat'], axis=1), geometry=gpd.GeoSeries([Point(*i) for i in d[['lon', 'lat']].values]), crs=4326)
> d2.shape, list(d2)
((3690166, 3), ['user_id', 'timestamp', 'geometry'])
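As an aside, the `Point` list comprehension above is itself a Python-level loop; geopandas ships a vectorized constructor, `points_from_xy`, that builds the same GeoSeries much faster (a sketch with toy data, not the benchmark file):

```python
import pandas as pd
import geopandas as gpd

d = pd.DataFrame({"user_id": [1, 2], "lon": [30.0, 30.1], "lat": [59.9, 59.8]})

d2 = gpd.GeoDataFrame(
    d.drop(columns=["lon", "lat"]),
    geometry=gpd.points_from_xy(d["lon"], d["lat"]),  # vectorized, no Python loop
    crs=4326,
)
```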
> %%timeit -n 1
> d2.to_file('/tmp/test.gpkg')
4min 32s ± 7.5 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %%timeit -n 1
> d.to_csv('/tmp/test.csv.gz')
37.4 s ± 291 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> ls -lah /tmp/test*
-rw-rw-r-- 1 culebron culebron 228M Mar 26 21:10 /tmp/test.csv
-rw-rw-r-- 1 culebron culebron 63M Mar 26 22:03 /tmp/test.csv.gz
-rw-r--r-- 1 culebron culebron 423M Mar 26 21:58 /tmp/test.gpkg
CSV saved in 15 s, GPKG in 272 s -- an 18x slowdown. I guess your dataset is country borders, isn't it? Something that 1) has few records, so the r-tree stays small, and 2) contains linestrings/polygons that can be compactly encoded, similar to Google's polyline algorithm.
But a lot of geospatial data is just sets of points. For instance: housing stock for an entire country (a couple million points), an address database (IIRC 20M+ points), or GPS logs of many users, pulled from a logging database ordered by time and not assembled into tracks -- several million points per day.
For such datasets, use CSV; don't abuse indexed formats. (Unless you store the data long-term and actually use the spatial index for search, multiple times.)
You need to use pyogrio [1], the vectorized counterpart of fiona [0], instead. Make sure you pass `engine="pyogrio"` when calling `to_file` [2]. Fiona loops over features in Python, while pyogrio is fully compiled, so pyogrio is usually about 10-15x faster than fiona. The upcoming pyogrio 0.8 is expected to be another ~2-4x faster still [3].
[0]: https://github.com/Toblerity/Fiona
[1]: https://github.com/geopandas/pyogrio
[2]: https://geopandas.org/en/stable/docs/reference/api/geopandas...
> %%timeit -n 1
> d.to_csv('/tmp/test.csv')
10.8 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %%timeit -n 1
> d2.to_file('/tmp/test.gpkg', engine='pyogrio')
1min 15s ± 5.96 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %%timeit -n 1
> d.to_csv('/tmp/test.csv.gz')
35.3 s ± 1.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %%timeit -n 1
> d2.to_file('/tmp/test.fgb', driver='FlatGeobuf', engine='pyogrio')
19.9 s ± 512 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> ls -lah /tmp/test*
-rw-rw-r-- 1 culebron culebron 228M Mar 27 11:02 /tmp/test.csv
-rw-rw-r-- 1 culebron culebron 63M Mar 27 11:27 /tmp/test.csv.gz
-rw-rw-r-- 1 culebron culebron 545M Mar 27 11:52 /tmp/test.fgb
-rw-r--r-- 1 culebron culebron 423M Mar 27 11:14 /tmp/test.gpkg