The decompressed size should be okay, since merely holding 36MB of data is not the same as parsing and JITing 36MB of JS.
What are people's experiences with this?
Arrow is especially powerful across the Wasm <--> JS boundary! In fact, I wrote a library to interpret Arrow from Wasm memory into JS without any copies [0]. (Motivating blog post [1])
[0]: https://github.com/kylebarron/arrow-js-ffi
[1]: https://observablehq.com/@kylebarron/zero-copy-apache-arrow-...
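The zero-copy idea boils down to taking a typed-array view directly over Wasm linear memory instead of copying bytes out into JS. A minimal generic sketch (the offset and length here are made up for illustration, not arrow-js-ffi's actual API):

```javascript
// Sketch: reading numbers straight out of Wasm linear memory, no copy.
// In real FFI code the offset/length would come from the Wasm side
// (e.g. a pointer returned by an exported allocator); here they're invented.
const memory = new WebAssembly.Memory({ initial: 1 }); // one 64 KiB page

const byteOffset = 128; // pretend this pointer came from Wasm
const length = 2000;    // element count (2000 * 8 bytes fits in the page)
const view = new Float64Array(memory.buffer, byteOffset, length);

// Writes through the view land directly in Wasm memory -- the JS side is
// just a window onto the same bytes, which is what makes crossing the
// boundary cheap.
view[0] = 3.14;
const raw = new DataView(memory.buffer);
console.log(raw.getFloat64(byteOffset, true)); // 3.14 (little-endian read)
```

The library linked above builds Arrow arrays on top of exactly this kind of view, so the deserialization cost is bookkeeping rather than byte copying.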
For high-performance code, I'd have expected overhead measured in percents, not multiples, and I'm not surprised to hear slowdowns for anything straying beyond that -- cool to see folks have expanded further! More recently we've been having good experiences here with Perspective <-arrow-> Loaders, enough so that we haven't had to dig deeper. Our current code targets < 24 FPS, as genAI data analytics is more about bigger volumes than velocity, so I'm unsure beyond that. However, it's hard to imagine going much faster, given it's bulk typed arrays without copying, especially in real code.
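The "bulk typed arrays without copying" point can be sketched with transferable ArrayBuffers -- a generic illustration of moving column data between contexts, not Perspective's or Loaders' actual code:

```javascript
// Sketch: moving bulk binary data between JS contexts without copying.
// Transferring an ArrayBuffer (as postMessage to a worker does) moves
// ownership of the bytes instead of duplicating them.
const buf = new ArrayBuffer(8 * 1024 * 1024); // 8 MiB of pretend column data
new Float64Array(buf).fill(1.5);

// structuredClone with a transfer list models what postMessage does:
const moved = structuredClone(buf, { transfer: [buf] });

console.log(buf.byteLength);   // 0 -- the original is detached; nothing was copied
console.log(moved.byteLength); // 8388608 -- same bytes, new owner
```

This is why frame budgets can hold even at large volumes: the per-frame cost is handing off ownership, not re-serializing megabytes.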
When I benchmarked the fastest lib I could find to simply run the protobuf decode (https://github.com/mapbox/pbf), it was 5x slower than native JSON parsing in browsers for dataframe-like structures (e.g. a few dozen 2k-long arrays of floats) -- and that was before even hitting any ArrowJS iterators, etc.
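For anyone wanting to reproduce the JSON side of that comparison, here's a sketch of the payload shape described (a few dozen 2k-long float arrays; the column names and count are illustrative, and this only times `JSON.parse`, not pbf):

```javascript
// Build a dataframe-like JSON payload: 24 columns x 2000 float rows.
const cols = {};
for (let c = 0; c < 24; c++) {
  cols[`col${c}`] = Array.from({ length: 2000 }, (_, i) => Math.fround(i * 0.1));
}
const json = JSON.stringify(cols);

// Time the native parse; this is the baseline a protobuf decoder has to beat.
const t0 = performance.now();
const parsed = JSON.parse(json);
const t1 = performance.now();
console.log(`JSON.parse: ${(t1 - t0).toFixed(2)} ms for ${(json.length / 1e6).toFixed(1)} MB`);
```

Native `JSON.parse` runs inside the engine with no JS-level per-field dispatch, which is a big part of why a pure-JS protobuf decoder struggles to match it on numeric-heavy payloads.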
Grafana's Go backend uses Arrow dataframes internally, so using the same on the frontend seemed like a logical initial choice back then, but the performance simply didn't pan out.
There is a library by the same author called lonboard that provides the JS bits inside JupyterLab. https://github.com/developmentseed/lonboard
<speculation>I think it is based on the Kepler.gl / Deck.gl data loaders that go straight to GPU from network.</speculation>