It was rejected as being unpythonic[1], even though the base functionality for saving a particular data frame is already present.
Could what dask is doing be adapted to the simpler case of saving a workspace snapshot?
[1] https://github.com/pydata/pandas/issues/12381#issuecomment-1...
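For what it's worth, a workspace snapshot can be approximated with nothing but the standard library, by pickling every picklable top-level variable into one file. This is only a sketch of the idea; the helper names (`save_workspace`, `load_workspace`) are hypothetical, not an actual pandas or dask API:

```python
import pickle

def save_workspace(namespace, path):
    """Pickle every picklable, non-underscore entry of `namespace` to `path`."""
    snapshot = {}
    for name, value in namespace.items():
        if name.startswith("_"):
            continue
        try:
            snapshot[name] = pickle.dumps(value)
        except Exception:
            pass  # skip unpicklable objects (open files, sockets, ...)
    with open(path, "wb") as f:
        pickle.dump(snapshot, f)

def load_workspace(path):
    """Return the saved variables as a dict of name -> object."""
    with open(path, "rb") as f:
        snapshot = pickle.load(f)
    return {name: pickle.loads(blob) for name, blob in snapshot.items()}
```

In practice you would pass `globals()` (or an IPython user namespace) as `namespace`; pandas objects pickle fine, which is why the per-DataFrame save already exists.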
I wonder whether this would work if the dask arrays are not equal in length, for example if the files were time series of unequal duration.
Also, are there any plans for dask to support distributed numpy functions that require kernel computation at the array boundaries, for example scipy.signal.lfilter? I believe this would require ghosting or other inter-dask-array communication that is not yet present.
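To illustrate what "ghosting" means here, below is a minimal pure-Python sketch, independent of dask: each chunk borrows `depth` boundary elements from its neighbours before a windowed kernel (a 3-point moving average standing in for a real filter) is applied, then the ghost region is trimmed so the per-chunk results concatenate to the same answer as the unchunked computation. The helper names (`map_with_ghosts`, `moving_average`) are hypothetical:

```python
def moving_average(xs, width=3):
    """Centred moving average; edges use whatever window is available."""
    half = width // 2
    out = []
    for i in range(len(xs)):
        window = xs[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out

def map_with_ghosts(chunks, kernel, depth=1):
    """Apply `kernel` to each chunk padded with `depth` ghost elements
    from its neighbours, then trim the ghosts from each result."""
    results = []
    for i, chunk in enumerate(chunks):
        left = chunks[i - 1][-depth:] if i > 0 else []
        right = chunks[i + 1][:depth] if i < len(chunks) - 1 else []
        padded = left + chunk + right          # exchange ghost cells
        filtered = kernel(padded)
        start = len(left)                      # trim the ghost region
        results.extend(filtered[start:start + len(chunk)])
    return results

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
chunked = [data[:3], data[3:]]
# Chunked result matches the unchunked computation at the boundary:
assert map_with_ghosts(chunked, moving_average) == moving_average(data)
```

In a distributed setting the ghost exchange is exactly the extra inter-chunk communication the comment describes; a stateful filter like `lfilter` is harder still, since its boundary dependence is not limited to a fixed window.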
> Reduction speed: The computation of normalized temperature, z, took a surprisingly long time. I’d like to look into what is holding up that computation.