TBH, your claims sound like you've just "discovered" message-passing, which many, many languages, runtimes and operating systems have been using for years, if not decades. (https://en.wikipedia.org/wiki/Message_passing)
In other words... it's not a revolution.
ZProc seems to be a simple library that pickles data structures through a central (pubsub?) server.
This is not the way to get remotely close to "high performance". What you've created here is pretty much what multiprocessing gives you already in a more performant solution (i.e. no zeromq involved).
Minor point of pedantry which I'll state because it's an often-overlooked timesaver for folks developing on multiprocessing: not only is MP potentially faster for transferring data between processes compared to this solution, but it can also be way, way faster in situations where you have all your data before creating your processes/pool and just want to farm it out to your MP processes without waiting for it all to be chunked/pickled/unpickled.
Because of copy-on-write fork magic, many multiprocessing configurations (including the default) can "send" that data to child processes in constant* time, if the data's already present in e.g. a global when children are created.
This pattern can be used to totally bypass all considerations of performance/CPU/etc. for pickling/unpickling data and lends a massive speed boost in certain situations--e.g. a massive dataset is read into memory at startup, and then ranges of that dataset are processed in parallel by a pool of MP processes, each of which will return a relatively small result-set back to the parent, or each of which will write its processed (think: data scrubbing) range to a separate file which could be `cat`ed together, or written in parallel with careful `seek` bookkeeping.
Unix-ish OSes only, though (unless the fork() emulation in WSL works for this--I have not tested that).
* Technically it's O(N) for the size of data you have in memory at process pool start, because fork() can take time, but the multiplier is small enough in practice compared to sending data to/from MP processes via queues or whatever that it might as well be constant.
Note that this works for big objects, but not for small objects. E.g. if you fork-share a large list of integers or dicts or something like that, then you don't get any memory usage benefits, because every access will cause a refcount-write and that will copy the whole page containing the object.
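To make the refcount point concrete, here is a minimal sketch. It is CPython-specific, and `refcount_delta` is a made-up helper name, not part of any library. It only shows that storing one more reference to an object changes the count kept in the object's own header; after a fork, that write is what dirties the copy-on-write page holding the object.

```python
import sys


def refcount_delta():
    """Return how much an object's refcount changes when one more reference is stored."""
    item = ["some", "payload"]   # stands in for one element of a fork-shared structure
    holders = []
    before = sys.getrefcount(item)
    holders.append(item)         # the data is untouched; only a reference is added
    after = sys.getrefcount(item)
    return after - before


print(refcount_delta())
```

On CPython the count goes up by one even though the list contents never change, so "read-only" access to many small objects still writes to the pages they live on.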
> * Technically it's O(N) for the size of data you have in memory at process pool start
It's not quite that simple; sharing n pages can take very little time or a bit more time; it depends on how the pages are mapped; sharing a large mapping doesn't take longer than a small mapping.
Very true; I went into some more detail about my typical use case above. Using MP for lots of small objects that you've already extracted from raw data/IO/whatever is a game of diminishing returns. It's in situations like that where traditional shared-memory starts looking more and more attractive. When I get to that point, while multiprocessing and some other packages provide a few nice abstractions over shmem, I start looking for other platforms than Python.
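For reference, one of those abstractions is the standard library's `multiprocessing.shared_memory` module (Python 3.8+). The sketch below is only an illustration under simplifying assumptions: `shared_roundtrip` is a hypothetical helper, and the "second process" is simulated by attaching to the block by name from the same process.

```python
from multiprocessing import shared_memory


def shared_roundtrip(payload: bytes) -> bytes:
    """Copy payload into a named shared-memory block and read it back via a second handle."""
    owner = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        # The block may be rounded up to a page size, so always slice explicitly.
        owner.buf[:len(payload)] = payload
        # A worker process would attach like this, using only the block's name.
        reader = shared_memory.SharedMemory(name=owner.name)
        try:
            return bytes(reader.buf[:len(payload)])
        finally:
            reader.close()
    finally:
        owner.close()
        owner.unlink()  # free the block once every handle is closed


print(shared_roundtrip(b"raw bytes, no pickling"))
```

Because the block holds raw bytes rather than Python objects, there are no per-object refcounts to dirty, but you also have to do your own layout and bookkeeping.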
> It's not quite that simple; sharing n pages can take very little time or a bit more time
Definitely; I was simplifying in order to compare the overhead of fork with the overhead of pickling/shipping/unpickling data. Sharing large pieces of data with even very slow fork()ing is, in my experience, so much faster than the [de]serialize approach that it is effectively constant in comparison, but I didn't mean to discount the complexities of what make certain forking situations faster/slower than others.
Have you tried this or gotten it working? The fly in the ointment is the reference count. Add a reference and BOOM, you suddenly have a huge copy. It can be made to work efficiently in certain cases, but it takes a lot of care.
Most of the situations where I care enough about memory and/or pickling overhead fall into the "take a giant block of binary/string data and process ranges of it in parallel" family, in which case there aren't too many references until the subprocesses get to work. If I had more complex structures of data I'd probably get a little less performance bang for my buck, but even then I suspect it would be much faster than multiprocessing's strategy: pickling and sending data between processes via pipes is many times slower than moving the equivalent amount of data by dirty-writing pages into a forked child.
That's not meant to discount anything y'all are saying, though: refcounts are definitely a very important thing to be mindful of in this situation. A child comment suggests gc.freeze, which can help, but can't entirely save you from thinking about this stuff.
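A rough sketch of the `gc.freeze` idea, assuming a Unix-like OS where the `fork` start method is available. `BIG_DATA`, `total_length` and `run_frozen` are hypothetical names for illustration. The disable/freeze/enable sequence follows the pattern described in the `gc` module docs; note that it only keeps the garbage collector from touching the inherited objects, while refcount writes in the children still happen.

```python
import gc
import multiprocessing as mp

# Hypothetical dataset: many small objects, the awkward case for copy-on-write.
BIG_DATA = [str(i) for i in range(200_000)]


def total_length(start, end, conn):
    """Child: read a range of the inherited data and send back one small result."""
    gc.enable()  # re-enable collection in the child
    conn.send(sum(len(s) for s in BIG_DATA[start:end]))
    conn.close()


def run_frozen(n_workers=2):
    ctx = mp.get_context("fork")  # assumption: fork is available on this platform
    gc.disable()                  # no collections between here and the forks
    gc.freeze()                   # move all tracked objects to the permanent generation
    try:
        step = len(BIG_DATA) // n_workers
        jobs = []
        for i in range(n_workers):
            recv_end, send_end = ctx.Pipe(duplex=False)
            end = len(BIG_DATA) if i == n_workers - 1 else (i + 1) * step
            proc = ctx.Process(target=total_length, args=(i * step, end, send_end))
            proc.start()
            send_end.close()      # the parent only keeps the receiving end
            jobs.append((proc, recv_end))
        results = [conn.recv() for _, conn in jobs]
        for proc, _ in jobs:
            proc.join()
        return results
    finally:
        gc.unfreeze()
        gc.enable()


if __name__ == "__main__":
    print(run_frozen())
```

Only the per-range totals travel back over the pipes; the dataset itself is never pickled.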
It's also very important to be mindful of what happens with your program at shutdown: if you have a big set of references shared via fork(), and all your children shut down around the same time, your memory usage can shoot up as each child tries to de-refcount all objects in scope. This applies even if each child was only operating on a subset of the references shared to it. If you're processing, say, 1GB of data from the parent in 8 children on a 4 core system (doing M>N(cpu) because e.g. children spend some time writing results out to the FS/network), a near-simultaneous shutdown could allocate 9GB of memory in the very worst case, which can cause OOM or unexpected swapping behavior. Throttled shutdowns using a semaphore or equivalent are the way to go in that case.
In my workload that's exactly when it hits.
We ran into this when sharing different parts of a huge matrix with different workers. We had to be extra careful not to create new references in the subprocesses. We were operating at a scale where, if we got it wrong, the OOM killer would kill us.
Working with memory-mapped arrays is more forgiving.
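A minimal sketch of the memory-mapped approach, using the standard library's `mmap` instead of whatever array library was used above (most likely NumPy, but that is a guess). `sum_range` and `demo` are hypothetical helpers, and the temp file stands in for a large matrix dumped to disk as raw bytes.

```python
import mmap
import os
import tempfile


def sum_range(path, start, end):
    """Sum the bytes in [start, end) of a file without reading the whole file."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return sum(view[start:end])  # slicing copies only the requested range


def demo():
    # Hypothetical stand-in for a big dataset stored as raw bytes.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(256)) * 4)
        path = tmp.name
    try:
        # Each call could just as well run in a separate worker process.
        return [sum_range(path, 0, 512), sum_range(path, 512, 1024)]
    finally:
        os.remove(path)


print(demo())
```

The mapped pages are plain file data with no Python object headers in them, so reading a range never triggers the refcount writes discussed above.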
In fact, I think performance-centric development is a lesser-known evil.
> have all your data before creating your processes/pool
ZProc exposes the required API for this (nothing new, just the Python API) :)
https://zproc.readthedocs.io/en/latest/api.html#zproc.Proces... (args and kwargs)
> a massive dataset
Wouldn't you be better off using a database for that kind of work?
> Because of copy-on-write fork magic, many multiprocessing configurations (including the default) can "send" that data to child processes in constant time
Any resources on how to implement that?
    import multiprocessing

    # Loaded once in the parent; forked children inherit it copy-on-write.
    big_data = read_huge_binary_or_string()

    def process_range(rng):
        start, end = rng
        do_something(big_data[start:end])

    pool = multiprocessing.Pool(2)
    pool.map(process_range, [
        (0, 10000),
        (10000, len(big_data)),  # slices exclude `end`, so resume where the last range stopped
    ])

The `multiprocessing.Pool` uses a `multiprocessing.Queue` in the background to retrieve the results after completion.
The `multiprocessing.Queue` in turn uses `multiprocessing.connection.Pipe` and sends the pickled objects over to the wire.
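A small sketch of what that means in practice, using only `multiprocessing.Pipe` from the standard library. `roundtrip_through_pipe` is a made-up helper, and both ends are used from a single process purely to keep the example short (so it only suits small objects that fit in the pipe buffer).

```python
import multiprocessing


def roundtrip_through_pipe(obj):
    """Send obj through a pipe and return what arrives on the other end."""
    recv_end, send_end = multiprocessing.Pipe(duplex=False)
    try:
        send_end.send(obj)      # pickles obj and writes the bytes into the pipe
        return recv_end.recv()  # reads the bytes back and unpickles a new object
    finally:
        send_end.close()
        recv_end.close()


original = {"rows": [1, 2, 3]}
received = roundtrip_through_pipe(original)
print(received == original, received is original)
```

The received object is equal to the original but is a separate copy, which is the serialize-and-ship cost being discussed.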
So I don't see how this is any better than ZMQ.
Just because stuff has an API that doesn't look like message passing doesn't mean it can't be doing that in the background. Which is funny, because that's the whole point of ZProc.
I realize the subtle difference that CPython uses pipes, not sockets, unlike ZMQ. But that doesn't really make a difference now, does it?
Proof:
Process Pool worker, returning the result by using `outqueue.put()`
https://github.com/python/cpython/blob/86b89916d1b0a26c1e77f...
multiprocessing Queue, initializing a Pipe
https://github.com/python/cpython/blob/86b89916d1b0a26c1e77f...
multiprocessing Queue serializing data to send it using that Pipe
https://github.com/python/cpython/blob/86b89916d1b0a26c1e77f...
I thought you were talking about sending data to child processes in constant* time, while it was running.
I never claimed it to be performant!
"Above all, ZProc is written for safety and the ease of use."
(Read here - https://github.com/pycampers/zproc?files=1#faq)
> It's not a revolution
I totally agree. It's just a better way of doing things zmq already perfected. Like, tell me if you've ever seen a python object that has a `dict` API, but does message passing in the background.
> central (pubsub?) server.
Central server, yes. It uses PUB-SUB for state watching and REQ-REP for everything else.
> you've just "discovered" message-passing
Guess you're right? 2 years is a peanut on the time scale...
P.S. Thanks for all the feedback, I've been dying to hear something for a while now.
Don't get me wrong, message-passing has some advantages, but they certainly aren't that it 'solves' parallelism. If you wish to know more, investigate:
- Smalltalk and Erlang (for message passing languages).
- QNX (for a message-passing OS)
- mpi4py (for a message-passing Python library, mpi is the grandfather of message passing libraries that runs everywhere).
- Occam & the transputer for an example of a hardware-mp implementation (actually it's Communicating Sequential Processes, but for your purposes it would be enlightening).
- golang for a modern-day implementation of CSP.
- Python implementation of CSP (https://github.com/futurecore/python-csp)
- Discussion about MP (http://wiki.c2.com/?MessagePassingConcurrency, for more just google it)
Basically, it's great that you want to learn about concurrency & parallelism, but you've come to a gun fight with a blunt butter knife.
If you could point out some stuff from ZProc's page, that would be nice!
> mpi is the grandfather of message passing libraries
Never heard of it before, but a simple Google search reveals that it _might_ be more performant than zmq, but not as fault-tolerant and flexible. It really looks like a niche thing, judging from this comment by Pieter Hintjens:
> Why smart cloud builders are betting everything on 0MQ. In detail, compare to the alternatives. Hand-rolling your own TCP stack is insane. Using any broker-based product won't scale. Buying licenses from IBM or TIBCO would eat up your capital. Supercomputing products like MPI aren't designed for this scale. There is literally no alternative.
(http://zeromq.org/docs:the-ten-minute-talk)
> Don't get me wrong, message-passing has some advantages, but they certainly aren't that it 'solves' parallelism.
Doesn't it? (For most people)
---
I can't believe I'm hearing words against zmq on HN, it's weird.
Even the guys over at Dask settled on ZMQ over anything - https://github.com/dask/distributed/issues/776
P.S. Seems like you know quite a lot about this topic. Do you have any projects of your own that I can see?
Bottom line, I think most people would be happy doing message-passing parallelism in the real world. Sure, it doesn't look that good in theory, but it works damn well in practice.
...also, nanomsg is the 'improved' successor.
Also, MPI isn't a 'niche' thing; it's the way a large proportion of high-performance applications have been implemented for a few decades (think Crays & weather prediction). Zeromq has a few simple web-apps using it (I exaggerate slightly).