The implementation uses shared pointers to replicate as little data as possible. The memory management is there to maximize the amount of data you can process on a single computer with limited memory. There is a simple zmq wrapper that can be used to link bunch of these in parallel(or any other structure) on multiple computers. The same zmq wrapper can be used for other online learning algorithms. I just didn't implement the others yet.
If there will be more people interested, I could whip up a few test runs and examples.