Btw @rkrzr I hope you don't mind me asking, but "niet geschoten is altijd mis" ("nothing ventured, nothing gained"): are you looking for freelance assistance at the moment at all? I've been consistently impressed by the engineering work done at Channable, for example the Aho-Corasick work and the work relating to compact regions, but I am not able to commit to full-time work at the moment. Lately I've been mostly working with Ruby and DBA/database-performance work, but Haskell is the first language I got properly good at and it will always have a warm place in my heart. I'm pretty experienced at the intersection between backend dev and DBA/devops-type problems, and if you are experiencing any problems in that area I would love to help out. Looking at the Channable GitHub page, I just realized I even made some PRs for Icepeak way back in 2020. :)
We are unfortunately not looking for freelance assistance at the moment, but we regularly have Haskell Engineering positions open, so perhaps that will be interesting to you in the future.
For real-world distributed systems, backpressure is very important: without it you can DoS yourself through a cascading overload.
Of course, depending on the rest of the system you could still have a bottleneck somewhere. For example: if you pull jobs from a queue in (say) Redis but don't have enough capacity to process jobs faster than they are enqueued, the queue will eventually fill up.
For parallel processing, backpressure becomes a bit more complicated, in particular if you end with a sequential consumer. If you only start parallel tasks when the sequential consumer demands their results, you'll find yourself still doing one task at a time. The straightforward solution is to work ahead a bit, spawning just enough parallel work to fill a small buffer that the consumer can read from. You still get backpressure this way, but you may have done a little too much work if the consumer doesn't want the rest of the data. We'll show this in the third blog post of the series.
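The "work ahead a bit" idea can be sketched with a bounded queue. This is a minimal illustration, not Channable's implementation; the `expensiveTask` function and buffer sizes are made up, and it uses a single producer thread for brevity where real code would spawn several workers:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
  (atomically, newTBQueueIO, readTBQueue, writeTBQueue)
import Control.Monad (forM_, replicateM)

-- Hypothetical expensive computation standing in for real parallel work.
expensiveTask :: Int -> Int
expensiveTask n = n * n

-- Run tasks ahead of the consumer, but only up to 'bufferSize' results
-- ahead: a full TBQueue blocks the producer, which is the backpressure.
workAhead :: Int -> [Int] -> IO [Int]
workAhead bufferSize inputs = do
  buffer <- newTBQueueIO (fromIntegral bufferSize)
  -- Producer thread: fills the bounded buffer, blocking when it's full.
  _ <- forkIO $ forM_ inputs $ \i ->
    atomically $ writeTBQueue buffer (expensiveTask i)
  -- Sequential consumer: pulls results one at a time.
  replicateM (length inputs) (atomically (readTBQueue buffer))

main :: IO ()
main = workAhead 4 [1 .. 10] >>= print
```

The producer never gets more than four results ahead of the consumer, so at most that much "wasted" work is done if the consumer stops early.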
So it's pull-based and not push-based like most other streaming libs.
Does anyone know how this compares to FS2 or Iteratees, then? (Both are also pull-based streaming solutions.)
https://en.wikipedia.org/wiki/Iteratee
Looks quite similar to me. Is the Scala FS2 lib maybe even a clone of the Haskell solution? Or do they differ in important aspects?
If you had conduits that were too big to fit in memory at once, would you (Channable) stream them to local disk (either explicitly or just using virtual memory)? Or would you be able to distribute work between multiple machines with a cluster-aware Conduit type?
Your scheduler could split the job up across multiple machines and run the same Conduit pipeline on all of them, and only the conduit steps that need to communicate with each other would do so.
Separate question, do you have any actions that produce more output than their input? I could imagine some customers might find it useful to generate the cartesian product of two inputs, or the power set of one input.
Yep! That's currently still the case. We do have some ideas to put caches on disk using mmapped files, so that you get fast access when it all fits in memory, but the OS can also drop the pages when it wants to.
For the moment we just use instances with 128GB memory, and those can still fit the datasets of the biggest customers that we have right now. Datasets go up to 10s of GBs, so it's not really 'big data' that needs to be distributed across machines. Due to the regular occurrence of aggregations (sort/group/window/deduplicate) that require at least _some_ synchronization, we only have small sections that can be completely parallelized. It's already a challenge to use all the cores on a single machine in an efficient manner, and I don't think we'll achieve much by using multiple machines for a single job. We've discussed a lot of this in an earlier blog post here: https://www.channable.com/tech/why-we-decided-to-go-for-the-...
> Separate question, do you have any actions that produce more output than their input?
Yep, we don't have cartesian products but we do have a 'split' action. Typical usage might be that a customer has a "sizes" field with values like "M,L,XL" and that they would split on that field so that a single item becomes three items for the separate sizes. The increase in the number of items is usually limited, and the increase in memory usage is even smaller because at most points during the data processing we only store the changed fields, and refer to the original item for the rest. In these cases multiple items will point to the same original item.
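A toy sketch of what such a 'split' action could look like. The `Item` representation, field names, and `splitOn` helper here are invented for illustration; they are not Channable's actual data model (which, as described above, shares unchanged fields with the original item rather than copying them):

```haskell
import Data.Maybe (fromMaybe)

-- Simplified item: just an association list of field names to values.
type Item = [(String, String)]

-- Split a string on a separator character, e.g. "M,L,XL" -> ["M","L","XL"].
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn sep rest

-- Turn one item into one item per value of the given field, keeping
-- all other fields as they were.
splitField :: String -> Item -> [Item]
splitField field item =
  [ [ if k == field then (k, v') else (k, v) | (k, v) <- item ]
  | v' <- splitOn ',' (fromMaybe "" (lookup field item)) ]

main :: IO ()
main = mapM_ print $
  splitField "sizes" [("title", "T-shirt"), ("sizes", "M,L,XL")]
```

Here the single T-shirt item becomes three items, one per size, matching the behaviour described above.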
Just because you can dedup a conduit doesn't mean you have to. We use conduits for streaming multi-gigabyte files while staying within megabytes of memory use; guaranteeing that is one of the main selling points of libraries like conduit: https://github.com/snoyberg/conduit#readme
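To make the constant-memory point concrete, here is a small sketch assuming the conduit package is available. Elements flow through the pipeline one at a time, so even a very long stream never gets materialized as a whole list:

```haskell
import Conduit (mapC, runConduit, sumC, yieldMany, (.|))

main :: IO ()
main = do
  total <- runConduit $
    yieldMany [1 :: Int .. 1000000]  -- source: yields one element at a time
      .| mapC (* 2)                  -- per-element transformation
      .| sumC                        -- sink: folds without buffering the stream
  print total
```

The same shape works with `sourceFile`/`sinkFile` from conduit-extra for the gigabytes-on-disk case; the memory profile stays flat because no stage holds on to more than the element in flight.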