Here's a very quick alternative:
Push all messages into a channel IN.
A goroutine reads from IN and inserts into a sorted TREE.
The same goroutine reads the max from TREE and pushes it to channel OUT.
Workers read from OUT.
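A minimal sketch of that design (my names, loosely based on the gist; a sorted slice stands in for the red-black tree, and buffer sizes are guesses):

```go
package main

import "fmt"

// timestamp is the priority key; larger means higher priority.
type timestamp int

// Prioq wires the pieces together: an In channel, a sorting
// structure, and an Out channel that workers read from.
type Prioq struct {
	In  chan timestamp
	Out chan timestamp
}

// NewPrioq starts the goroutine that moves values In -> TREE -> Out.
func NewPrioq() *Prioq {
	q := &Prioq{
		In:  make(chan timestamp, 100),
		Out: make(chan timestamp, 100),
	}
	go q.run()
	return q
}

func (q *Prioq) run() {
	var sorted []timestamp // kept ascending; stand-in for the tree
	for v := range q.In {
		// insert v, keeping the slice sorted
		i := 0
		for i < len(sorted) && sorted[i] < v {
			i++
		}
		sorted = append(sorted, 0)
		copy(sorted[i+1:], sorted[i:])
		sorted[i] = v
		// pop the max and push it out; note read and write happen in
		// lockstep here, so this is FIFO-ish rather than strictly sorted
		max := sorted[len(sorted)-1]
		sorted = sorted[:len(sorted)-1]
		q.Out <- max
	}
	close(q.Out)
}

func main() {
	q := NewPrioq()
	q.In <- timestamp(20)
	q.In <- timestamp(10)
	q.In <- timestamp(30)
	close(q.In)
	for v := range q.Out {
		fmt.Println("Read", v)
	}
}
```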
This is a pretty mediocre implementation, but I post it because the numbers in the article were so low; I think we can do better with something far less sophisticated:

https://gist.github.com/fmstephe/4fdc930ff180be3e92c693ad5a2...
I timed it at writing 600,000 messages concurrently and reading 600,000 sequentially in 1.5 seconds. Not a perfect measurement, but I'm on a train and about to get to my stop.
If we need really high performance, we can start substituting faster, and simpler, queues for Go's channels, and we can very likely improve the red-black tree that does the sorting. Each of these components is simpler, more likely to be correct, and, from what I can see, substantially faster than the sophisticated lock-free queue described in the article.
I don't want to come across as snarky, but I am surprised by the popularity of linked-list-based lock-free data structures like this.
EDIT: The tone of my post came across nastier than I intended. I quite enjoyed the article, and I'd be interested in how far the performance can be improved.
For example, the following:
    q := NewPrioq()
    q.In <- timestamp(20)
    q.In <- timestamp(10)
    q.In <- timestamp(30)
    o := <-q.Out
    fmt.Println("Read", o)
prints "Read 20" rather than "Read 30".

Furthermore, this input:
    q := NewPrioq()
    for i := 0; i < 202; i++ {
        q.In <- timestamp(i)
    }
    o := <-q.Out
    fmt.Println("Read", o)
will result in a deadlock.

As far as I can tell, this code behaves like a FIFO queue with capacity 201. Each iteration of the loop in Prioq.run() reads once from the input channel, inserts into the empty tree, removes the "max" element from the single-item tree, and then pushes it into the output channel.
Reading from the input channel and writing to the output channel are performed in lockstep rather than as needed. A correct implementation of this pattern would need a select block to avoid the deadlock, and a way to offer the current max on the output channel only when one is actually available, to avoid the stale-max-value problem.
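A sketch of what that might look like (names are mine, a sorted slice stands in for the tree, and Out is unbuffered so the value handed to a worker is always the current max at hand-off time):

```go
package main

import (
	"fmt"
	"sort"
)

type timestamp int

type Prioq struct {
	In  chan timestamp
	Out chan timestamp
}

func NewPrioq() *Prioq {
	q := &Prioq{
		In:  make(chan timestamp, 256),
		Out: make(chan timestamp), // unbuffered: no stale values sit in a buffer
	}
	go q.run()
	return q
}

func (q *Prioq) run() {
	var sorted []timestamp // ascending; stand-in for the tree
	for {
		var out chan timestamp // nil disables the send case in select
		var max timestamp
		if len(sorted) > 0 {
			max = sorted[len(sorted)-1]
			out = q.Out // only offer a max when we have one
		}
		select {
		case v, ok := <-q.In:
			if !ok {
				// In closed: flush remaining items in priority order
				for i := len(sorted) - 1; i >= 0; i-- {
					q.Out <- sorted[i]
				}
				close(q.Out)
				return
			}
			i := sort.Search(len(sorted), func(i int) bool { return sorted[i] >= v })
			sorted = append(sorted, 0)
			copy(sorted[i+1:], sorted[i:])
			sorted[i] = v
		case out <- max:
			sorted = sorted[:len(sorted)-1]
		}
	}
}

func main() {
	q := NewPrioq()
	for i := 0; i < 202; i++ {
		q.In <- timestamp(i) // no deadlock: run keeps draining In
	}
	close(q.In)
	n := 0
	for range q.Out {
		n++
	}
	fmt.Println("Read", n, "values")
}
```

The nil-channel trick is what decouples reads from writes: when the tree is empty the send case is disabled, so the goroutine can never block trying to offer a value it doesn't have.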
But the purpose of that code was just to indicate the rough performance that was possible using much less sophisticated parts, compared to a lock-free linked list with sorting built in.
But, thinking about it further, it does seem like solving the 'stale max value' problem would be non-trivial.
If I were implementing this for a real system, I would side-step that problem by redefining the requirements (which could be seen as cheating).
We can agree that the Out channel is full of stale max values and isn't really well sorted. But it will only hold len(Out) stale values, and the values that replace them as Out drains are reasonably well ordered. At any time we have at most len(Out) out-of-order tasks.
I am thinking of the queue in two different scenarios.
1: Workers are keeping up with new tasks. The queue provides FIFO-ish semantics and no effective priority ordering, but that doesn't matter: if workers are keeping up, priority is irrelevant.
2: Workers are falling behind. The red-black tree fills up once Out is at max capacity, and as Out drains, new tasks come out in (rough) priority order.
Because of the vagueness of those guarantees, you'd need to think hard about whether they are enough. But I would strongly prefer something along these lines (i.e. avoiding a lock-free linked list with built-in sorting) if I thought I could live with it.
What do you think?
Channels end up locking under the hood as well, so from a purity standpoint I wanted to avoid them... if channels are indeed faster, then clearly I need to rework my approach.
Even "lock-free" algorithms can still fail to scale due to contention. This paper gives a good summary:
http://queue.acm.org/detail.cfm?id=2991130
In short, a queue is made up of a linked list, where each entry is itself a circular buffer. Insertions go into a circular buffer... unless there's contention, in which case the next circular buffer is used.
The result is that insertions are mostly uncontended, removals are mostly uncontended, and it scales very well.
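Ignoring the atomics that make the real thing work, the shape is roughly this (a single-threaded structural sketch; the names segQueue/segment are mine, not the paper's, and segment-full stands in for "contention"):

```go
package main

import "fmt"

// Structural sketch only: the real algorithm coordinates producers and
// consumers with atomic fetch-and-add and CAS; this version shows just
// the linked-list-of-ring-buffers layout.
const segSize = 8 // real implementations use much larger segments

type segment struct {
	buf        [segSize]int
	head, tail int      // consume / produce positions within this segment
	next       *segment // following segment in the linked list
}

type segQueue struct {
	head, tail *segment // segments we consume from / produce into
}

func newSegQueue() *segQueue {
	s := &segment{}
	return &segQueue{head: s, tail: s}
}

// push appends into the current tail segment; when it is full (the
// stand-in for contention), a fresh segment is linked on.
func (q *segQueue) push(v int) {
	t := q.tail
	if t.tail == segSize {
		t.next = &segment{}
		q.tail = t.next
		t = q.tail
	}
	t.buf[t.tail] = v
	t.tail++
}

// pop consumes from the head segment, advancing to the next segment
// once the current one is fully consumed.
func (q *segQueue) pop() (int, bool) {
	h := q.head
	if h.head == segSize { // this segment is exhausted
		if h.next == nil {
			return 0, false
		}
		q.head = h.next
		h = q.head
	}
	if h.head == h.tail { // nothing produced here yet
		return 0, false
	}
	v := h.buf[h.head]
	h.head++
	return v, true
}

func main() {
	q := newSegQueue()
	for i := 0; i < 20; i++ {
		q.push(i)
	}
	for v, ok := q.pop(); ok; v, ok = q.pop() {
		fmt.Print(v, " ")
	}
	fmt.Println()
}
```

Because producers only fight over the tail of the current segment, and consumers over the head of theirs, hot spots are confined to one cache line per segment rather than one per queue.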
https://github.com/fmstephe/flib/tree/master/queues/spscq
Those are only single-producer single-consumer (SPSC), so not flexible enough for your needs right now, but I am working on multi-producer/multi-consumer variants, which could be a good fit.
The single-producer/consumer queues reach up to 100 million messages per second (in microbenchmarks), and I expect the multi variants to be above 10-20 million per second.
I also have a hand-built red-black tree that does around 10 million inserts/pops per second:
https://github.com/fmstephe/matching_engine/tree/master/matc...
Although that is very specialised for another purpose, I would be happy to try stripping it down if you wanted. But start with the LLRB in that gist.
That said, it is always nice to see new data-structures and I really wish there were more posts like this. I always like to see the different ways that people try to solve these problems.
Average insertion runtime is constant because the linked list does not grow beyond a certain size, and inserting into the priority queue is constant time. The worst case is not unbounded because there is always at least one goroutine making progress.
Thank you for the comment about wanting to see posts like this :)
I do not think this implementation is flawed, in the sense that at least one goroutine is always making progress, there are no locks, and results are returned in order if dequeues are not saturated.
This is not something that would make it into a white paper, because the implementation does not return deterministic results, and I could not characterise mathematically how far out of order the returned results can be, since they are not guaranteed to be in order.
I should have also included the profiling results. Currently 25% of the time is spent garbage collecting, which is expected because the priority queues effectively use multi-version concurrency control. But this will come in handy when I want to persist data to disk with as little locking as possible (and for general simplicity in avoiding corruption).
Another 15% or so is spent sleeping, which would be the spinning loop part.
Otherwise, I think it would be awesome to see profiling results; I suspect the garbage-collection time can be reduced significantly. I have recently been implementing a locking thread-safe queue, and it would be interesting to compare the speed and garbage produced, just to get a better idea of the pros and cons. I suspect that if you shortened your retry loop, your implementation would be faster.
When I said flawed, I meant that it was not really enforcing order on insertion; but for many applications this doesn't matter much. Close enough is often good enough (and the probability of this going wrong is fairly small, except in extreme cases).
Thanks for posting!
Oh, you meant the comment police from the Ministry of Truth will kill his karma. Yes, they surely will; they always do.
PS: Downvoting below zero is not moderation, it's punishment, and yes, I know they enjoy it.
I notice that in about 80% of the comments in your history you mention how old you are. Maybe you're just getting grumpy?