Here's a very quick alternative:
Push all messages into a channel IN.
A goroutine reads from IN and inserts into a sorted TREE.
The same goroutine reads the max from TREE and pushes it to channel OUT.
Workers read from OUT.
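A minimal sketch of that design (my names, loosely based on the gist; a sorted slice stands in for the red-black tree, and buffer sizes are guesses):

```go
package main

import "fmt"

// timestamp is the priority key; larger means higher priority.
type timestamp int

// Prioq wires the pieces together: an In channel, a sorting
// structure, and an Out channel that workers read from.
type Prioq struct {
	In  chan timestamp
	Out chan timestamp
}

// NewPrioq starts the goroutine that moves values In -> TREE -> Out.
func NewPrioq() *Prioq {
	q := &Prioq{
		In:  make(chan timestamp, 100),
		Out: make(chan timestamp, 100),
	}
	go q.run()
	return q
}

func (q *Prioq) run() {
	var sorted []timestamp // kept ascending; stand-in for the tree
	for v := range q.In {
		// insert v, keeping the slice sorted
		i := 0
		for i < len(sorted) && sorted[i] < v {
			i++
		}
		sorted = append(sorted, 0)
		copy(sorted[i+1:], sorted[i:])
		sorted[i] = v
		// pop the max and push it out; note read and write happen in
		// lockstep here, so this is FIFO-ish rather than strictly sorted
		max := sorted[len(sorted)-1]
		sorted = sorted[:len(sorted)-1]
		q.Out <- max
	}
	close(q.Out)
}

func main() {
	q := NewPrioq()
	q.In <- timestamp(20)
	q.In <- timestamp(10)
	q.In <- timestamp(30)
	close(q.In)
	for v := range q.Out {
		fmt.Println("Read", v)
	}
}
```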
This is a pretty mediocre implementation, but I post it because the numbers in the article were so low; I think we can do better with something far less sophisticated:

https://gist.github.com/fmstephe/4fdc930ff180be3e92c693ad5a2...
I timed it at writing 600,000 messages concurrently and reading 600,000 sequentially in 1.5 seconds. Not a perfect measurement, but I'm on a train and about to get to my stop.
If we need really high performance, we can start substituting faster, and simpler, queues for Go's channels, and we can very likely improve the red-black tree that does the sorting. Each of these components is simpler, more likely to be correct, and, from what I can see, substantially faster than the sophisticated lock-free queue described in the article.
I don't want to come across as snarky, but I am surprised by the popularity of linked-list-based lock-free data structures like this.
EDIT: The tone of my post came across nastier than I intended. I quite enjoyed the article, and I'd be interested in how far the performance can be improved.
For example, the following:
    q := NewPrioq()
    q.In <- timestamp(20)
    q.In <- timestamp(10)
    q.In <- timestamp(30)
    o := <-q.Out
    fmt.Println("Read", o)
prints "Read 20" rather than "Read 30".

Furthermore, this input:
    q := NewPrioq()
    for i := 0; i < 202; i++ {
        q.In <- timestamp(i)
    }
    o := <-q.Out
    fmt.Println("Read", o)
will result in a deadlock.

As far as I can tell, this code behaves like a FIFO queue with capacity 201. Each iteration of the loop in Prioq.run() reads once from the input channel, inserts into the empty tree, removes the "max" element from the single-item tree, and then pushes it into the output channel.
Reading from the input channel and writing to the output channel are performed in lockstep rather than as needed. A correct implementation of this pattern would need a select block to avoid the deadlock, and a way to offer the current max on the output channel only when one is actually available, to avoid the stale-max-value problem.
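A sketch of what that might look like (names are mine, a sorted slice stands in for the tree, and Out is unbuffered so the value handed to a worker is always the current max at hand-off time):

```go
package main

import (
	"fmt"
	"sort"
)

type timestamp int

type Prioq struct {
	In  chan timestamp
	Out chan timestamp
}

func NewPrioq() *Prioq {
	q := &Prioq{
		In:  make(chan timestamp, 256),
		Out: make(chan timestamp), // unbuffered: no stale values sit in a buffer
	}
	go q.run()
	return q
}

func (q *Prioq) run() {
	var sorted []timestamp // ascending; stand-in for the tree
	for {
		var out chan timestamp // nil disables the send case in select
		var max timestamp
		if len(sorted) > 0 {
			max = sorted[len(sorted)-1]
			out = q.Out // only offer a max when we have one
		}
		select {
		case v, ok := <-q.In:
			if !ok {
				// In closed: flush remaining items in priority order
				for i := len(sorted) - 1; i >= 0; i-- {
					q.Out <- sorted[i]
				}
				close(q.Out)
				return
			}
			i := sort.Search(len(sorted), func(i int) bool { return sorted[i] >= v })
			sorted = append(sorted, 0)
			copy(sorted[i+1:], sorted[i:])
			sorted[i] = v
		case out <- max:
			sorted = sorted[:len(sorted)-1]
		}
	}
}

func main() {
	q := NewPrioq()
	for i := 0; i < 202; i++ {
		q.In <- timestamp(i) // no deadlock: run keeps draining In
	}
	close(q.In)
	n := 0
	for range q.Out {
		n++
	}
	fmt.Println("Read", n, "values")
}
```

The nil-channel trick is what decouples reads from writes: when the tree is empty the send case is disabled, so the goroutine can never block trying to offer a value it doesn't have.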
But the purpose of that code was just to indicate the rough performance that was possible using much less sophisticated parts, compared to a lock-free linked list with sorting built in.
But, thinking about it further, it does seem like solving the 'stale max value' problem would be non-trivial.
If I were implementing this for a real system, I would side-step that problem by redefining the requirements (which could be seen as cheating).
We can agree that the Out channel is full of stale max values and isn't really well sorted. But it will only hold len(Out) stale values, and the values that replace them as Out drains are reasonably well ordered. At any time we have at most len(Out) out-of-order tasks.
I am thinking of the queue in two different scenarios.
1: Workers are keeping up with new tasks. The queue provides FIFO-ish semantics and no effective priority ordering, but that doesn't matter: if workers are keeping up, priority is irrelevant.
2: Workers are falling behind. The red-black tree fills up once Out is at max capacity, and as Out drains, new tasks come out in (rough) priority order.
Because of the vagueness of those guarantees, you'd need to think hard about whether they are enough. But I would strongly prefer something along these lines (i.e. avoiding a lock-free linked list with built-in sorting) if I thought I could live with it.
What do you think?
Channels end up locking under the hood as well, so from a purity standpoint I wanted to avoid them... if channels are indeed faster, then clearly I need to rework my approach.
Even "lock-free" algorithms can still fail to scale due to contention. This paper gives a good summary:
http://queue.acm.org/detail.cfm?id=2991130
In short, a queue is made up of a linked list, where each entry is itself a circular buffer. Insertions go into a circular buffer... unless there's contention, in which case the next circular buffer is used.
The result is that insertions are mostly uncontended, removals are mostly uncontended, and it scales very well.
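Ignoring the atomics that make the real thing work, the shape is roughly this (a single-threaded structural sketch; the names segQueue/segment are mine, not the paper's, and segment-full stands in for "contention"):

```go
package main

import "fmt"

// Structural sketch only: the real algorithm coordinates producers and
// consumers with atomic fetch-and-add and CAS; this version shows just
// the linked-list-of-ring-buffers layout.
const segSize = 8 // real implementations use much larger segments

type segment struct {
	buf        [segSize]int
	head, tail int      // consume / produce positions within this segment
	next       *segment // following segment in the linked list
}

type segQueue struct {
	head, tail *segment // segments we consume from / produce into
}

func newSegQueue() *segQueue {
	s := &segment{}
	return &segQueue{head: s, tail: s}
}

// push appends into the current tail segment; when it is full (the
// stand-in for contention), a fresh segment is linked on.
func (q *segQueue) push(v int) {
	t := q.tail
	if t.tail == segSize {
		t.next = &segment{}
		q.tail = t.next
		t = q.tail
	}
	t.buf[t.tail] = v
	t.tail++
}

// pop consumes from the head segment, advancing to the next segment
// once the current one is fully consumed.
func (q *segQueue) pop() (int, bool) {
	h := q.head
	if h.head == segSize { // this segment is exhausted
		if h.next == nil {
			return 0, false
		}
		q.head = h.next
		h = q.head
	}
	if h.head == h.tail { // nothing produced here yet
		return 0, false
	}
	v := h.buf[h.head]
	h.head++
	return v, true
}

func main() {
	q := newSegQueue()
	for i := 0; i < 20; i++ {
		q.push(i)
	}
	for v, ok := q.pop(); ok; v, ok = q.pop() {
		fmt.Print(v, " ")
	}
	fmt.Println()
}
```

Because producers only fight over the tail of the current segment, and consumers over the head of theirs, hot spots are confined to one cache line per segment rather than one per queue.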
https://github.com/fmstephe/flib/tree/master/queues/spscq
Those are only single-producer single-consumer (SPSC), so not flexible enough for your needs right now, but I am working on multi-producer/multi-consumer variants, which could be a good fit.
The single-producer/consumer queues reach up to 100 million messages per second (in microbenchmarks), and I expect the multi variants to be above 10-20 million per second.
I also have a hand-built red-black tree that does around 10 million inserts/pops per second:
https://github.com/fmstephe/matching_engine/tree/master/matc...
Although that is very specialised for another purpose, I would be happy to try stripping it down if you wanted. But start with the LLRB in that gist.
That said, it is always nice to see new data-structures and I really wish there were more posts like this. I always like to see the different ways that people try to solve these problems.
Average insertion runtime is constant because the linked list does not grow beyond a certain size, and inserting into the priority queue is constant time. The worst case is not unbounded because there is always at least one goroutine making progress.
Thank you for the comment about wanting to see posts like this :)
I do not think this implementation is flawed, in the sense that at least one goroutine is always making progress, there are no locks, and results are returned in order if dequeues are not saturated.
This is not something that would make it into a white paper, because the implementation does not return deterministic results, and I could not characterise mathematically how far out of order the returned results can be, since they are not guaranteed to be in order.
I should have also included the profiling results. Currently 25% of the time is spent garbage collecting, which is expected because the priority queues effectively use multi-version concurrency control. But this will come in handy when I want to persist data to disk with as little locking as possible (and for general simplicity in avoiding corruption).
Another 15% or so is spent sleeping, which would be the spinning loop part.
Otherwise, I think it would be awesome to see profiling results; I suspect the garbage-collection time can be reduced significantly. I have recently been implementing a locking thread-safe queue, and it would be interesting to compare the speed and garbage produced, just to get a better idea of the pros and cons. I suspect that if you shortened your retry loop, your implementation would be faster.
When I said flawed, I meant that it was not really enforcing order on insertion; but for many applications this doesn't matter much. Close enough is often good enough (and the probability of this going wrong is fairly small, except in extreme cases).
Thanks for posting!
Oh, you meant the comment police from the Ministry of Truth will kill his karma. Yes, they surely will; they always do.
PS: Downvoting below zero is not moderation, it's punishment, and yes, I know they enjoy it.
I notice that in about 80% of the comments in your history you mention how old you are. Maybe you're just getting grumpy?