Mutexes are faster than Spinlocks (opens in new tab)

(matklad.github.io)

373 pointserickt6y ago145 comments

145 comments

twic6y ago

The author has an implicit definition of "faster" which it is important to be aware of.

The main use of spinlocks that i'm aware of is minimising latency in inter-processor communication.

That is, if you have a worker task which is waiting for a supervisor task to tell it to do something, then to minimise the time between the supervisor giving the order and the worker getting to work, use a spinlock.

For this to really work, you need to implement both tasks as threads pinned to dedicated cores, so they won't be preeempted. You will burn a huge amount of CPU doing this, and so it won't be "faster" from a throughput point of view. But the latency will be as low as it's possible to go.

simias6y ago

>The main use of spinlocks that i'm aware of is minimising latency in inter-processor communication.

The main use of spinlocks that I'm aware of is dealing with interrupt handlers in driver code. In this situation you generally can't go to sleep (sleeping with interrupts disabled is generally a good way of never waking up) so calling "mutex_lock" is simply out of the question. That's probably a niche use but that's literally the only situation I've ever used actual spin locks instead of mutexes, mainly for the reasons outlined by TFA.

lmilcin6y ago

I did some algorithmic/ high frequency trading and we were only spinning waiting for task and never, ever releasing to OS. But then the entire server was dedicated to one job and the job consisted of reacting to single stream of messages.

Most of the time I would say if lock performance impacts your application you are likely doing something wrong. There are many ways of solving typical synchronization problems without locks and if you need locks there are many ways to amortize costs.

Mutexes are not faster than spin locks the same way slower cars are not faster than fast cars. You might still crash in a fast car and be slower to the end of the race (to the supermarket) but that is just your failure to use the power you were given responsibly.

vba6y ago

TFA? Tried to google but unsuccessful

NeedMoreTea6y ago

I've seen lots of valid variants, including and probably not limited to:

The feature article ; the freaking article ; the friendly article ; the f$%^&*( article, c.f. RTFM.

1 more reply

axaxs6y ago

The (Fuc|Frea)king Article. In a similar spirit to RTFM.

hyperion20106y ago

The aForementioned Article

the F is usually something else though ...

2 more replies

loeg6y ago

The Fine Article.

wgjordan6y ago

The (forementioned) article, I would imagine.

thatsenough6y ago

The Fabulous Article.

ddebernardy6y ago

TFA is Slashdot jargon for "The F'ing Article".

1 more reply

rcxdude6y ago

And this only makes sense in a multi-core system (where the spinlock is held on another core). In a single core system if you are trying to take a contended lock in an interrupt you are stuck and can only fail to take it and hope to handle that gracefully.

trasz6y ago

And even that is only true for systems that don’t use interrupt threads.

sillysaurusx6y ago

Ya’ll should consider using atomic increment on separate cache lines instead of spinlocks. If you want to minimize latency to the bare minimum, atomic increment gives you two orders of magnitude measurable improvements over locks. https://lmax-exchange.github.io/disruptor/files/Disruptor-1....

rcxdude6y ago

They solve different problems. A lock-free datastructure can be much fast than one with locks, but they depend on exactly what operations you can do fast and atomically (for example, the LMAX queue length is limited to a power of 2 because there's no way to atomically increment an integer which will wrap at arbitrary values, and the modulo operator is too expensive for non-power of two values). A lock is a primitive which allows you to make arbitrary operations 'atomic' (though in these benchmarks generally a simple increment is used). A lock-free queue may be a good replacement for a queue using locks, but this is only in one application of locks.

Also, the disruptor design is based on a system where there is a 1:1 correspondance between threads and cores. If this is not the case it will likely still interact somewhat poorly with the OS's scheduler without any blocking at all, for the same reason as spinlocks (in fact, if you use the library you can customize this behaviour).

extropy6y ago

As far as I'm aware that's exactly how you would implement a spinlock.

Random Google search seems to validate that.

https://stackoverflow.com/questions/1383363/is-my-spin-lock-...

sillysaurusx6y ago

Compare-and-swap isn't quite the same as atomic increment. An atomic increment can't fail; it always increments. Whereas CAS will fail if a different thread has modified the value.

The highest performance design is to use a ringbuffer. To write to the ringbuffer, you atomic-increment the "claim" counter, giving you an index. Now take index modulo ringbuffer size. That's the slot to write to. After you're done writing to it, set your "publish" counter to the index.

Each writer has a "publish" counter, and the "claim" counter isn't allowed to move beyond the lowest "publish" counter modulo ringbuffer size.

Each reader uses a similar strategy: the reader has a current counter. You find the minimum of all publish counter values, then process each slot up to that value, and set your counter to that value. The "claim" counter isn't allowed to move past the minimum of all reader counters.

Hence, everyone is reading from / writing to separate cache lines, and there are no locks at all. The sole primitive is atomic increment, which can never fail. (The only failure condition is "one of the slots hasn't been processed yet" (i.e. the publish counter is <= claim counter minus ringbuffer size) at which point everybody waits for the slot to be processed.

You can wait using multiple strategies: a spin loop (which isn't the same as a spinlock because you're only reading a value in memory, not calling any atomic primitives), yielding CPU time via sched_yield(), or calling sleep. All strategies have tradeoffs. Calling sleep is the least CPU intensive, but highest latency. Vice-versa for spin loop.

Takeaway: there are no locks. Just increments.

1 more reply

gpderetta6y ago

Also if both threads are pinned to separate cores and nothing else is supposed to run on those cores, it is pointless to use anything but spinlocks as there is no other thread that could better use the core (and probably you do not want the core to go to a low power syate waiting for an interrupt).

jdblair6y ago

You're discounting energy use. This is a bad strategy on a battery powered device.

sephamorr6y ago

In addition to energy, power use is another reason; parking a core will allow it to cool down thermally, so that when it is put back in use (milli)seconds later, it can run at a higher clock speed for longer.

corysama6y ago

Intel has a low-power PAUSE instruction that is literally a ‘rep nop’. I assume Arm has one too.

1 more reply

gpderetta6y ago

Yes, you wouldn't use this strategy on a battery powered device. It ia for very specialised applications.

rumanator6y ago

> and nothing else is supposed to run on those cores

That's quite the corner case.

titzer6y ago

This is exactly the situation for a well-balanced parallel work queue. You want to start as many threads as there are cores and run them full tilt pulling work off the queue until it is empty. If you're running a large scale cluster that is dedicated to a particular task (e.g. like servicing a special kind of query, or encoding videos, rendering, etc), this is very common, or even a parallel Photoshop filter.

1 more reply

shaklee36y ago

This is the normal use case for any DPDK software. I think anyone involved in HPC or high-speed networking knows that this is pretty common.

gpderetta6y ago

Yes, in practice you have to dedicate the whole machine for a specific application, but the one thread per isolated core is a proven one for high performance/low latency applications.

Snelius6y ago

Absolutely! spinlocks is an inter-processor mechanism and mutex inter-processes. Principal difference.

jakemal6y ago

What is a use case for optimizing interprocessor latency?

anonymoushn6y ago

Hello,

I wrote some software at an automated trading firm that would do the subset of "parsing ticks and maintaining order books" sufficient to map ticks to symbols, and no more work than that. My code was pinned to one core and spun waiting for an expensive network card to DMA some ticks into memory, then when ticks we cared about were found it would load them onto one or more queues. Each queue corresponded to a different thread spin waiting on its own pinned core.

So one use case is "transforming ticks into trading decisions slightly faster than you otherwise could"

twic6y ago

Comment-OP here - this is also more or less my use case.

Whether there is a use case for spinlocks outside low-latency trading, i have no idea!

3 more replies

dfox6y ago

Essentialy any true HPC workload. You can go pretty far by ignoring latency and doing some series of map and reduce stages on top of giant conceptual file with large-ish records. There are workloads where data for these partial parallel jobs are on the order of machine word or two and then you really care about communication latency between cores.

brobinson6y ago

Check out DPDK/VPP and the LMAX Disruptor paper.

shitgoose6y ago

And do not forget Chronicle Queue.

lonelappde6y ago

Finishing a high priority job sooner.

monocasa6y ago

Like TLB shootdown IPIs.

nathanielherman6y ago

This experiment is a bit weird. If you look at https://github.com/matklad/lock-bench, this was run on a machine with 8 logical CPUs, but the test is using 32 threads. It's not that surprising that running 4x as many threads as there are CPUs doesn't make sense for spin locks.

I did a quick test on my Mac using 4 threads instead. At "heavy contention" the spin lock is actually 22% faster than parking_lot::Mutex. At "extreme contention", the spin lock is 22% slower than parking_lot::Mutex.

Heavy contention run:

  $ cargo run --release 4 64 10000 100
      Finished release [optimized] target(s) in 0.01s
      Running `target/release/lock-bench 4 64 10000 100`
  Options {
      n_threads: 4,
      n_locks: 64,
      n_ops: 10000,
      n_rounds: 100,
  }

  std::sync::Mutex     avg 2.822382ms   min 1.459601ms   max 3.342966ms  
  parking_lot::Mutex   avg 1.070323ms   min 760.52µs     max 1.212874ms  
  spin::Mutex          avg 879.457µs    min 681.836µs    max 990.38µs    
  AmdSpinlock          avg 915.096µs    min 445.494µs    max 1.003548ms  

  std::sync::Mutex     avg 2.832905ms   min 2.227285ms   max 3.46791ms   
  parking_lot::Mutex   avg 1.059368ms   min 507.346µs    max 1.263203ms  
  spin::Mutex          avg 873.197µs    min 432.016µs    max 1.062487ms  
  AmdSpinlock          avg 916.393µs    min 568.889µs    max 1.024317ms

Extreme contention run:

  $ cargo run --release 4 2 10000 100
      Finished release [optimized] target(s) in 0.01s
      Running `target/release/lock-bench 4 2 10000 100`
  Options {
      n_threads: 4,
      n_locks: 2,
      n_ops: 10000,
      n_rounds: 100,
  }

  std::sync::Mutex     avg 4.552701ms   min 2.699316ms   max 5.42634ms   
  parking_lot::Mutex   avg 2.802124ms   min 1.398002ms   max 4.798426ms  
  spin::Mutex          avg 3.596568ms   min 1.66903ms    max 4.290803ms  
  AmdSpinlock          avg 3.470115ms   min 1.707714ms   max 4.118536ms  

  std::sync::Mutex     avg 4.486896ms   min 2.536907ms   max 5.821404ms  
  parking_lot::Mutex   avg 2.712171ms   min 1.508037ms   max 5.44592ms   
  spin::Mutex          avg 3.563192ms   min 1.700003ms   max 4.264851ms  
  AmdSpinlock          avg 3.643592ms   min 2.208522ms   max 4.856297ms

hinkley6y ago

The top comment opens up the concept of latency versus throughput. My interpretation is that this experiment is demonstrating that optimizing only for latency has consequences elsewhere in the system. Which is not surprising at all, but then again I spend a lot of time explaining unsurprising things.

I remember a sort of sea change in my thinking on technical books during a period where I tended to keep them at work instead of at home. I noticed a curious pattern in which ones were getting borrowed and by whom. Reading material isn't only useful if it has something new to me in it. It's also useful if it presents information I already know and agree with, in a convenient format. Possibly more useful, in fact.

rwem6y ago

If you only have 4 threads it is likely that all your CPUs are sharing caches and you won't see the real downside of the spinlock. They don't really fall apart until you have several sockets.

nathanielherman6y ago

Note that I get a similar speedup with 6 and 8 threads on my Mac (which has 8 logical CPUs)

ww5206y ago

Modern day user-mode pthread mutex uses the futex kernel syscall [1] to implement the lock, which avoids the syscall in the non-contended case, so it can be very fast, acquiring the lock in the user-mode code entirely. I'm not sure whether the Rust's mutex API is a wrapper on the pthread mutex or calling the old kernel's mutex syscall directly.

Basically the user mode mutex lock is implemented as:

    // In user-mode, if the lock flag is free as 0, lock it to 1 and exit
    while !atomic_compare_and_swap(&lock, 0, 1)
        // Lock not free, sleeps until the flag is changed back to 0
        futex_wait(&lock, 0)  // syscall to kernel

When futex_wait() returns, it means the flag has been set back to 0 by the other thread's unlock, and the kernel wakes my thread up so I can check it again. However, another thread can come in and grab the lock in the meantime, so I need to loop back to check again. The atomic CAS operation is the one acquiring the lock.

[1] https://github.com/lattera/glibc/blob/master/nptl/pthread_mu...

Edit: the atomic_compare_and_swap can be just a macro to the assembly CMPXCHG, so it's very fast to acquire a lock if no one else holding the lock.

bluejekyll6y ago

https://github.com/rust-lang/rust/blob/master/src/libstd/sys...

Looks like it’s relying glibc for lock elision.

Mind you the parking_lot::Mutex in the article is not the stdlib implementation, documented here: https://docs.rs/parking_lot/0.10.0/parking_lot/type.Mutex.ht...

And that looks like it’s not using pthread, instead relying on primitives: https://github.com/Amanieu/parking_lot/blob/master/src/raw_m...

ww5206y ago

Then it's using the futex implementation. It's very efficient.

bluejekyll6y ago

I updated my comment. The fastest impl, parking_lot, appears to be a ground up atomic based mutex that doesn’t rely on pthread at all.

1 more reply

sharken6y ago

This is pretty much the conclusion in this game related post on spinlocks: https://probablydance.com/2019/12/30/measuring-mutexes-spinl...

jakswa6y ago

Oh wow/yikes, Linus Torvalds commented on that recently: https://www.realworldtech.com/forum/?threadid=189711&curpost...

"So you might want to look into not the standard library implementation, but specific locking implentations for your particular needs. Which is admittedly very very annoying indeed. But don't write your own. Find somebody else that wrote one, and spent the decades actually tuning it and making it work.

Because you should never ever think that you're clever enough to write your own locking routines.. Because the likelihood is that you aren't (and by that "you" I very much include myself - we've tweaked all the in-kernel locking over decades, and gone through the simple test-and-set to ticket locks to cacheline-efficient queuing locks, and even people who know what they are doing tend to get it wrong several times).

There's a reason why you can find decades of academic papers on locking. Really. It's hard."

hinkley6y ago

We've been arguing about concurrency primitives literally for decades and the 'worst' part is that for most of that time, all of the competing solutions were documented by the same individual - Tony Hoare - within a narrow period in the early 1970's. Soon that will be 50 years ago. Watching people argue is like the People's Liberation Front of Judea scene in Life of Brian. As far as I know, borrow checking may be the first real change in that arena in decades, and I bet even that is older than most of us know.

I was getting heckled recently about my allergy to trying to negotiate interprocess consensus through the filesystem. I've seen similar conversations about how hard it is to 'know' the state of files, especially in a cross platform or cross filesystem way (see also the decade old fsync bug in PostgreSQL we were talking about early last year). In our case several dozen machines all have to come to the same decision at the same time (because round robin) and I was having none of it. I eventually had to tell him, in the nicest way possible, to fuck off, I'm doing this through Consul.

The thing is that people who generally don't learn from their mistakes absolutely do not learn from their old mistakes. So for any bug they introduce (like a locking problem) that takes months or quarters to materialize, they will not internalize any lessons from that experience. Not only wasn't I gonna solve the problem the way he wanted, but if he tried to take over it, we'd have broken code at some arbitrary future date and he'd learn nothing. He could not understand why I in particular and people in general were willing to die on such a hill.

Anger is not the best tool for communication, but as someone once put it, it's the last signal you have available that your boundaries are being violated and this needs to stop. Especially if you're the one who will have to maintain a bad decision. As often as I critique Linus for the way he handles sticky problems, on some level he is not wrong.

ncmncm6y ago

Back around 2012 I worked with a guy, a FreeBSD kernel committer, who insisted volatile was sufficient as a thread synchronization primitive. He convinced our boss.

1 more reply

0x0aff3746686y ago

This is a great example of how so many programmers think they are the first to discover something that is "ancient". In reality this is a problem older than Linux (linux just had the advantage of a huge and diverse open-source userbase from which to draw data). On the flipside, lots of good information takes years to go from discussion groups to books, so it can be hard to find relevant research. Applied compsci is literally the cutting edge of reality.

dcolkitt6y ago

> Second, the uncontended case looks like > Parking_lot::Mutex avg 6ms min 4ms max 9ms

This estimate is way too high for the uncontested mutex case. On a modern Linux/Xeon system using GCC, an uncontested mutex lock/unlock is well under 1 microsecond.

I have a lot of experience here from writing low-latency financial systems. The hot path we use is littered with uncontested mutex lock/unlock, and the whole path still runs under 20 microseconds. (With the vast majority of that time unrelated to mutex acquisition.)

The benchmark used in the blog post must be spending the vast majority of its time in some section of code that has nothing to do with lock/unlock.

gpm6y ago

You're misreading the benchmark, that's 6ms for 10,000 lock/unlocks per thread, 320,000 lock/unlocks total. In other words 0.6 microseconds per thread per lock.

rwem6y ago

That's still unreasonably high, isn't it? Even a Go sync.Mutex, not exactly a hot-rod implementation, can be acquired and released in < 50ns on the garbage hardware I have before me.

gpderetta6y ago

On Intel (and probably very similar on AMD) the cost of a completely uncontented, cache hit, simple spin lock acquisition is ~20 clock cycles while the release is almost free.

atq21196y ago

This is the time for whole benchmark run, not for an individual lock/unlock. The article is quite clear on that.

ncmncm6y ago

As we say in low-latency finance, "a microsecond is an eternity."

If you have threads interacting, whether via mutexes or spinlocks, you have a high-latency system.

kcolford6y ago

This absolutely makes sense in userspace. The most important part of a spinlock in an OS is that you can yield to the scheduler instead of taking up CPU time on the core. But that defeats the purpose of using a spinlock when in userspace because you still have to syscall

temac6y ago

A spinlock in the kernel typically only spins. You use it for the cases when you can't schedule... So it better will be for a short time only.

But the concept of short time does not even exist deterministically in userspace, usually, because it can always be preempted. So don't use pure spinlocks in userspace, unless you really really know what you are doing (and that includes knowing how your kernel works in great details, in the context of how you use it).

extropy6y ago

How about a new opcode wait till memory address read equals? That would allow implementing a power efficient spinlock.

Oh there is one already. Meet PAUSE: https://www.felixcloutier.com/x86/pause

Edit: related post from 2018 https://news.ycombinator.com/item?id=17336853

dbaupp6y ago

The benchmarked spin-locks are using it, via https://doc.rust-lang.org/std/sync/atomic/fn.spin_loop_hint....

Implementation: https://doc.rust-lang.org/src/core/hint.rs.html#64-93

jnordwick6y ago

MONITOR/MWAIT will get you the part where a thread will pause and the cpu can rest until a write on a store on an address range, then the waiting thread is allowed to continue. You can't wait on a specific value though.

gok6y ago

Note that all of the locks tested here are unfair, which is why they all show very high waiting variance. Until recently many mutex implementations aimed for fairness, which made them much slower than spinlocks in microbenchmarks like this.

Diggsey6y ago

Actually, the parking_lot mutex is fair: https://docs.rs/parking_lot/0.10.0/parking_lot/type.Mutex.ht...

The high waiting variance is because the benchmark randomly decides which locks to take, meaning that the amount of contention is variable.

sanxiyn6y ago

Note that fair locking is still expensive. parking_lot achieves both speed and fairness by starting with unfair locking and falling back to fair locking.

matklad6y ago

Benchmark uses fixed seeds, there’s no inherent randomness.

jnordwick6y ago

This comes around every so often, and it isn't very interesting in that the best mutexes basically spins 1 or a couple times then falls back to a lock. It isn't a true pure spinlock vs pure lock (mutex/futex) fight.

I think the linux futex can be implemented through the VDSO (can somebody correct me on this), so that eliminates the worse of the sycall costs.

His benchmark is weird, but maybe I'm reading it wrong:

* Instead of reducing thread count to reduce contention he appears to increase the number of locks available. This is still a bad scenario for spinlocks since they will still have bad interactions with scheduler (they will use a large amount of cpu time when and get evicted from the run queue and need to be rescheduled).

* Also, I didn't see him pin any of the threads, so all those threads will start sharing some cpus since the OS does need to be doing some work on a them too.

* And Rust can be a little hard to read, but it seem he packed his locks on the same cache line? I don't see any padding in his AmdSpinlock struct. That would be a huge blow for the spinlock because of the false sharing issues. He's getting all the cache coherence traffic still because of it.

The worst cases for the spinlock are not understanding scheduling costs and the cache thrashing that can occur.

What they call the AMD spinlock (basically just a regular sane spinlock that tries to prevent cache thrashing) has its best performace with a low number of threads, assigned to different cores under the same L3 segment.

(Does anybody know if AMD's new microarchitecture went away from the MOESI/directory based cache towards Intel's MESIF/snoop model?)

The MOESI model might have performed better in this regard under worse case scenario since it doesn't need to write the cache line back and can just forward around the dirty line as it keeps track of who owns it.

And if you run under an MESIF-based cache and you can keep your traffic local to your L3 segment, you are backstopped there and never need to go anywhere else.

A spinlock is a performance optimization and should be treated as one. You need to have intimate knowledge of the architecture you are running under and the resources being used.

(edit: answered my own question, apparently the vdso is still pretty limited in what it exports, so no. it isn't happening at this time from what i can tell.)

atq21196y ago

The locks should be on separate cachelines, that's what the CachePadded::new is for.

futex cannot be implemented in the VDSO since it needs to call into the scheduler.

Another way to think about this: VDSO is used for syscalls that are (mostly) read-only and can avoid the kernel mode switch on a common fast path. The futex syscall is already the raw interface that is designed on the assumption that the caller only uses it in the slow path of whatever higher-level synchronization primitive they're implementing, so trying to use VDSO tricks to implement futex would be redundant.

jnordwick6y ago

> The locks should be on separate cachelines, that's what the CachePadded::new is for

I see that now. I looking for it in the AmdSpinlock struct, but that kind of makes sense.

> The futex syscall is already the raw interface that is designed on the assumption that the caller only uses it in the slow path of whatever higher-level synchronization primitive they're implementing, so trying to use VDSO tricks to implement futex would be redundant.

Ah. Thanks. I didn't know how far you could get with MWAIT, but I guess you still need to deschedule. I also didn't realize futex was a direct syscall and there was no user level api going on around it.

Is he running 32 threads even in the low contention case? And not pinning? There's something about his numbers that just seem a little too high for what I would expect. I've seen this around a lot, and the reason the mutex usually wins is that is basically does a spin of 1 or more then goes into a mutex (the pthread code appers to spin 100 times before falling back to a futex).

At work I use a spin lock on a shared memory region because it does test out to be lower latency than std::mutex and we're not under much contention. I've though about replacing it with a light futex-based library, but doesn't seem to be quicker.

He still seems to be getting some contention, and I'm trying figure out how.

gpderetta6y ago

> linux futex can be implemented through the VDSO (can somebody correct me on this), so that eliminates the worse of the sycall costs.

The semantic of a futex wait is a request to the kernel to put the thread to sleep until another thread issues a signal for the same "key".

The trick is that the key is the address of a 32 bit memory location (provided to the kernel in both the signal and wait syscalls), which is normally the mutex address.

So a process first tries to acquire the mutex as for a normal spin lock (with a CAS, exchange, xadd or whatever work for the specific algorithm) attempting to set the lock bit to 1; if it fails (possibly after spin-trying a few times), it invokes the futex wait. As you can see the syscall is only done in the slow path.

On a futex release, the simple solution is to always invoke futex signal syscall after setting the lock bit to zero[1]. To fast path the relase, a wait bit can be set on the acquire path just before calling futex wait, so on the release path, when setting the lock bit to zero, signal would only be called if the waiter bit was set (and the cleared by together with the lock bit)[2].

As you can see the futex syscall is already the slow path and never need to be onvoked in the uncontented case. In fact the kernel doesn't even need to know about the futex untill the first contention.

[1] futexes are edge triggered, same as condition variables, so a signal will only wakes any thread blocked ona wait that happened-before the signal call. Thus there is a race condition if a lock rrlease and signal happens between the failed acquire attempt and the wait call; to prevent this futex wait as an additional parameter that is the expected value of the memory location: the kernel checks the futex address against this valur and will only if it hasn't changed will put the thread to sleep (this is done atomically).

[2] as there could me multiple waiters, a broadcast is actually needed here which leads to a thundering herd as all waiters will race to acquire the lock. The original futex paper used a wait count instead of just a bit but there are other options.

any16y ago

According to "man vdso", futex is not in vDSO on any architecture. What eliminates the syscall cost is that it's rarely called due to elision.

A mutex may try to spin more than a "couple of times" before it calls futex. Example: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_...

rrss6y ago

> is that it's rarely called due to elision.

Calling this 'elision' is a bit confusing, since the glibc developers use the term 'lock elision' to refer to hardware lock elision via Intel TSX extensions.

What eliminates the syscall cost is optimistic spinning in userspace, so that locking only does futex_wait in cases of contention when the lock isn't released 'soon' after the thread starts trying to acquire it.

hackworks6y ago

Not an expert here.

In a spin lock, the lock state is checked in a tight loop by all waiters. This will be using some sort of memory fence. FWIK, memory fence or barriers flush the CPU cache and would initiate reading the variable (spin lock state) for evaluation. I would expect spin locking overheads to increase with number of cores.

On NUMA, I think flushing is more expensive. Hence, spin locks have an additional overhead of having to load and evaluate on every spin as against being woken up for mutexes (like a callback)

oppositelock6y ago

Yes, spin locks require atomic memory operations - as do many other OS primitives. CPUs have dedicated instructions to do this more quickly for locks than generalized coherent memory, for example cmpxchg on x86. There are microarchitectural optimizations to make these things work well.

I've been writing MP code since the early 90's; SPARC, MIPS, x86, PPC, SH-4, Alpha, and all have instructions to make this easier for you.

Spin locks are very useful for locks held for short durations, and only on MP systems, particularly in the kernel, since a spin lock there can deadlock the whole system on a single CPU machine. The very general rule of thumb is that you generally want blocking locks, unless there is a very specific reason to use a spin lock. Those very specific reasons might be waiting briefly on some device in a driver, or running a realtime system where wake latency is critical, etc. Most of the time, spin locks will burn CPU for no good reason, and you still have to implement the non-spinlock case anyway, in case you are running on a 1 CPU system. So, might as well start with blocking locks, and only add spin locks where performance would benefit from it.

rwem6y ago

It doesn’t flush the entire cache (that would be a disaster) but it does shoot down the cache line containing the lock in all cores other than the one that acquired the lock.

The real issue with spin locks is fairness. There’s no assurance that any given thread will ever make progress. A thread could starve forever. Production-ready mutexes like absl::Mutex make efforts toward fairness, even if they don’t have hard guarantees.

the84726y ago

> but it does shoot down the cache line containing the lock in all cores other than the one that acquired the lock.

Well, as long as you do the test-CAS instead of the pure-CAS approach not every loop iteration results in cache line bouncing.

Plus intel has introduced the MWAIT[0] instruction to implement something similar to futex in hardware, i.e. the hyperthread can sleep until another core updates the cacheline in question.

[0] https://www.felixcloutier.com/x86/mwait

atq21196y ago

That's true, though MWAIT can only be used from kernel mode. At least the docs say that you get a #UD exception if you attempt to use it from user mode.

1 more reply

xtacy6y ago

Aren't most modern spinlock implementations are ticket-based, which ensures fairness among waiters (FIFO-like)? The linux kernel implementations most definitely are.

I agree that the naive implementations are not.

temac6y ago

Completely fair locks can have extremely terrible throughput performance, so depending what you are doing with them, they might not be a good idea...

elteto6y ago

> checked in a tight loop by all waiters

This actually does not have to be this way. You could have a linked list of spinlocks, one for each waiter. Each waiter spins on its own, unique spinlock. When the previous waiter is done it unlocks the next spinlock, and so on. The implementation gets a bit complicated on non-GC languages, since there are races between insertion/removal on the linked list. If the number of threads is fixed and long-lived then it becomes easier, since instead of a linked list you can have an array of spinlocks.

Note: in some architectures (x86?) you could possibly do away with atomics, since (I believe) int updates are atomic by default. Not really sure though.

xtacy6y ago

Yep, that's the underlying principle behind Mellor-Crummey Spinlock: https://lwn.net/Articles/590243/.

elteto6y ago

Yes! The MCS spinlock, the name eluded me. The paper is actually a pretty good read (it is linked in the lwn article).

gpderetta6y ago

> you could possibly do away atomics, since (I believe) int updates are atomic by default.

The release can be a compiler only barrier followed by a simple store, but you do need an atomic RMW in the acquire path.

It is technically possible to implement a lock with just loads and stores (see Peterson lock[1]) even in the acquire path but it does require a #StoreLoad memory barrier even on Intel, which is as expensive as an atomic RMW so it is not worth it.

Edit:fixed barrier name typo.

[1] https://en.m.wikipedia.org/wiki/Peterson%27s_algorithm

temac6y ago

> #StoreRelease memory barrier even on Intel

What is that? x86 is TSO...

Do you have an example of the full acquire and release sections in x86 assembly?

1 more reply

hackworks6y ago

NetApp has a very interesting implementation RW spin lock inside the kernel. I tried optimizing to make it more fair for writers by introducing a single variable that would be check in each iteration.

The additional delay it added for checking the state resulted in deadlock timer triggering a dump! Hence, checking a variable is very expensive.

mareko6y ago

If you're interested in very efficient MCS style reader writer locks, check out this paper:

https://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.p...

lowbloodsugar6y ago

TFA makes the point that modern "mutex" implementations actually use spinlocks first and only fall back to heavy, kernel Mutexes if there is contention. So the title is click-baity. Mutexes are slower than spinlocks. The "faster than spinlocks" mutexes in this article are actually spinlocks that fallback to mutexes.

Then the benchmark uses spinlocks in situations that spinlocks aren't great for. And, surprise, spinlocks are slower, than spinlocks-with-mutexes.

Spinlocks are great in situations such as:

  * There are far more resources than threads,
  * The probability of actually having to spin is small,
    ideally if the time spent in the lock is a few instructions
  * When you can't use optimistic concurrency*

* because perhaps the number of memory locations to track is too complicated for my poor brain and I can't be arsed to remember how TLA+ works

There's plenty of linked list implementations, for example, that use optimistic concurrency. At that point you've got yourself a thread-safe message queue and that might be better than mutexes, too.

jakswa6y ago

Coming as a ruby developer who dabbles in async rust in my free time, these posts/threads/links have been my best read of 2020 so far. My CS curriculum barely covered real-world lock usage, much less the ties to linux/windows schedulers + modern CPU caching. Big thanks all around to these internet commentators.

Ericson23146y ago

As the person that added a bunch of functionality to spin-rs making it roughly match the std API, yes you should not use spinlocks in scheduled code.

That said, I see why Rust makes things so annoying. I want lots of code to work in no-std so have a robust embedded and kernel ecosystem. It would be really nice to abstract over the locking implementation with type parameters, but that required "higher kinded types" which Rust doesn't have. The alternative is relying on carefully coordinated Cargo features, which is much more flaky and hard to audit (e.g. with the type system). Given that, I am not sure people over-use spin locks.

lilyball6y ago

I'm really curious how macOS's os_unfair_lock compares here.

LeoNatan256y ago

Is that implementation open source? Don’t remember which dyld contains it.

gok6y ago

https://github.com/apple/darwin-libplatform/blob/master/src/...

AnanasAttack6y ago

Thread scheduling (or waking up cores) is slow. Because of this, mutexes will look better on dumb benchmarks, as the contending threads keep going to sleep, while the single succesful owner has practically uncontended access

AstralStorm6y ago

There are various degrees of slow, in addition to kernel being smarter about multiple cores and SMT siblings than your application.

Kernel can run your code on a cooled core, giving it higher clock, for example. Ultimately making it run faster.

Of course this won't show in a benchmark where all the threads do mostly calculation rather than contention, but that's not the typical case. That mostly shows up in compute such as multithreaded video where latency does not matter one bit.

Typically you have more of a producer/consumer pattern where consumer sleeps, and it's beneficial to run it on a cold CPU, assuming the kernel woke it up beforehand.

Source: hit some latency issues with an ancient kernel on a nastily hacked big little architecture ARM machine. It liked to overheat cores and put heavy tasks on the overheated ones for alleged power saving. (Whereas running a task quicker saves power.)

bubbleRefuge6y ago

couldn't you eliminate the bad spinlock behavior by coding them to be go into an efficient wait if to much spinning is going on ?

rwem6y ago

That’s what virtually all battle-hardened lock libraries do: spin for a bit (but not too tightly, using a pause in the loop) then fall back to waiter lists and futex.

anewvillager6y ago

Yes! That's called an optimistic spinning lock!

https://linuxplumbersconf.org/event/4/contributions/286/atta...

CodeWriter236y ago

Or use a mutex and let the scheduler implement equitable dispatching.

kccqzy6y ago

Yes. And then it becomes an adaptive mutex.

shin_lao6y ago

Most mutexes implementations spin before truly acquiring the lock.

Also, there are better spinlock implementations, such as speculative spinlocks, and queued locks.

CoolGuySteve6y ago

atq21196y ago

This is the time for whole benchmark run, not for an individual lock/unlock. The article is quite clear on that.

paulintrognon6y ago

Fun explanation of what a Mutex is: https://stackoverflow.com/questions/34524/what-is-a-mutex

j / k navigate · click thread line to collapse

145 comments

twic6y ago

The author has an implicit definition of "faster" which it is important to be aware of.

The main use of spinlocks that i'm aware of is minimising latency in inter-processor communication.

simias6y ago

>The main use of spinlocks that i'm aware of is minimising latency in inter-processor communication.

lmilcin6y ago

vba6y ago

TFA? Tried to google but unsuccessful

NeedMoreTea6y ago

I've seen lots of valid variants, including and probably not limited to:

The feature article ; the freaking article ; the friendly article ; the f$%^&*( article, c.f. RTFM.

1 more reply

axaxs6y ago

The (Fuc|Frea)king Article. In a similar spirit to RTFM.

hyperion20106y ago

The aForementioned Article

the F is usually something else though ...

2 more replies

loeg6y ago

The Fine Article.

wgjordan6y ago

The (forementioned) article, I would imagine.

thatsenough6y ago

The Fabulous Article.

ddebernardy6y ago

TFA is Slashdot jargon for "The F'ing Article".

1 more reply

rcxdude6y ago

trasz6y ago

And even that is only true for systems that don’t use interrupt threads.

sillysaurusx6y ago

rcxdude6y ago

extropy6y ago

As far as I'm aware that's exactly how you would implement a spinlock.

Random Google search seems to validate that.

https://stackoverflow.com/questions/1383363/is-my-spin-lock-...

sillysaurusx6y ago

Compare-and-swap isn't quite the same as atomic increment. An atomic increment can't fail; it always increments. Whereas CAS will fail if a different thread has modified the value.

Each writer has a "publish" counter, and the "claim" counter isn't allowed to move beyond the lowest "publish" counter modulo ringbuffer size.

Takeaway: there are no locks. Just increments.

1 more reply

gpderetta6y ago

jdblair6y ago

You're discounting energy use. This is a bad strategy on a battery powered device.

sephamorr6y ago

corysama6y ago

Intel has a low-power PAUSE instruction that is literally a ‘rep nop’. I assume Arm has one too.

1 more reply

gpderetta6y ago

Yes, you wouldn't use this strategy on a battery powered device. It ia for very specialised applications.

rumanator6y ago

> and nothing else is supposed to run on those cores

That's quite the corner case.

titzer6y ago

1 more reply

shaklee36y ago

This is the normal use case for any DPDK software. I think anyone involved in HPC or high-speed networking knows that this is pretty common.

gpderetta6y ago

Yes, in practice you have to dedicate the whole machine for a specific application, but the one thread per isolated core is a proven one for high performance/low latency applications.

Snelius6y ago

Absolutely! spinlocks is an inter-processor mechanism and mutex inter-processes. Principal difference.

jakemal6y ago

What is a use case for optimizing interprocessor latency?

anonymoushn6y ago

Hello,

So one use case is "transforming ticks into trading decisions slightly faster than you otherwise could"

twic6y ago

Comment-OP here - this is also more or less my use case.

Whether there is a use case for spinlocks outside low-latency trading, i have no idea!

3 more replies

dfox6y ago

brobinson6y ago

Check out DPDK/VPP and the LMAX Disruptor paper.

shitgoose6y ago

And do not forget Chronicle Queue.

lonelappde6y ago

Finishing a high priority job sooner.

monocasa6y ago

Like TLB shootdown IPIs.

nathanielherman6y ago

Heavy contention run:

  $ cargo run --release 4 64 10000 100
      Finished release [optimized] target(s) in 0.01s
      Running `target/release/lock-bench 4 64 10000 100`
  Options {
      n_threads: 4,
      n_locks: 64,
      n_ops: 10000,
      n_rounds: 100,
  }

  std::sync::Mutex     avg 2.822382ms   min 1.459601ms   max 3.342966ms  
  parking_lot::Mutex   avg 1.070323ms   min 760.52µs     max 1.212874ms  
  spin::Mutex          avg 879.457µs    min 681.836µs    max 990.38µs    
  AmdSpinlock          avg 915.096µs    min 445.494µs    max 1.003548ms  

  std::sync::Mutex     avg 2.832905ms   min 2.227285ms   max 3.46791ms   
  parking_lot::Mutex   avg 1.059368ms   min 507.346µs    max 1.263203ms  
  spin::Mutex          avg 873.197µs    min 432.016µs    max 1.062487ms  
  AmdSpinlock          avg 916.393µs    min 568.889µs    max 1.024317ms

Extreme contention run:

  $ cargo run --release 4 2 10000 100
      Finished release [optimized] target(s) in 0.01s
      Running `target/release/lock-bench 4 2 10000 100`
  Options {
      n_threads: 4,
      n_locks: 2,
      n_ops: 10000,
      n_rounds: 100,
  }

  std::sync::Mutex     avg 4.552701ms   min 2.699316ms   max 5.42634ms   
  parking_lot::Mutex   avg 2.802124ms   min 1.398002ms   max 4.798426ms  
  spin::Mutex          avg 3.596568ms   min 1.66903ms    max 4.290803ms  
  AmdSpinlock          avg 3.470115ms   min 1.707714ms   max 4.118536ms  

  std::sync::Mutex     avg 4.486896ms   min 2.536907ms   max 5.821404ms  
  parking_lot::Mutex   avg 2.712171ms   min 1.508037ms   max 5.44592ms   
  spin::Mutex          avg 3.563192ms   min 1.700003ms   max 4.264851ms  
  AmdSpinlock          avg 3.643592ms   min 2.208522ms   max 4.856297ms

hinkley6y ago

rwem6y ago

If you only have 4 threads it is likely that all your CPUs are sharing caches and you won't see the real downside of the spinlock. They don't really fall apart until you have several sockets.

nathanielherman6y ago

Note that I get a similar speedup with 6 and 8 threads on my Mac (which has 8 logical CPUs)

ww5206y ago

Basically the user mode mutex lock is implemented as:

    // In user-mode, if the lock flag is free as 0, lock it to 1 and exit
    while !atomic_compare_and_swap(&lock, 0, 1)
        // Lock not free, sleeps until the flag is changed back to 0
        futex_wait(&lock, 0)  // syscall to kernel

[1] https://github.com/lattera/glibc/blob/master/nptl/pthread_mu...

Edit: the atomic_compare_and_swap can be just a macro to the assembly CMPXCHG, so it's very fast to acquire a lock if no one else holding the lock.

bluejekyll6y ago

https://github.com/rust-lang/rust/blob/master/src/libstd/sys...

Looks like it’s relying glibc for lock elision.

Mind you the parking_lot::Mutex in the article is not the stdlib implementation, documented here: https://docs.rs/parking_lot/0.10.0/parking_lot/type.Mutex.ht...

And that looks like it’s not using pthread, instead relying on primitives: https://github.com/Amanieu/parking_lot/blob/master/src/raw_m...

ww5206y ago

Then it's using the futex implementation. It's very efficient.

bluejekyll6y ago

I updated my comment. The fastest impl, parking_lot, appears to be a ground up atomic based mutex that doesn’t rely on pthread at all.

1 more reply

sharken6y ago

This is pretty much the conclusion in this game related post on spinlocks: https://probablydance.com/2019/12/30/measuring-mutexes-spinl...

jakswa6y ago

Oh wow/yikes, Linus Torvalds commented on that recently: https://www.realworldtech.com/forum/?threadid=189711&curpost...

There's a reason why you can find decades of academic papers on locking. Really. It's hard."

hinkley6y ago

ncmncm6y ago

Back around 2012 I worked with a guy, a FreeBSD kernel committer, who insisted volatile was sufficient as a thread synchronization primitive. He convinced our boss.

1 more reply

0x0aff3746686y ago

dcolkitt6y ago

> Second, the uncontended case looks like > Parking_lot::Mutex avg 6ms min 4ms max 9ms

This estimate is way too high for the uncontested mutex case. On a modern Linux/Xeon system using GCC, an uncontested mutex lock/unlock is well under 1 microsecond.

The benchmark used in the blog post must be spending the vast majority of its time in some section of code that has nothing to do with lock/unlock.

gpm6y ago

You're misreading the benchmark, that's 6ms for 10,000 lock/unlocks per thread, 320,000 lock/unlocks total. In other words 0.6 microseconds per thread per lock.

rwem6y ago

That's still unreasonably high, isn't it? Even a Go sync.Mutex, not exactly a hot-rod implementation, can be acquired and released in < 50ns on the garbage hardware I have before me.

gpderetta6y ago

On Intel (and probably very similar on AMD) the cost of a completely uncontented, cache hit, simple spin lock acquisition is ~20 clock cycles while the release is almost free.

atq21196y ago

This is the time for whole benchmark run, not for an individual lock/unlock. The article is quite clear on that.

ncmncm6y ago

As we say in low-latency finance, "a microsecond is an eternity."

If you have threads interacting, whether via mutexes or spinlocks, you have a high-latency system.

kcolford6y ago

temac6y ago

A spinlock in the kernel typically only spins. You use it for the cases when you can't schedule... So it better will be for a short time only.

extropy6y ago

How about a new opcode wait till memory address read equals? That would allow implementing a power efficient spinlock.

Oh there is one already. Meet PAUSE: https://www.felixcloutier.com/x86/pause

Edit: related post from 2018 https://news.ycombinator.com/item?id=17336853

dbaupp6y ago

The benchmarked spin-locks are using it, via https://doc.rust-lang.org/std/sync/atomic/fn.spin_loop_hint....

Implementation: https://doc.rust-lang.org/src/core/hint.rs.html#64-93

jnordwick6y ago

gok6y ago

Diggsey6y ago

Actually, the parking_lot mutex is fair: https://docs.rs/parking_lot/0.10.0/parking_lot/type.Mutex.ht...

The high waiting variance is because the benchmark randomly decides which locks to take, meaning that the amount of contention is variable.

sanxiyn6y ago

Note that fair locking is still expensive. parking_lot achieves both speed and fairness by starting with unfair locking and falling back to fair locking.

matklad6y ago

Benchmark uses fixed seeds, there’s no inherent randomness.

jnordwick6y ago

I think the linux futex can be implemented through the VDSO (can somebody correct me on this), so that eliminates the worse of the sycall costs.

His benchmark is weird, but maybe I'm reading it wrong:

* Also, I didn't see him pin any of the threads, so all those threads will start sharing some cpus since the OS does need to be doing some work on a them too.

The worst cases for the spinlock are not understanding scheduling costs and the cache thrashing that can occur.

(Does anybody know if AMD's new microarchitecture went away from the MOESI/directory based cache towards Intel's MESIF/snoop model?)

And if you run under an MESIF-based cache and you can keep your traffic local to your L3 segment, you are backstopped there and never need to go anywhere else.

A spinlock is a performance optimization and should be treated as one. You need to have intimate knowledge of the architecture you are running under and the resources being used.

(edit: answered my own question, apparently the vdso is still pretty limited in what it exports, so no. it isn't happening at this time from what i can tell.)

atq21196y ago

The locks should be on separate cachelines, that's what the CachePadded::new is for.

futex cannot be implemented in the VDSO since it needs to call into the scheduler.

jnordwick6y ago

> The locks should be on separate cachelines, that's what the CachePadded::new is for

I see that now. I looking for it in the AmdSpinlock struct, but that kind of makes sense.

He still seems to be getting some contention, and I'm trying figure out how.

gpderetta6y ago

> linux futex can be implemented through the VDSO (can somebody correct me on this), so that eliminates the worse of the sycall costs.

The semantic of a futex wait is a request to the kernel to put the thread to sleep until another thread issues a signal for the same "key".

The trick is that the key is the address of a 32 bit memory location (provided to the kernel in both the signal and wait syscalls), which is normally the mutex address.

any16y ago

According to "man vdso", futex is not in vDSO on any architecture. What eliminates the syscall cost is that it's rarely called due to elision.

A mutex may try to spin more than a "couple of times" before it calls futex. Example: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_...

rrss6y ago

> is that it's rarely called due to elision.

Calling this 'elision' is a bit confusing, since the glibc developers use the term 'lock elision' to refer to hardware lock elision via Intel TSX extensions.

hackworks6y ago

Not an expert here.

On NUMA, I think flushing is more expensive. Hence, spin locks have an additional overhead of having to load and evaluate on every spin as against being woken up for mutexes (like a callback)

oppositelock6y ago

I've been writing MP code since the early 90's; SPARC, MIPS, x86, PPC, SH-4, Alpha, and all have instructions to make this easier for you.

rwem6y ago

It doesn’t flush the entire cache (that would be a disaster) but it does shoot down the cache line containing the lock in all cores other than the one that acquired the lock.

the84726y ago

> but it does shoot down the cache line containing the lock in all cores other than the one that acquired the lock.

Well, as long as you do the test-CAS instead of the pure-CAS approach not every loop iteration results in cache line bouncing.

Plus intel has introduced the MWAIT[0] instruction to implement something similar to futex in hardware, i.e. the hyperthread can sleep until another core updates the cacheline in question.

[0] https://www.felixcloutier.com/x86/mwait

atq21196y ago

That's true, though MWAIT can only be used from kernel mode. At least the docs say that you get a #UD exception if you attempt to use it from user mode.

1 more reply

xtacy6y ago

Aren't most modern spinlock implementations are ticket-based, which ensures fairness among waiters (FIFO-like)? The linux kernel implementations most definitely are.

I agree that the naive implementations are not.

temac6y ago

Completely fair locks can have extremely terrible throughput performance, so depending what you are doing with them, they might not be a good idea...

elteto6y ago

> checked in a tight loop by all waiters

Note: in some architectures (x86?) you could possibly do away with atomics, since (I believe) int updates are atomic by default. Not really sure though.

xtacy6y ago

Yep, that's the underlying principle behind Mellor-Crummey Spinlock: https://lwn.net/Articles/590243/.

elteto6y ago

Yes! The MCS spinlock, the name eluded me. The paper is actually a pretty good read (it is linked in the lwn article).

gpderetta6y ago

> you could possibly do away atomics, since (I believe) int updates are atomic by default.

The release can be a compiler only barrier followed by a simple store, but you do need an atomic RMW in the acquire path.

Edit:fixed barrier name typo.

[1] https://en.m.wikipedia.org/wiki/Peterson%27s_algorithm

temac6y ago

> #StoreRelease memory barrier even on Intel

What is that? x86 is TSO...

Do you have an example of the full acquire and release sections in x86 assembly?

1 more reply

hackworks6y ago

The additional delay it added for checking the state resulted in deadlock timer triggering a dump! Hence, checking a variable is very expensive.

mareko6y ago

If you're interested in very efficient MCS style reader writer locks, check out this paper:

https://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.p...

lowbloodsugar6y ago

Then the benchmark uses spinlocks in situations that spinlocks aren't great for. And, surprise, spinlocks are slower, than spinlocks-with-mutexes.

Spinlocks are great in situations such as:

  * There are far more resources than threads,
  * The probability of actually having to spin is small,
    ideally if the time spent in the lock is a few instructions
  * When you can't use optimistic concurrency*

* because perhaps the number of memory locations to track is too complicated for my poor brain and I can't be arsed to remember how TLA+ works

There's plenty of linked list implementations, for example, that use optimistic concurrency. At that point you've got yourself a thread-safe message queue and that might be better than mutexes, too.

jakswa6y ago

Ericson23146y ago

As the person that added a bunch of functionality to spin-rs making it roughly match the std API, yes you should not use spinlocks in scheduled code.

lilyball6y ago

I'm really curious how macOS's os_unfair_lock compares here.

LeoNatan256y ago

Is that implementation open source? Don’t remember which dyld contains it.

gok6y ago

https://github.com/apple/darwin-libplatform/blob/master/src/...

AnanasAttack6y ago

AstralStorm6y ago

There are various degrees of slow, in addition to kernel being smarter about multiple cores and SMT siblings than your application.

Kernel can run your code on a cooled core, giving it higher clock, for example. Ultimately making it run faster.

Typically you have more of a producer/consumer pattern where consumer sleeps, and it's beneficial to run it on a cold CPU, assuming the kernel woke it up beforehand.

bubbleRefuge6y ago

couldn't you eliminate the bad spinlock behavior by coding them to be go into an efficient wait if to much spinning is going on ?

rwem6y ago

That’s what virtually all battle-hardened lock libraries do: spin for a bit (but not too tightly, using a pause in the loop) then fall back to waiter lists and futex.

anewvillager6y ago

Yes! That's called an optimistic spinning lock!

https://linuxplumbersconf.org/event/4/contributions/286/atta...

CodeWriter236y ago

Or use a mutex and let the scheduler implement equitable dispatching.

kccqzy6y ago

Yes. And then it becomes an adaptive mutex.

shin_lao6y ago

Most mutexes implementations spin before truly acquiring the lock.

Also, there are better spinlock implementations, such as speculative spinlocks, and queued locks.

CoolGuySteve6y ago

atq21196y ago