CPU Pinning and CPU Sets (2020)

(netmeister.org)

55 points | arnold_palmur | 4y ago | 39 comments

sm_ts 4y ago

I've maintained a QEMU fork with pinning support, and even coauthored a research paper on Linux pinning performance, and the results have been... underwhelming; "sadly" the Linux kernel does a pretty good job at scheduling :)

I advise pinning users to carefully measure the supposed performance improvement, as there is a tangible risk of spending time on imaginary gains.
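For anyone who wants to follow that advice, here is a minimal sketch of a pinned-versus-unpinned timing comparison. It assumes Linux (`os.sched_setaffinity` is not available elsewhere), and `busy_work` is a made-up stand-in for whatever workload you actually care about. A toy loop like this will likely show little difference either way, which is rather the point.

```python
import os
import time


def busy_work(n: int = 200_000) -> int:
    """CPU-bound stand-in workload (hypothetical); swap in your real task."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_runs(repeats: int = 5) -> list[float]:
    """Wall-clock seconds for each of `repeats` runs of busy_work."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        busy_work()
        samples.append(time.perf_counter() - start)
    return samples


def compare_pinning(repeats: int = 5) -> dict[str, list[float]]:
    """Time the workload unpinned, then pinned to one allowed CPU (Linux only)."""
    original = os.sched_getaffinity(0)
    results = {"unpinned": time_runs(repeats)}
    try:
        os.sched_setaffinity(0, {min(original)})
        results["pinned"] = time_runs(repeats)
    finally:
        os.sched_setaffinity(0, original)  # always restore the original mask
    return results


if __name__ == "__main__":
    for label, samples in compare_pinning().items():
        print(f"{label}: min={min(samples):.4f}s max={max(samples):.4f}s")
```

Looking at the spread (min and max) rather than a single average matters here, since several replies below report that the gain shows up as latency consistency rather than throughput.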

bravetraveler 4y ago

I found the most gains in terms of... latency consistency. I had a VM with a GPU passed through for gaming. With the cores appropriately pinned, especially away from host tasks, there were no more random DPC latency spikes.

With no pinning they'd randomly go into the milliseconds -- with pinning it would stay in the microsecond range!

The result of this is games (and likely audio) performing much more favorably.

How much of this is cache coherency/in-fighting, scheduling, or simply host usage, I couldn't tell you. I was just happy to have my VM 'feel' native.

There will always be a benefit with pinning vCPUs on the same NUMA nodes as their devices (VFIO or even SR-IOV). This is becoming increasingly important on hypervisors.

mochomocha 4y ago

In a setup with a high level of container colocation on large EC2 instances, we've seen the opposite behavior at Netflix: default CFS performing badly. We've A/B tested our flavor of custom pinning and measured substantial benefits: https://netflixtechblog.com/predictive-cpu-isolation-of-cont...

PMC data at scale is pretty clear: very often, CFS won't do the right thing and will leave bad HT neighbors on the same core, leading to L1 thrashing, or keep a high level of imbalance between NUMA sockets, leading to a degraded LLC hit rate.

sm_ts 4y ago

Thanks, that's a very interesting case.

I correct my statement with "_did_ a good job", and appreciate rigorous testing.

waynesonfire 4y ago

Not sure how you maintaining QEMU makes you a credible source for evaluating a scheduler's performance. It's apparent to me the performance of the scheduler is a function of the workload, so YMMV.

I worked on a project where we collected detailed production runtime characteristics and evaluated scheduler algorithms against them. Tiny improvements made for massive savings.

sm_ts 4y ago

I definitely correct my "does a good job" to "did a good job". But ultimately, I advised a good deal of caution, which I think is fair, particularly considering that only a small fraction of companies operate at a compute scale where tiny improvements make for massive savings.

wmf 4y ago

At my last job we initially saw performance loss due to pinning; I think multiple QEMU I/O threads got pinned to a single CPU. It's very easy to do it wrong.
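A cheap guard against that kind of mistake is to sanity-check the pinning plan before applying it. The sketch below is only an illustration: the thread names and CPU numbers in `plan` are invented, and `find_shared_cpus` is a hypothetical helper, not something QEMU or libvirt provides.

```python
from collections import defaultdict


def find_shared_cpus(plan: dict[str, set[int]]) -> dict[int, list[str]]:
    """Return CPUs that more than one thread is pinned to, with the thread names."""
    users = defaultdict(list)
    for thread, cpus in plan.items():
        for cpu in cpus:
            users[cpu].append(thread)
    return {cpu: sorted(names) for cpu, names in sorted(users.items()) if len(names) > 1}


# Hypothetical plan: thread names and CPU numbers are made up for illustration.
plan = {
    "vcpu0": {2},
    "vcpu1": {3},
    "iothread0": {4},
    "iothread1": {4},  # oops: same CPU as iothread0
}
print(find_shared_cpus(plan))  # → {4: ['iothread0', 'iothread1']}
```

Sharing a CPU is sometimes intentional, so treat the result as a list of things to double-check rather than as errors.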

guilhas 4y ago

I have looked around a bit. It's complicated to get right, with very slight performance gains, or so most people doing it for gaming report.

mochomocha 4y ago

YMMV. We've seen millions of dollars' worth of cloud savings at Netflix from doing pinning right. Knowing that the task scheduler is also heavily forked in Google's kernel, I'm ready to bet they've seen an order of magnitude higher savings in their own DCs as well.

Agingcoder 4y ago

Agreed, in my case it became very useful on large boxes (96 physical cores). The performance gain was about 10%.

darnir 4y ago

Would you mind sharing the paper on pinning? I'd be interested.

sm_ts 4y ago

Hello! I'll write you via email.

foton1981 4y ago

Kubernetes makes CPU pinning rather simple. You just need to meet the conditions for the Guaranteed QoS class. https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana...

We are running lots of Erlang on k8s and CPU pinning improves performance of Erlang schedulers tremendously.
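To spell out the "conditions" part: as I understand the Kubernetes documentation, exclusive cores are handed out when the kubelet runs the static CPU manager policy and a container in a Guaranteed pod requests a whole number of CPUs. Below is a rough per-container check of those conditions. The function names are mine, the memory comparison is a literal string comparison (a simplification), and a real pod is only Guaranteed if every one of its containers passes.

```python
from fractions import Fraction


def parse_cpu(quantity) -> Fraction:
    """Parse a Kubernetes CPU quantity such as "2", "1.5" or "500m" into cores."""
    text = str(quantity)
    if text.endswith("m"):
        return Fraction(int(text[:-1]), 1000)
    return Fraction(text)


def is_guaranteed(container: dict) -> bool:
    """CPU and memory limits are both set, and requests (if given) equal them."""
    resources = container.get("resources", {})
    limits = resources.get("limits", {})
    requests = resources.get("requests", {})
    for name in ("cpu", "memory"):
        if name not in limits:
            return False
        limit = limits[name]
        request = requests.get(name, limit)  # an omitted request defaults to the limit
        if name == "cpu":
            if parse_cpu(request) != parse_cpu(limit):
                return False
        elif str(request) != str(limit):  # simplification: "1Gi" vs "1024Mi" would differ
            return False
    return True


def gets_exclusive_cpus(container: dict) -> bool:
    """Guaranteed resources plus a whole-number CPU count of at least one core."""
    if not is_guaranteed(container):
        return False
    cpu = parse_cpu(container["resources"]["limits"]["cpu"])
    return cpu >= 1 and cpu.denominator == 1


# Hypothetical container spec, written as the dict you would get from parsed YAML.
print(gets_exclusive_cpus({"resources": {"limits": {"cpu": "2", "memory": "1Gi"}}}))
```

A container asking for `1500m` would still be Guaranteed but, under the same reading of the docs, would run in the shared pool instead of getting dedicated cores.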

bogomipz 4y ago

Interesting. I would be curious to hear why pinning here improves performance. Is this something specific to the BEAM VM? Does this come at a cost to K8s scheduler flexibility?

toast0 4y ago

I don't have experience with k8s, but with BEAM on a traditional system, if BEAM is using the bulk of your CPU, you'll tend to get better results if each of the (main) BEAM scheduler threads is pinned to one CPU thread. Then all of the BEAM scheduler balancing can work properly. If both the OS and BEAM are trying to balance things, you can end up with a lot of extra task movement or extra lock contention when a BEAM thread gets descheduled by the OS to run a different BEAM thread that wants the same lock.

On most of the systems I ran, we didn't tend to have much of anything running on BEAM's dirty schedulers or other OS processes. If you have more of a mix of things, leaving things unpinned may work better.

versale 4y ago

Is your setup open source? I'd love to know more about the upsides of Erlang/OTP on top of k8s. Do you use hot code reloads?

Sohcahtoa82 4y ago

Tangential, but does anyone know of a Windows utility for automatically pinning processes?

I like to keep up with several cryptocurrency prices on Coinbase, but the Coinbase Pro pages consume a pretty significant amount of CPU time. I'd love to be able to just shove all of those processes to a single CPU thread to reduce the impact on overall system performance.

I suppose it wouldn't be too hard to write a Python script that does this automatically: scan window titles to look for "Coinbase Pro", find the owning PID, then call SetAffinity...
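A rough sketch of the last step of such a script, using only `ctypes` from the standard library. It covers setting the affinity for a PID you already know; finding the right PIDs from window titles is left out, and with a multi-process browser the tab you care about may not be the process that owns the window. The Win32 calls used (`OpenProcess`, `SetProcessAffinityMask`, `CloseHandle`) are real, but treat the whole thing as untested on your machine. Machines with more than 64 logical CPUs use processor groups, which this ignores.

```python
import ctypes
import sys

PROCESS_SET_INFORMATION = 0x0200  # access right needed by SetProcessAffinityMask


def affinity_mask(cpus) -> int:
    """Build the bitmask Windows expects: bit N set means logical CPU N is allowed."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def pin_process(pid: int, cpus) -> None:
    """Restrict process `pid` to the given logical CPUs (Windows only)."""
    if sys.platform != "win32":
        raise OSError("SetProcessAffinityMask is a Windows API")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    # Declare signatures so 64-bit handles and masks are not truncated.
    kernel32.OpenProcess.restype = ctypes.c_void_p
    kernel32.OpenProcess.argtypes = [ctypes.c_uint32, ctypes.c_int, ctypes.c_uint32]
    kernel32.SetProcessAffinityMask.restype = ctypes.c_int
    kernel32.SetProcessAffinityMask.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    kernel32.CloseHandle.argtypes = [ctypes.c_void_p]

    handle = kernel32.OpenProcess(PROCESS_SET_INFORMATION, False, pid)
    if not handle:
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        if not kernel32.SetProcessAffinityMask(handle, affinity_mask(cpus)):
            raise ctypes.WinError(ctypes.get_last_error())
    finally:
        kernel32.CloseHandle(handle)
```

To survive browser restarts, something would have to call `pin_process` again for the new PIDs, for example from a scheduled task or a small polling loop.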

ayende 4y ago

The Windows Task Manager can set process affinity.

Sohcahtoa82 4y ago

Well, yeah, but I'm looking for a way to automate it. If I restart Firefox, all those affinities get reset.

bogomipz 4y ago

This class looks great. I noticed the course page states:

>"This class overlaps significantly with CS392 ``Systems Programming'' -- if you have taken this class, please talk to me in person before trying to register for CS631."[1]

Does anyone know if the videos for CS392 might also be online? I tried some basic URL substitutions, but came up empty.

[1] https://stevens.netmeister.org/631/

nuclx 4y ago

Does anyone know how the methods mentioned by the author map to 'taskset'?

StillBored 4y ago

Or numactl; the latter is where this really starts to make a lot of sense. The perf improvements from keeping individual threads/processes pinned to a small core group (say, sharing an L2 cache on Arm machines) tend to be fairly trivial compared to what happens when something gets migrated to a different NUMA node with a large latency to the memory/resident cache data.
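On the `taskset` half of the question: I can't speak to how the article's methods map across, but on Linux `taskset` is a front end to the `sched_setaffinity(2)` system call, which Python exposes as `os.sched_setaffinity`. A minimal sketch of the equivalent of `taskset -cp LIST PID` follows. `parse_cpu_list` is my own helper and only handles plain lists and ranges, not the `0-10:2` stride syntax that `taskset` also accepts.

```python
import os


def parse_cpu_list(spec: str) -> set[int]:
    """Parse a taskset-style list such as "0-3,8" into a set of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_like_taskset(spec: str, pid: int = 0) -> set[int]:
    """Rough equivalent of `taskset -cp SPEC PID` on Linux; pid 0 means the caller."""
    os.sched_setaffinity(pid, parse_cpu_list(spec))
    return os.sched_getaffinity(pid)  # report the mask that is now in effect
```

Note that this only constrains where the task may run. It does nothing about where its memory lives, which is the part `numactl` adds.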

1_player 4y ago

CPU pinning is pretty useful for virtual machines, e.g. I've used it myself to improve performance on a VFIO setup by limiting which cores QEMU runs on, thus improving cache locality.

https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF#CP...

What are other real-world uses of CPU pinning?

jandrewrogers 4y ago

Databases and other high-throughput data infrastructure software use CPU pinning, as does HPC. The reasons are similar: higher cache locality, reduced latency, and more predictable scheduling. It is most useful when the process is taking over part or all of the resources of the machine anyway.

mugsie 4y ago

Memory and PCIe lanes in larger systems can be attached to particular CPUs, or to subsections of a single CPU (AMD Threadrippers / Epycs in particular), where traversing the inter-CPU / CCX links can cause latency or bandwidth issues.

The software will be pinned to CPU cores close to the RAM or PCIe device it is using.

I've only really seen it be an issue in crazy large scale systems, or where you have 4 CPUs, but I haven't spent a huge amount of time on microsecond-critical workloads.
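On Linux, the "which cores are close to this device" question can usually be answered from sysfs. A minimal sketch, assuming the standard `numa_node` and `cpulist` files are present; the `sysfs` parameter exists only so the functions can be pointed at a test directory, and any PCI address you pass in is your own.

```python
from pathlib import Path


def device_numa_node(pci_address: str, sysfs: Path = Path("/sys")) -> int:
    """NUMA node a PCI device reports on Linux; -1 means the kernel does not know."""
    return int((sysfs / "bus/pci/devices" / pci_address / "numa_node").read_text())


def node_cpulist(node: int, sysfs: Path = Path("/sys")) -> str:
    """CPU list string for a NUMA node, in the kernel's "0-7,16-23" style."""
    return (sysfs / f"devices/system/node/node{node}/cpulist").read_text().strip()


def cpus_near_device(pci_address: str, sysfs: Path = Path("/sys")):
    """CPU list local to the device's NUMA node, or None if no node is reported."""
    node = device_numa_node(pci_address, sysfs)
    return None if node < 0 else node_cpulist(node, sysfs)
```

The returned string is in the same list format that `taskset -c` accepts, so it can be passed along as-is.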

amarshall 4y ago

Isn’t this particular issue partially solved with proper NUMA support in whatever kernel or scheduler is being used?

spacechild1 4y ago

The Supernova audio server (https://github.com/supercollider/supercollider/tree/develop/...) pins each thread of its DSP thread pool to a dedicated core.

gpderetta 4y ago

When implementing one-thread-per-core software architectures, explicit pinning is pretty much a requirement.
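To illustrate the shape of that on Linux: passing pid 0 to `sched_setaffinity` applies to the calling thread only, so each worker can pin itself as its first action. The sketch below shows just the pinning mechanics. `run_pinned_workers` is a name I made up, and in CPython the GIL means these threads will not actually run Python code in parallel, so real thread-per-core designs are normally written in C, C++ or Rust.

```python
import os
import threading


def run_pinned_workers(work, cpus=None) -> dict:
    """Start one thread per CPU in `cpus`, pin each to its CPU, collect results."""
    cpus = sorted(cpus if cpus is not None else os.sched_getaffinity(0))
    results = {}

    def worker(cpu: int) -> None:
        # On Linux, pid 0 means the calling thread, so only this thread is pinned.
        os.sched_setaffinity(0, {cpu})
        results[cpu] = work(cpu)

    threads = [threading.Thread(target=worker, args=(cpu,)) for cpu in cpus]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    # Each worker reports the affinity mask it actually ended up with.
    print(run_pinned_workers(lambda cpu: os.sched_getaffinity(0)))
```

The main thread keeps its original mask, which is usually what you want for a coordinator that only starts and joins the workers.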

lclarkmichalek 4y ago

Much cheaper than CPU cgroups if you want some coarse-grained isolation when stacking workloads.

inetknght 4y ago

CPU pinning can be particularly important if you're running virtual machines and/or hyperthreading-friendly workloads.

jeffbee 4y ago

Glad you mentioned hyperthreading. That can be easy to overlook. You reserved CPU 1 for a given workload? Did you remember CPU 49 as well?

thanatos519 4y ago

The main point of HT is to reduce the cost of context switching by keeping twice the number of contexts close to the core. I would guess that parts of the process context like program counter, TLB, etc live inside the 'HT' and would have to be saved/restored every time the process moves between threads, even on the same core. Reserving both 'HT' on a core gets you cache locality, but isn't there a cost to moving the process back and forth, even if that data is in L1/L2?

(I'm looking at 'lstopo' from package 'hwloc', Linux on my Haswell Xeon: 10MB shared L3, 256KB L2, 32KB L1{d,i} per core)

Given my (educated) guess, I've told irqbalance to put interrupts only on 'thread 0' and then I schedule cpu-intensive tasks to 'thread 1' and schedule them very-not-nicely. Linux seems pretty good about keeping everything else on 'thread 0' when I have 'thread 1' busy so I don't do any further management.

I can have 4 cores 'thread 1' pegged at 100% with no impact on interactive or I/O performance.

jeffbee 4y ago

In the context of the article, if you are trying to keep foreign processes "off my cores" then you can't neglect to keep them off the adjacent hyperthreads, because those share some of the resources. If you have 8 threads on 4 cores then, at least the way Linux counts them, CPUs 0 and 4 share some caches and all backend execution resources. So if you have isolated CPU 0 but not CPU 4, you might as well not have done anything at all.
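Rather than guessing the numbering, Linux exposes each logical CPU's hyperthread siblings in sysfs. A minimal sketch that expands a set of CPUs you intend to reserve so that it includes their siblings. The helper names are mine, and the `sysfs` parameter is only there so the code can be pointed at a test directory.

```python
from pathlib import Path


def parse_cpu_list(spec: str) -> set[int]:
    """Parse a kernel-style CPU list such as "0,4" or "0-1" into a set."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def thread_siblings(cpu: int, sysfs: Path = Path("/sys")) -> set[int]:
    """Logical CPUs sharing a physical core with `cpu` (including `cpu` itself)."""
    path = sysfs / f"devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    return parse_cpu_list(path.read_text())


def with_siblings(cpus, sysfs: Path = Path("/sys")) -> set[int]:
    """Expand a set of CPUs so it also covers every hyperthread sibling."""
    expanded = set()
    for cpu in cpus:
        expanded |= thread_siblings(cpu, sysfs)
    return expanded


if __name__ == "__main__":
    # Which CPUs must be reserved together with CPU 1 on this machine?
    print(sorted(with_siblings({1})))
```

On a machine laid out as described above, reserving CPU 0 would come back as CPUs 0 and 4. Other machines number siblings adjacently, which is why reading the topology beats assuming it.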

krona 4y ago

True, however, CPU pinning is not the same as reserving/isolating the CPU. This is often not made clear in articles about CPU pinning.
