%CPU utilization is a lie

(brendanlong.com)

437 points | BrendanLong | 8mo ago | 168 comments

ot 8mo ago

Utilization is not a lie, it is a measurement of a well-defined quantity, but people make assumptions to extrapolate capacity models from it, and that is where reality diverges from expectations.

Hyperthreading (SMT) and Turbo (clock scaling) are only a part of the variables causing non-linearity, there are a number of other resources that are shared across cores and "run out" as load increases, like memory bandwidth, interconnect capacity, processor caches. Some bottlenecks might come even from the software, like spinlocks, which have non-linear impact on utilization.

Furthermore, most CPU utilization metrics average over very long windows, from several seconds to a minute, but what really matters for the performance of a latency-sensitive server happens in the time-scale of tens to hundreds of milliseconds, and a multi-second average will not distinguish a bursty behavior from a smooth one. The latter has likely much more capacity to scale up.

Unfortunately, the suggested approach is not that accurate either, because it hinges on two inherently unstable concepts:

> Benchmark how much work your server can do before having errors or unacceptable latency.

The measurement of this is extremely noisy, as you want to detect the point where the server starts becoming unstable. Even if you look at a very simple queueing theory model, the derivatives close to saturation explode, so any nondeterministic noise is extremely amplified.

> Report how much work your server is currently doing.

There is rarely a stable definition of "work". Is it RPS? Request cost can vary even throughout the day. Is it instructions? Same, the typical IPC can vary.

Ultimately, the confidence intervals you get from the load testing approach might be as large as what you can get from building an empirical model from utilization measurement, as long as you measure your utilization correctly.
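
To illustrate the point about noise near saturation, here is a minimal sketch assuming the simplest M/M/1 model (one server, Poisson arrivals, exponential service times), where mean response time is S / (1 - rho). The function names are illustrative and not from the thread; real servers will not follow this model exactly.

```python
def mm1_response_time(rho: float, service_time: float = 1.0) -> float:
    """Mean response time of an M/M/1 queue at utilization rho (0 <= rho < 1)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must be in [0, 1)")
    return service_time / (1.0 - rho)


def relative_error_amplification(rho: float) -> float:
    """Ratio of relative change in response time to relative change in load.

    For R = S / (1 - rho), (dR/R) / (drho/rho) = rho / (1 - rho).
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must be in [0, 1)")
    return rho / (1.0 - rho)


if __name__ == "__main__":
    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        print(f"rho={rho:.2f}  R={mm1_response_time(rho):6.1f}x service time  "
              f"amplification={relative_error_amplification(rho):6.1f}x")
```

Under this model a 1% error in measured load becomes roughly a 9% error in response time at 90% utilization, and the factor keeps growing as rho approaches 1.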

eklitzke 8mo ago

I agree. If you actually know what you're doing you can use perf and/or ftrace to get highly detailed processor metrics over short periods of time, and you can see the effects of things like CPU stalls from cache misses, CPU stalls from memory accesses, scheduler effects, and many other things. But most of these metrics are not very actionable anyway (the vast majority of people are not going to know what to do with their IPC or cache hit or branch hit numbers).

What most people care about is some combination of latency and utilization. As a very rough rule of thumb, for many workloads you can get up to about 80% CPU utilization before you start seeing serious impacts on workload latency. Beyond that you can increase utilization but you start seeing your workload latency suffer from all of the effects you mentioned.

To know how much latency is impacted by utilization you need to measure your specific workload. Also, how much you care about latency depends on what you're doing. In many cases people care much more about throughput than latency, so if that's the top metric then optimize for that. If you care about application latency as well as throughput then you need to measure both of those and decide what tradeoffs are acceptable.

tracker1 8mo ago

> There is rarely a stable definition of "work". Is it RPS? Request cost can vary even throughout the day. Is it instructions? Same, the typical IPC can vary.

I think this is probably one of the most important points... similarly, is this public facing work dealing with any kind of user request, or is it simply crunching numbers/data to build an AI model from a stable backlog/queue?

My take has always been with modern multi-core, hyper-threaded CPUs that are burstable is to consider ~60% a "loaded" server. That should have work split if it's that way for any significant portion of a day. Mostly dealing with user-facing services. So bursts and higher traffic portions of the day are dramatically different from lower utilization portions of the day.

A decade ago, this led to a lot of work for cloud provisioning on demand for the heavier load times. Today it's a bit more complicated when you have servers with 100+ cores as an option for under $30k (guesstimate based on $10k CPU price). Today, I'd lean to over-provisioning dedicated server hardware and supplement with cloud services (and/or self-cloud-like on K8s) as pragmatically as reasonable... depending on the services of course. I'm not currently in a position where I have this level of input though.

Just looking at how, as an example, StackOverflow scaled in the early days is even more possible/prudent today to a much larger extent... You can go a very long way with a half/full rack and a 10gb uplink in a colo data center or two.

In any case, for me... >= 65% CPU load for >= 30m/day means it's at 100% effective utilization, and needs expansion relatively soon. Just my own take.

everforward 8mo ago

> In any case, for me... >= 65% CPU load for >= 30m/day means it's at 100% effective utilization, and needs expansion relatively soon.

I think this depends on workload still because IO heavy apps hyperthread well and can push up to 100%. I think most of the apps I've worked on end up being IO bound because "waiting on SQL results" or the more generic "waiting on downstream results" is 90% of their runtime. They might spend more time reading those responses off the wire than they do actually processing anything.

There are definitely workloads where that isn't true, though, and your metrics read about right to me.

p12tic 8mo ago

> Today it's a bit more complicated when you have servers with 100+ cores as an option for under $30k (guesstimate based on $10k CPU price).

If one can buy used, then previous generation 128C 256T epyc server is less than $5k. For homelabs that can accept non-rackmount gear it's less than $3k.

jimmySixDOF 8mo ago

IEEE Hot Interconnects just wrapped up and they discussed latency performance tuning for Ultra Ethernet where it looks smooth on 2- or 5- sec view but at 100ms you see the obvious frame burst effects. If you don't match your profiling to the workload a false negative compounds your original problem by thinking you tested this so better look elsewhere.

SAI_Peregrinus 8mo ago

That's all true, and the % part is still a lie. As you note, CPU utilization isn't linear, and percentages are linear measures. CPU utilization isn't a lie, % CPU utilization is.

ot 8mo ago

It is a linear percentage of the amount of time the CPU is not idle. It is not linear in the amount of useful work, but that's not what "utilization" means.

The lie is the assumption that CPU time is linear in useful work, but that has nothing to do with the definition of utilization, it's just something that people sometimes naively believe.

> CPU utilization isn't a lie, % CPU utilization is

What do you mean by this? Utilization is, by definition, a ratio. % just determines that the scale is in [0, 100].

SirMaster 8mo ago

What about 2 workloads that both register 100% CPU usage, but one workload draws significantly more power and heats the CPU up way more? Seems like that workload is utilizing more of the CPU, more of the transistors or something.

inetknght 8mo ago

Indeed, and there's a thing called "race to sleep". That is, you want to light up as much of the core as possible as fast as possible so you can get the CPU back to idle as soon as possible to save on battery power, because having the CPU active for more time (but not using as many circuits as it "could") draws a lot more power.

saagarjha 8mo ago

Yes, this is pretty normal; your processor will downclock to accommodate. For HPC where the workloads are pretty clearly defined it’s possible to even measure how close you’re coming to the thermal envelope and adjust the workload.

throwaway31131 8mo ago

Percent utilization for most operating systems is the amount of time the idle task is not scheduled. So for both workloads the idle task was never scheduled, hence 100% "utilization".
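
As a rough sketch of that definition on Linux, utilization can be computed from two samples of the aggregate `cpu` line in `/proc/stat` as one minus the idle share of elapsed ticks. Treating `iowait` as idle is an assumption made here (tools differ on this), and the function name is illustrative.

```python
def cpu_utilization(before: list[int], after: list[int]) -> float:
    """Fraction of time not idle between two samples of the 'cpu' line.

    Each sample holds the first eight counters in /proc/stat order:
    user, nice, system, idle, iowait, irq, softirq, steal.
    """
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    idle = deltas[3] + deltas[4]  # idle + iowait (assumption: iowait counts as idle)
    if total <= 0:
        return 0.0
    return 1.0 - idle / total


# Hypothetical samples: 150 ticks elapsed, 75 of them idle or waiting on I/O.
sample_1 = [100, 0, 50, 800, 50, 0, 0, 0]
sample_2 = [150, 0, 75, 850, 75, 0, 0, 0]
utilization = cpu_utilization(sample_1, sample_2)
```

Two workloads that never let the idle counters advance both report 1.0 here, regardless of how much power they draw or how many instructions they retire.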

BrendanLong OP 8mo ago

Some esoteric methods of measuring CPU utilizations are to calculate either the current power usage over the max available power, or the current temperature over the max operating temperature. Unfortunately these are typically even more non-linear than the standard metrics (but they can be useful sometimes).

kqr 8mo ago

It might be a lie, but it surely is a practical one. In my brief foray into site reliability engineering I used CPU utilisation (of CPU-bound tasks) with queueing theory to choose how to scale servers before big events.

The %CPU suggestions ran contrary to (and were much more conservative than) the "old wisdom" that would otherwise have been used. It worked out great at much lower cost than otherwise.

What I'm trying to say is you shouldn't be afraid of using semi-crappy indicators just because they're semi-crappy. If it's the best you got it might be good enough anyway.

In the case of CPU utilisation, though, the number in production shouldn't go above 40 % for many reasons. At 40 % there's usually still a little headroom. The mistake of the author was not using fundamentals of queueing theory to avoid high utilisation!

therealdrag0 8mo ago

> semi-crappy indicator … good enough.

Agree. Another example of this is for metrics as percentiles per host that you have to average, vs histograms per host that get percentile calculated at aggregation time among hosts. Sure an avg/max of a percentile is technically not a percentile, but in practice switching between one or the other hasn’t affected my operations at all. Yet I know some people are adamant about mathematical correctness as if that translates to operations.

arccy 8mo ago

That works ok when you have evenly distributed load (which you want / would hope to have), much less so when your workload is highly unbalanced.
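
A small sketch of the difference being discussed: averaging per-host percentiles versus computing the percentile over the merged samples. The latency values are made up to represent a deliberately unbalanced case, and `percentile` uses the nearest-rank definition, which is one of several in common use.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile (p in (0, 100]) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical latencies in ms: a busy healthy host and a nearly idle slow host.
host_a = [10.0] * 1000
host_b = [500.0] * 10

avg_of_p99 = (percentile(host_a, 99) + percentile(host_b, 99)) / 2
true_p99 = percentile(host_a + host_b, 99)

print(f"average of per-host p99: {avg_of_p99} ms")
print(f"p99 over merged samples: {true_p99} ms")
```

With evenly spread load the two numbers tend to be close; they drift apart as the per-host request counts and latency distributions diverge.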

mayama 8mo ago

The combination of CPU% and loadavg generally tells you how the system is doing. I have had systems where loadavg was high because of waiting on network/IO, but CPU% was low. Tracing high load is not always as straightforward as CPU%, though; you have to go through io%, net%, syscalls, etc.

saagarjha 8mo ago

40% seems quite lightly utilized tbh

cpncrunch 8mo ago

I tend to use 50% as a soft target, which seems like a good compromise. Sometimes it may go a little bit over that, but if it's occasional it shouldn't be an issue.

It's not good to go much over 50% on a server (assuming half the cpus are just hyperthreads), because you're essentially relying on your load being able to share the actual cpu cores. At some point, when the load increases too much, there may not be any headroom left for sharing those physical cpus. You then get to the point where adding a little bit more load to 80% suddenly results in 95% utilization.

kqr 8mo ago

It depends on how variable the load is, compared to how fast the servers can scale up and down, etc. I often have as a rule of thumb to have enough headroom to be able to deal with twice the load while staying within a triple of the response time. You can solve the equations for your specific case, but eyeballing graphs such as [1] I end up somewhere in the area of 40 %.

The important part is of course to ask yourself the question "how much increased load may I need to handle, and how much can I degrade system performance in doing so?" You may work in an industry that only ever sees 10 % additional load at timescales where scaling is unfeasible, and then you can pick a significantly higher normal utilisation level. Or maybe you're in an industry where you cannot degrade performance by more than 10 % even if hit by five times the load – then you need a much, much more conservative target for utilisation.

[1]: https://erikbern.com/assets/wait-time-2.png
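
The "twice the load within triple the response time" rule can be solved in closed form if one assumes an M/M/1 queue, where response time is S / (1 - rho). This is only a sketch under that assumption (the function name is illustrative, and the linked graph may be based on a different model), but it happens to land on the same 40 % figure.

```python
def max_utilization(load_factor: float, slowdown: float) -> float:
    """Highest M/M/1 utilization rho such that multiplying the load by
    load_factor raises mean response time by at most a factor of slowdown.

    From 1 / (1 - m*rho) <= k / (1 - rho):  rho <= (k - 1) / (k*m - 1).
    """
    if load_factor <= 1 or slowdown <= 1:
        raise ValueError("load_factor and slowdown must both exceed 1")
    return (slowdown - 1.0) / (slowdown * load_factor - 1.0)


if __name__ == "__main__":
    # Twice the load, at most triple the response time.
    print(f"{max_utilization(2, 3):.1%}")
    # Five times the load, at most 10 % slower.
    print(f"{max_utilization(5, 1.1):.1%}")
```

The second call corresponds to the stricter scenario in the comment (five times the load with no more than 10 % degradation) and gives a target of only about 2 % under this model.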

paravz 8mo ago

CPU utilization % needs to be contrasted with a "business" metric like latency or RPS. Depending on the environment and hardware, 40% can be too utilized or way underutilized.

zekrioca 8mo ago

I noticed exactly the same thing. The author is saying something that has been repeatedly written in queueing theory books for decades, still they are noticing this only now.

mustache_kimono 8mo ago

Reminds me of Brendan Gregg's "CPU Utilization is Wrong" but this blog fails to discuss that blog's key point that CPU utilization is a measure of whether or not the CPU is busy, including whether the CPU is waiting [0]. That blog also explains that the IPC (instructions per cycle) metric actually measures useful work hidden within that busy state.

[0]: https://www.brendangregg.com/blog/2017-05-09/cpu-utilization...
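
For reference, IPC is just retired instructions divided by CPU cycles over the same interval; both counters are reported by tools such as `perf stat`. The sketch below uses made-up counter values to show how two workloads that are both "100% busy" can differ in useful work.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions retired per CPU cycle over some measurement interval."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


# Hypothetical counters for two workloads, each busy for the same 1e9 cycles.
stalled = ipc(instructions=500_000_000, cycles=1_000_000_000)
compute = ipc(instructions=2_000_000_000, cycles=1_000_000_000)

print(f"stalled workload IPC: {stalled:.2f}")
print(f"compute workload IPC: {compute:.2f}")
print(f"work ratio at equal utilization: {compute / stalled:.1f}x")
```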

4gotunameagain 8mo ago

What's up with Brendans and CPU utilisation concerns? Any Brendan to shine some light?

BrendanLong OP 8mo ago

I'd love to explain, but you'd need to change your name to Brendan first.

PaulKeeble 8mo ago

This is bang on, you can't count the hyperthreads as double the performance, typically they are actually in practice only going to bring 15-30% if the job works well with it and their use will double the latency. Failing to account for loss in clockspeed as the core utilisation climbs is another way it's not linear and in modern software for the desktop it's really something to pay careful attention to.

It should be possible from the information you can get on a CPU from the OS to better estimate utilisation involving at the very least these two factors. It becomes a bit more tricky to start to account for significantly going past the cache or available memory bandwidth and the potential drop in performance to existing threads that occurs from the increased pipeline stalls. But it can definitely be done better than it is currently.
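
One way such an estimate could look, as a sketch only: a piecewise model for 2-way SMT that assumes the scheduler fills one thread per physical core first and that a busy sibling thread adds a fixed fraction `smt_gain` of a core. Both assumptions are simplifications introduced here, and clock scaling is ignored entirely.

```python
def effective_utilization(reported: float, smt_gain: float) -> float:
    """Map OS-reported utilization (0..1 over all logical CPUs) to the share
    of real throughput capacity in use on a 2-way SMT machine.

    Assumes one thread per physical core is used before any sibling thread,
    and that a busy sibling adds smt_gain (e.g. 0.15 to 0.30) of a core.
    """
    if not 0.0 <= reported <= 1.0:
        raise ValueError("reported must be in [0, 1]")
    capacity = 1.0 + smt_gain  # throughput of one fully loaded physical core
    if reported <= 0.5:
        used = 2.0 * reported  # only primary threads are busy
    else:
        used = 1.0 + (2.0 * reported - 1.0) * smt_gain
    return used / capacity


if __name__ == "__main__":
    for reported in (0.25, 0.5, 0.75, 1.0):
        print(f"reported {reported:.0%} -> "
              f"effective {effective_utilization(reported, 0.25):.0%}")
```

With a 25% SMT gain, a machine reporting 50% is already at 80% of its real capacity under this model.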

c2h5oh 8mo ago

To complicate things more HT performance varies wildly between CPU architectures and workloads. e.g. AMD implementation, especially in later Zen cores, is closer to a performance of a full thread than you'd see in Intel CPUs. Provided you are not memory bandwidth starved.

RaftPeople 8mo ago

> To complicate things more HT performance varies wildly between CPU architectures and workloads.

IBM's Power cpu's have also traditionally done a great job with SMT compared to Intel's implementation.

shim__ 8mo ago

What's the difference between Intel's and AMD's approaches?

magicalhippo 8mo ago

For memory-bound applications the scaling can be much better. A renderer I worked on was primarily memory-bound walking the accelerator structure, and saw 60-70% increase from hyperthreads.

But overall yeah.

Sohcahtoa82 8mo ago

Back when I got an i7-3770K (4C/8T), I did a very basic benchmark using POV-Ray.

Going from 1 thread to 2 threads doubled the speed as expected. Going from 2 to 4 doubled it again. Going from 4 to 8 was only ~15% faster.

I imagine you could probably create a contrived benchmark that actually gives you nearly double the performance from SMT, but I don't know what it would look like. Maybe some benchmark that is written to deliberately constantly miss cache?

Side note, I should run that POV-Ray test again. It's been years since I've even used POV-Ray.

tgma 8mo ago

The way they refer to cores in their system is confusing and non-standard. The author talks about a 5900X as a 24 core machine and discusses it as if there are 24 cores, 12 of which are piggybacking on the other 12. In reality, there are 24 hyperthreads that are pretty much pairwise symmetric and execute on top of 12 cores, with two sets of instruction pipelines sharing the same underlying functional units.

saghm 8mo ago

Years ago, when trying to explain hyper threading to my brother, who doesn't have any specialized technical knowledge, he came up with the analogy that it's like 2-ply toilet paper. You don't quite have 24 distinct things, but you have 12 that are roughly twice as useful as the individual ones, although you can't really separate them and expect them to work right.

nayuki 8mo ago

Nah, it's easier than that. Putting two chefs in the same kitchen doesn't let you cook twice the amount of food in the same amount of time, because sometimes the two chefs need to use the same resource at the same time - e.g. sink, counter space, oven. But, the additional chef does improve the utilization of the kitchen equipment, leaving fewer things unused.

BobbyTables2 8mo ago

That’s perfect!

Especially when it comes to those advertisements “6 large rolls == 18 normal rolls”.

Sure it might be thicker but nobody wipes their butt with 1/3 a square…

skeezyboy 8mo ago

> he came up with the analogy that it's like 2-ply toilet paper.

as in youd only use it to wipe excrement from around your sphincter

BrendanLong OP 8mo ago

Thanks for the feedback. I think you're right, so I changed a bunch of references and updated the description of the processor to 12 core / 24 thread. In some cases, I still think "cores" is the right terminology though, since my OS (confusingly) reports utilization as-if I had 24 cores.

sroussey 8mo ago

Eh, what’s a thread really? It’s a term for us humans.

The difference between two threads and one core or two cores with shared resources?

Nothing is really all that neat and clean.

It's more of a 2 level NUMA type architecture with 2 sets of 6 SMP sets of 2.

The scheduler may look at it that way (depending), but to the end user? Or even to most of the system? Nah.

sroussey 8mo ago

Will be interesting when (if?) Intel ships software defined cores which are the logical inverse of hyper threading.

Instead of having a big core with two instruction pipelines sharing big ALUs etc, they have two (or more) cores that combine resources and become one core.

Almost the same, yet quite different.

https://patents.google.com/patent/EP4579444A1/en

tgma 8mo ago

There was the dreaded AMD FX chip which was advertised as 8 core, but shared functional units. Got sued, etc.

Neil44 8mo ago

If both SMT cores are being asked to do the same workload they will likely contend for the same resource and execution units internally so the boost from SMT will be less. If they have different workloads the boost will be more. Now throw in P and E cores on newer CPU's, turbo and non-turbo, everything gets very complicated. I did see a study that adding SMT got a much better performance per watt boost than adding turbo which was interesting/useful.

bboreham 8mo ago

Worth noting that the major clouds will sell this as 24 "vcpus".

dragontamer 8mo ago

There's many ways CPU utilization fails to work as expected.

I didn't expect an article in this style. I was expecting the normal "Linux/Windows shows high utilization, but wtf, it's all RAM bottlenecked and the CPU is actually quiet and possibly downclocking" thing.

CPU Utilization is only how many cores are given threads to run by the OS (be it Windows or Linux). Those threads could be 100% blocked on memcpy but that's still CPU utilization.

-------

Hyperthreads help: if one thread is truly CPU bound (or even more specifically: AVX / Vector unit bound), while a 2nd thread is hyperthreaded together that's memcpy / RAM bound, you'll magically get more performance due to higher utilization of resources. (Load/store units are separate from AVX compute units).

In any case, this is a perennial subject with always new discoveries about how CPU Utilization is far less intuitive than many think. Still kinda fun to learn about new perspectives on this matter in any case.

freehorse 8mo ago

Author discovers that performance does not scale proportionally to %CPU utilisation, and gets instead to the conclusion that %CPU utilisation is a lie.

There are many reasons for the lack of a proportional relationship, even in the case where you do not have hyperthreading or downclocking (in which cases you just need to interpret %CPU utilisation in that context, rather than declare it "a lie"). Even in apple silicon where these are usually not an issue, you often do not get an exactly proportional scaling. There may be overheads when utilising multiple cores wrt how data is passed around, or resource bottlenecks other than CPU.

saagarjha 8mo ago

Apple silicon downclocks quite a lot especially if you have a passively cooled machine

freehorse 8mo ago

With the exception of the MacBook Air, which has passive cooling, there is nothing as aggressive as "turbo" modes, and ime it is relatively hard to reach thermal limits with CPU load alone on the devices I have used. Most other manufacturers nowadays officially advertise boosted single-core clock speeds that are much higher, and that drop when more cores are used at the same time. Thermal limits, in contrast, are much more circumstantial.

judge123 8mo ago

This hits so close to home. I once tried to explain to a manager that a server at 60% utilization had zero room left, and they looked at me like I had two heads. I wish I had this article back then!

hinkley 8mo ago

You also want to hit him with queueing theory.

Up to a hair over 60% utilization the queuing delays on any work queue remain essentially negligible. At 70 they become noticeable, and at 80% they've doubled. And then it just turns into a shitshow from there on.

The rule of thumb is 60% is zero, and 80% is the inflection point where delays go exponential.

The biggest cluster I ran, we hit about 65% CPU at our target P95 time, which is pretty much right on the theoretical mark.
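
For anyone who wants to check such rules of thumb against a model, here is a sketch of mean queueing delay in an M/M/c system (Poisson arrivals, exponential service times, `servers` identical workers) using the Erlang C formula. The model choice and function names are assumptions made for illustration; the thresholds quoted above come from operational experience, not from this code.

```python
def erlang_c(servers: int, rho: float) -> float:
    """Probability that an arrival has to queue in an M/M/c system."""
    if servers < 1 or not 0.0 <= rho < 1.0:
        raise ValueError("need servers >= 1 and 0 <= rho < 1")
    offered = servers * rho  # offered load in Erlangs
    blocking = 1.0           # Erlang B, built up with the standard recursion
    for k in range(1, servers + 1):
        blocking = offered * blocking / (k + offered * blocking)
    return blocking / (1.0 - rho * (1.0 - blocking))


def mean_queue_wait(servers: int, rho: float) -> float:
    """Mean time spent waiting in queue, in units of the mean service time."""
    return erlang_c(servers, rho) / (servers * (1.0 - rho))


if __name__ == "__main__":
    for servers in (1, 8, 64):
        row = "  ".join(f"{rho:.0%}: {mean_queue_wait(servers, rho):7.4f}"
                        for rho in (0.6, 0.7, 0.8, 0.9))
        print(f"c={servers:3d}  {row}")
```

With a single server the wait is already 1.5 service times at 60% and 4 service times at 80%; with many workers sharing one queue, the model keeps delays small up to higher utilization, so where the knee sits depends heavily on how many servers share a queue.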

BrendanLong OP 8mo ago

A big part of this is that CPU utilization metrics are frequently averaged over a long period of time (like a minute), but if your SLO is 100 ms, what you care about is whether there's any ~100 ms period where CPU utilization is at 100%. Measuring p99 (or even p100) CPU utilization can make this a lot more visible.
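
A toy illustration of the averaging problem, using two made-up traces sampled every 100 ms: both have the same one-minute average, but only the short-window view shows that one of them is repeatedly pinned at 100%.

```python
def window_averages(busy: list[float], window: int) -> list[float]:
    """Average utilization over consecutive non-overlapping windows of samples."""
    return [sum(busy[i:i + window]) / window
            for i in range(0, len(busy) - window + 1, window)]


# Hypothetical 60 s traces, one sample per 100 ms (600 samples each):
# "smooth" runs at a steady 50%; "bursty" is 100% busy for 1 s, then idle for 1 s.
smooth = [0.5] * 600
bursty = ([1.0] * 10 + [0.0] * 10) * 30

minute_avg_smooth = window_averages(smooth, 600)[0]
minute_avg_bursty = window_averages(bursty, 600)[0]
peak_100ms_smooth = max(window_averages(smooth, 1))
peak_100ms_bursty = max(window_averages(bursty, 1))

print(f"1-minute average: smooth={minute_avg_smooth:.0%} bursty={minute_avg_bursty:.0%}")
print(f"max over 100 ms windows: smooth={peak_100ms_smooth:.0%} bursty={peak_100ms_bursty:.0%}")
```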

Ambroisie 8mo ago

Do you have a link to a more in-depth analysis of the queuing theory for these numbers?

PunchyHamster 8mo ago

that entirely depends on workload. especially now when average server CPUs start at 32 cores

0xbadcafebee 8mo ago

The benchmark is basically application performance testing, which is the most accurate representation you can get. Test the specific app(s) your server is running, with real-world data/scenarios, and keep cranking up the requests, until the server falls over. Nothing else will give you as accurate an indication of your server's actual maximum performance with that app. Do that for every variable that's relevant (# requests/s, payload size, # parameters, etc), so you have multiple real-world maximum-performance indicators to configure your observability monitors for.

One way to get closer to reliable performance is to apply cpu scheduler limits to what runs your applications to keep them below a given threshold. This way you can better ensure you can sustain a given amount of performance. You don't want to run at 100% cpu for long, especially if disk i/o becomes hampered, system load skyrockets, and availability starts to plummet. Two thousand servers with 5000ms ping times due to system load is not a fun day at the office.

(And actually you'll never get a completely accurate view, as performance can change per-server. Rack two identical servers in two different racks, run the same app on each, and you may see different real-world performance. One rack may be hotter than the other, there could be hidden hardware or firmware differences, etc. Even within a server, if one CPU is just nearer a hotter component than on another server, for reasons)

CCs 8mo ago

Uses stress-ng for benchmarking, even though the stress-ng documentation says it is not suitable for benchmarking. It was written to max out one component until it burns. Using a real app, like Memcached or Postgres would show more realistic numbers, closer to what people use in production. The difference is not major, 50% utilization is closer to 80% in real load, but it breaks down faster. Stress-ng is nicely linear until 100%, memcached will have a hockey stick curve at the end.

BrendanLong OP 8mo ago

The advantage of stress-ng is that it's easy to make it run with specific CPU utilization numbers. The tests where I run some number of workers at 100% utilization are interesting since they give such perfect graphs, but I think the version where I have 24 workers and increase their utilization slowly is more realistic for showing how production CPU utilization changes.

BrendanLong OP 8mo ago

Fun data point though, I just ran three data points of the Phoronix nginx benchmark and got these results:

- Pinned to 6 cores: 28k QPS

- Pinned to 12 cores: 56k QPS

- All 24 cores: 62k QPS

I'm not sure how this applies to realistic workloads where you're using all of the cores but not maxing them out, but it looks like hyperthreading only adds ~10% performance in this case.
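
The arithmetic behind that estimate, as a short sketch. It assumes the 12-core run used one thread per physical core, which the comment implies but does not state outright.

```python
# QPS figures quoted in the comment above (pinned logical CPUs -> throughput).
results = {6: 28_000, 12: 56_000, 24: 62_000}

physical_scaling = results[12] / results[6]  # going from 6 to 12 physical cores
smt_gain = results[24] / results[12] - 1.0   # adding the 12 sibling threads

# If utilization is reported over 24 logical CPUs, a reading of 50% with one
# thread per core already corresponds to this share of peak throughput:
share_at_half = results[12] / results[24]

print(f"6 -> 12 cores: {physical_scaling:.2f}x")
print(f"SMT gain: {smt_gain:.1%}")
print(f"share of peak throughput at 50% reported utilization: {share_at_half:.1%}")
```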

kristopolous 8mo ago

Tried to explain this in a job interview 5 years ago. They thought I was a bullshitter

bionsystem 8mo ago

Happened to me on a different topic, felt bad for way too long ; in hindsight I'm pretty sure I dodged a bullet.

kristopolous 8mo ago

This was the same interview where some guy was asking me about "big-o" - like the thing that you teach 19 year olds and I was saying that parallelization matters, i/o matters, quantization matters, whether you can run it on the GPU, these all matter.

The simple "big-o" number doesn't account for whether you need to pass terabytes over the bus for every operation - and on actual computers moving around terabytes, I know, shockingly, this affects performance.

And if you have a dual epyc board with 1,024 threads, being able to parallelize a solution and design things for cache optimization, this isn't meaningless.

It's a weak classifier - if you really think I'm going to be doing a lexical sort in like O(n^3) like some kind of clown, I don't know what you're hiring here.

Found out later he scored me "2/5".

Alright, cool.

swiftcoder 8mo ago

I remember being stuck in a discussion with management one time, that went something like this: Manager: CPU utilisation is 100% under load! We have to migrate to bigger instances. Me: but is the CPU actually doing useful work?

(chat, it was not. busy waiting is CPU utilisation too)

kristianp 8mo ago

How do you measure the amount of busy waiting?

swiftcoder 8mo ago

I don't think there is a good general tool for this. In this specific case, I went spelunking for all the points where we had thread contention over resources, and discovered that for several resources quite a lot of CPU cycles were being expended to no use. The goal is really to eliminate the underlying resource contention - we added per-thread caches in various places, swapped out the logging system, and were able to ~double the system throughput during times when top showed the system to be "fully loaded".

ChaoPrayaWave 8mo ago

These days I treat CPU usage as just a hint, not a conclusion. I also look at response times, queue lengths, and try to figure out what the app is actually doing when it looks idle.

hinkley 8mo ago

How many times has hyperthreading been an actual performance benefit in processors? I cannot count how many times an article has come out saying you'll get better performance out of your <insert processor here> by turning off hyperthreading in the BIOS.

It's gotta be at least 2 out of every 3 chip generations going back to the original implementation, where you're better off without it than with.

loeg 8mo ago

HT provides a significant benefit to many workloads. The use cases that benefit from actually disabling HT are likely working around pessimal OS scheduler or application thread use. (After all, even with it enabled, you're free to not use the sibling cores.) Otherwise, it is an overgeneralization to say that disabling it will benefit arbitrary workloads.

hedora 8mo ago

There’s some argument that you should jam stuff on to as few hyperthread pairs as possible to improve energy efficiency and cache locality.

Of course, if the CPU governor is set to “performance” or “game mode”, then the OS should use as many pairs as possible instead (unless thermal throttling matters; computers are hard).

mkbosmans 8mo ago

Especially in HPC there are lots of workloads that do not benefit from SMT. Such workloads are almost always bottlenecked on either memory bandwidth or vector execution ports. These are exactly the resources that are shared between the sibling threads.

So now you have a choice of either disabling SMT in the bios, or make sure the application correctly interprets the CPU topology and only spawns one thread per physical core. The former is often the easier option, both from software development and system administration perspective.
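
For the second option, a sketch of how an application might count physical cores on Linux: each logical CPU exposes its SMT siblings in `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`, and the number of distinct sibling groups is the number of physical cores. The helper names are illustrative, and this ignores cgroup limits and CPU affinity masks.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> frozenset[int]:
    """Parse a Linux CPU list such as '0,12' or '0-3,8' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        elif part:
            cpus.add(int(part))
    return frozenset(cpus)


def physical_core_count(sibling_lists: list[str]) -> int:
    """Count distinct sibling groups, i.e. physical cores."""
    return len({parse_cpu_list(entry) for entry in sibling_lists})


def read_sibling_lists(root: str = "/sys/devices/system/cpu") -> list[str]:
    """Read every logical CPU's sibling list from sysfs (Linux only)."""
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    return [path.read_text() for path in Path(root).glob(pattern)]


if __name__ == "__main__":
    lists = read_sibling_lists()
    if lists:
        print(f"physical cores: {physical_core_count(lists)}")
```

An application could then size its thread pool from `physical_core_count(...)` instead of the logical CPU count.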

robocat 8mo ago

> use cases that benefit from actually disabling HT

Other benefits: per-CPU software licencing sometimes, and security on servers that share CPU with multiple clients.

twoodfin 8mo ago

For whatever it’s worth, operational database systems (many users/connections, unpredictable access patterns) are beneficiaries of modern hyperthreading.

I’m familiar with one such system where the throughput benefit is ~15%, which is a big deal for a BIOS flag.

IBM’s POWER would have been discontinued a decade ago were it not for transactional database systems, and that architecture is heavily invested in SMT, up to 8-way(!)

jiggawatts 8mo ago

I've noticed an overreliance on throughput as measured during 100% load as the performance metric, which has resulted in hardware vendors "optimising to the test" at the expense of other, arguably more important metrics. For example: single-user latency when the server is just 50% loaded.

tom_ 8mo ago

Why do they need so many threads? This really feels like they just designed the cpu poorly, in that it can't extract enough parallelism out of the instruction stream already.

(Intel and AMD stopped at 2! Apparently more wasn't worth it for them. Presumably because the cpu was doing enough of the right thing already.)

TristanBall 8mo ago

I suspect part of it is licensing games, both in the sense of "avoiding per core license limits" which absolutely matters when your DB is costing a million bucks, and also in the 'enable the highest PVU score per chassis' for ibm's own license farming.

Power systems tend not to be under the same budget constraints as Intel, whether that's money, power, heat, whatever, so the cost benefit of adding more sub-core processing for incremental gains is likely different too.

I may have a raft of issues with IBM, and aix, but those Power chips are top notch.

BrendanLong OP 8mo ago

To be fair, in most of these tests hyperthreading did provide a significant benefit (in the general CPU stress test, the hyperthreads increased performance by ~66%). It's just confusing that utilization metrics treat hyperthread usage the same as full physical cores.

bee_rider 8mo ago

Those weird Xeon Phi accelerators had 4 threads per core, and IIRC needed at least 2 running to get full performance. They were sort of niche, though.

I guess in general parallelism inside a core will either be extracted by the computer automatically with instruction-level-parallelism, or the programmer can tell it about independent tasks, using hyperthreads. So the hyperthread implementations are optimistic about how much programmers care about performance, haha.

mkbosmans 8mo ago

Sort of niche indeed.

In addition to needing SMT to get full performance, there were a lot of other small details you needed to get right on Xeon Phi to get close to the advertised performance. Think of AVX512 and the HBM.

For practical applications, it never really delivered.

tgma 8mo ago

It has a lot to do with your workload as well as, if not more so than, the chip architecture.

The primary trade-off is the cache utilization when executing two sets of instruction streams.

hinkley 8mo ago

That's likely the primary factor, but then there's thermal throttling as well. You can't run all of the logic units flat out on a bunch of models of CPU.

duped8mo ago

For me today it's definitely a pessimation because I have enough well-meaning applications that spawn `nproc` worker threads. Which would be fine if they're the only process running, but they're not.

hinkley8mo ago

I wrote a little tool for our services that could evaluate a basic expression of nproc, taken from an environment variable, at startup time.

You could do one thread for every two cores, three threads for every 2 cores, one thread per core ± 1, or both (2n + 1).

Unfortunately the sweet spot based on our memory usage always came out to 1:1, except for a while when we had a memory leak that was surprisingly hard to fix, and we ran n - 1 for about 4 months while a bunch of work and exploratory testing were done. We had to tune in other places to maximize throughput.
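A minimal sketch of that kind of startup-time sizing, for illustration only. The thread does not describe the original tool's syntax, so the policy names and the `WORKER_POLICY` environment variable here are hypothetical. Note that `os.cpu_count()` reports logical CPUs, so hyperthreads are counted as cores.

```python
import os

# Hypothetical policy names; each maps the CPU count n to a worker count.
POLICIES = {
    "n/2": lambda n: n // 2,          # one thread for every two cores
    "3n/2": lambda n: (3 * n) // 2,   # three threads for every two cores
    "n-1": lambda n: n - 1,
    "n": lambda n: n,
    "n+1": lambda n: n + 1,
    "2n+1": lambda n: 2 * n + 1,
}


def worker_count(policy: str, ncpu: int) -> int:
    """Return the worker thread count for a named policy, never below 1."""
    try:
        formula = POLICIES[policy]
    except KeyError:
        raise ValueError(f"unknown policy {policy!r}") from None
    return max(1, formula(ncpu))


if __name__ == "__main__":
    # WORKER_POLICY is an assumed variable name, not from the original tool.
    policy = os.environ.get("WORKER_POLICY", "n")
    print(worker_count(policy, os.cpu_count() or 1))
```

A fixed table of policies avoids evaluating arbitrary expressions from the environment, at the cost of flexibility.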

toast08mo ago

Wouldn't that be about the same badness without hyperthreads? If you're oversubscribed, there might be some benefit to having fewer tasks, but maybe you get some good throughput with two different applications' threads running on opposite hyperthreads.

esseph8mo ago

Intel vs AMD, you'll get a different answer on the hyperthreading question.

https://www.tomshardware.com/pc-components/cpus/zen-4-smt-fo...

toast08mo ago

Going from 1 core to 2 hyperthreads was a big bonus in interactivity. But I think it was easy to get early systems to show worse throughput.

I think there are two kinds of loads where hyperthreads are more likely to hurt than help. If you've got a tight loop that uses all the processor execution resources, you're not gaining anything by splitting that in two, it just makes things harder. Or if your load is mostly bound by memory bandwidth without a lot of compute... having more threads probably means you're that much more oversubscribed on I/O and caching.

But a lot of loads are grab some stuff from memory and then do some compute, rinse and repeat. There's a lot of potential for idle time while waiting on a load, being able to run something else during that time makes a lot of sense.

It's worth checking how your load performs with hyperthreads off, but I think default on is probably the right choice.

sroussey8mo ago

Definitely measure both ways and decide.

For many years (still?) it was faster to run your database with hyper threading turned off and your app server with it turned on.

FpUser8mo ago

In the old days it made the difference between my multimedia, game-like application not working at all with hyperthreading off and working just fine with it on.

hinkley8mo ago

Yeah when it was one core versus 1.3 cores that's fair. But 3 core machines often did better (or at least more consistently run to run) with HT disabled.

tom_8mo ago

Total throughput has always seemed better with it switched on for me, even for stuff that isn't hyperthreading friendly. You get a free 10% at least.

Aissen8mo ago

Funny that it talks about matrixprod, which I think is not that relevant as a benchmark, unless you care about x87 performance specifically. I recently sent a pull request to try to address that in a generic manner: https://github.com/ColinIanKing/stress-ng/pull/561

Yet I'm still surprised by this benchmark. On both Zen2 and Zen4 in my tests (5900X from the article is Zen3), matrixprod still benefits from hyperthreading and scales a bit after all the physical cores are filled, unlike what the article results show.

All of this is tangential of course, as I'd tend to agree that CPU utilization% is just an imprecise metric and should only be used as a measure of "is something running".

bob10298mo ago

I think looking at power consumption is potentially a more interesting canary when using very high core count parts.

I've run some ML experiments on my 5950x and I can tell that the CPU utilization figure is entirely decoupled from physical reality by observing the amount of flicker induced in my office lighting by the PWM noise in the machine. There are some code paths that show 10% utilization across all cores but make the cicadas outside my office window stop buzzing because the semiconductors get so loud. Other code paths show all cores 100% maxed flatline and it's like the machine isn't even on.
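For a less acoustic reading of power draw, here is a rough sketch that turns two energy-counter samples into average watts. It assumes a Linux machine that exposes the powercap RAPL counters at `/sys/class/powercap/intel-rapl:0/` (availability varies by CPU and kernel, and reading `energy_uj` often requires root); the function names are made up for this example.

```python
import time

# Assumed location of the package-level RAPL counter; may differ or be absent.
RAPL_DIR = "/sys/class/powercap/intel-rapl:0"


def average_watts(e0_uj: int, e1_uj: int, seconds: float, max_range_uj: int) -> float:
    """Average power between two cumulative energy readings in microjoules."""
    delta = e1_uj - e0_uj
    if delta < 0:
        # The counter wrapped around between the two samples.
        delta += max_range_uj
    return delta / 1e6 / seconds


def read_int(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())


def sample_package_watts(interval: float = 1.0) -> float:
    """Sample the package counter twice and report average watts in between."""
    max_range = read_int(f"{RAPL_DIR}/max_energy_range_uj")
    e0 = read_int(f"{RAPL_DIR}/energy_uj")
    time.sleep(interval)
    e1 = read_int(f"{RAPL_DIR}/energy_uj")
    return average_watts(e0, e1, interval, max_range)
```

Comparing `sample_package_watts()` under two workloads that both report the same CPU% is one way to see the decoupling described above.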

N_Lens8mo ago

This has been my experience running production workloads as well. Anytime CPU% goes over 50-60% suddenly it'll spike to 100% rather quickly, and the app/service is unusable. Learned to scale earlier than first thought.

morning-coffee8mo ago

The lie is that hyper thread "cores" are equal to real "cores". Maybe this is what happens when an over 20-year old technology (hack) becomes ubiquitous and gets forgotten about? (We have to rediscover why our performance measurements don't seem to make sense?)

The other thing I think we have a hard time visualizing is that a processor is only ever either executing (100%) or waiting to execute (0%), and that happens over varying timescales... so trying to assign a % in between inherently means you're averaging over some arbitrary timescale...
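A small synthetic illustration of that averaging effect. The two traces below are made up: one alternates busy and idle every millisecond, the other is busy for 500 ms and then idle for 500 ms. Both report the same figure over a one-second window and very different figures over 100 ms windows.

```python
def utilization(busy_flags, window_ms):
    """Percent busy per window; busy_flags holds one 0/1 flag per millisecond."""
    return [
        100 * sum(busy_flags[i:i + window_ms]) / window_ms
        for i in range(0, len(busy_flags), window_ms)
    ]


# Synthetic one-second traces, not measurements.
smooth = [1, 0] * 500            # busy every other millisecond
bursty = [1] * 500 + [0] * 500   # one long burst, then idle

print(utilization(smooth, 1000))  # one window covering the whole second
print(utilization(bursty, 1000))
print(utilization(smooth, 100))   # ten 100 ms windows
print(utilization(bursty, 100))
```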

fennecfoxy8mo ago

I think it's more for cores, right? % util is just % of idle cycles across all logical cores as far as I know.

It wouldn't really make sense to include all parts of the CPU in the calculation.
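On Linux, that figure is typically derived from the cumulative tick counters on the `cpu` line of `/proc/stat`. A minimal sketch of the calculation, assuming the usual field order (user, nice, system, idle, iowait, irq, softirq, steal) and treating iowait as idle; the function names are made up for this example.

```python
def parse_cpu_line(line: str):
    """Return (idle_ticks, total_ticks) from an aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    # Fields 1..8: user nice system idle iowait irq softirq steal.
    # guest and guest_nice are skipped because they are already part of user/nice.
    values = [int(v) for v in fields[1:9]]
    idle = values[3] + values[4]
    return idle, sum(values)


def utilization_percent(before: str, after: str) -> float:
    """Percent of non-idle ticks between two samples of the same 'cpu' line."""
    idle0, total0 = parse_cpu_line(before)
    idle1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    busy = elapsed - (idle1 - idle0)
    return 100.0 * busy / elapsed
```

To use it, read the first line of `/proc/stat`, wait an interval, read it again, and pass both lines in. Nothing in the calculation knows whether a busy tick retired useful instructions or stalled on memory, or whether the logical CPU was a hyperthread.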

fuzzfactor8mo ago

Windows users try this:

Ctrl-Alt-Del then launch TaskManager.

In TaskManager, click the "Performance" tab and see the simple stats.

While on the Performance tab, then click the ellipsis (. . .) menu, so you can then open ResourceMonitor.

Then close TaskManager.

In ResourceMonitor, under the Overview tab, for the CPU click the column header for "Average CPU" so that the processes using the most CPU are shown top-down from most usage to least.

In Overview, for Disk click the Write (B/sec) column header, for Network click Send (B/sec), and for Memory click Commit (KB).

Then under the individual CPU, Memory, Disk, and Network tabs click on the similar column headers. Under any tab now you should be able to see the most prominent resource usages.

Notice how your CPU settles down after a while of idling.

Then click on the Disk tab to focus your attention on that one exclusively.

Let it sit for 5 or 10 minutes then check your CPU usage. See if it's been climbing gradually higher while you weren't looking.

tonymet8mo ago

I like his empirical approach to get to the root significance of the cpu %-age indicator. Software engineers and data analysts take discrete "data" measurements and statistics for granted.

"data" / "stats" are only a report, and that report is often incorrect.

rollcat8mo ago

I'm surprised nobody has mentioned OpenBSD yet.

They've been advocating against SMT for a long while, citing security risks and inconsistent performance gains. I don't know which HW/CPU bug in the long series of Rowhammer, Meltdown, Spectre, etc. prompted the action, but at some point they completely disabled SMT in the default installation.

The core idea by itself is fine: keep the ALUs busy. Maybe security-wise, the present trade-off is acceptable, if you can instruct the scheduler to put threads from the same security domain on the same physical core. (How to tell when two threads are not a threat to each other is left up as an exercise.)

saagarjha8mo ago

The security argument might make sense but OpenBSD is not really the place to take performance advice from

rollcat8mo ago

My original point stands, also per TFA - performance gains from SMT are questionable for certain workloads. Whether OpenBSD prioritises absolute performance is beside the point - they benchmark against their own goals, not someone else's achievements.

whizzter8mo ago

Do people even use or mention OpenBSD out of performance concerns? We all know they prioritize security.

gbin8mo ago

Yeah and those tests don't even trigger some memory or cache contention ...

smallstepforman8mo ago

Read the kernel code to see how CPU utilisation is calculated. In essence, count the threads scheduled to execute and divide by the number of cores. Any latency (e.g. waiting for memory) still counts as a busy core.

codedokode8mo ago

A worse lie is memory usage reporting; I think in every major OS it is understated and misreported. On Linux, I wanted to know who was using memory and tried adding up the PSS values for every process, but I never got back the total memory usage. On Windows/Mac, I'm judging by screenshots of their tools, which show unrealistically small values.

As for the article, the slowdown can be also caused by increased use of shared resources like caches, TLBs, branch predictors.
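The PSS experiment can be sketched roughly as below, assuming a Linux kernel that provides `/proc/<pid>/smaps_rollup` and enough privileges to read it for other users' processes; the function names are made up for this example. The sum is not expected to match total used memory, since page cache that no process has mapped and the kernel's own allocations are not attributed to any process.

```python
import glob
import re

# Matches the "Pss:" line only, not "Pss_Anon:", "Pss_File:" and friends.
PSS_RE = re.compile(r"^Pss:\s+(\d+)\s+kB", re.MULTILINE)


def pss_kb(smaps_text: str) -> int:
    """Sum every 'Pss:' entry in the text of a smaps or smaps_rollup file."""
    return sum(int(kb) for kb in PSS_RE.findall(smaps_text))


def total_pss_kb() -> int:
    """Sum PSS over all processes we are allowed to read."""
    total = 0
    for path in glob.glob("/proc/[0-9]*/smaps_rollup"):
        try:
            with open(path) as f:
                total += pss_kb(f.read())
        except OSError:
            # Process exited, or we lack permission; skip it.
            continue
    return total
```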

biggusdickus698mo ago

The memory usage is interesting: the different kinds of shared memory are obviously hard to visualize, and just two values per process don't say enough.

Most users actually want a list of "what can I kill to make the computer faster", i.e. they want an oracle (no pun) that knows how fast the computer will be if different processes are killed.

HPsquared8mo ago

GPU utilisation as reported in Task Manager also seems quite a big lie, it bears little relation to Watts / TDP.

aaa_20068mo ago

CPU utilization alone is misleading. Pair it with per core load average or runqueue length to see how threads are actually queuing. That view often reveals the real bottleneck, whether it is I/O, memory, or scheduling delays.
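A quick way to get the per-core figure on a Unix-like system, as a rough sketch. `os.getloadavg()` returns the 1, 5 and 15 minute load averages; on Linux the load average also counts tasks in uninterruptible sleep, so a high value can point at I/O rather than CPU. The helper name is made up for this example.

```python
import os


def load_per_core(load: float, ncpu: int) -> float:
    """Load average normalised by the number of logical CPUs."""
    return load / ncpu


if __name__ == "__main__":
    one_min, five_min, fifteen_min = os.getloadavg()
    ncpu = os.cpu_count() or 1
    # Values persistently above 1.0 mean tasks are queuing for a CPU (or for I/O on Linux).
    print(f"1m load per core: {load_per_core(one_min, ncpu):.2f}")
```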

steventhedev8mo ago

%cpu is misleading at best, and should largely be considered harmful.

System load is well defined, matches user expectations, and covers several edge cases (auditd going crazy, broken CPU timers, etc).

pama8mo ago

Wait until you encounter GPU utilization. You could have two codes listing 100% utilization and have well over 100x performance difference from each other. The name of these metrics creates natural assumptions that are just wrong. Luckily it is relatively easy to estimate the FLOP/s throughput for most GPU codes and then simply compare to the theoretical peak performance of the hardware.
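As a worked example of that estimate: a dense matrix multiply of an m×k matrix by a k×n matrix performs about 2·m·n·k floating-point operations, so dividing by the measured time and by the hardware's peak gives the fraction of peak actually achieved. The timing and peak numbers in the usage below are hypothetical placeholders, not measurements of any real device.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """Approximate FLOP count of an (m x k) @ (k x n) matrix multiply."""
    return 2 * m * n * k


def fraction_of_peak(flops: float, seconds: float, peak_flops_per_s: float) -> float:
    """Achieved throughput as a fraction of the theoretical peak."""
    return flops / seconds / peak_flops_per_s


if __name__ == "__main__":
    flops = matmul_flops(4096, 4096, 4096)
    # Hypothetical: the kernel took 50 ms on a device with a 10 TFLOP/s dense peak.
    print(f"{fraction_of_peak(flops, 0.050, 10e12):.1%} of peak")
```

Two kernels can both show 100% GPU utilization while landing at very different fractions here.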

spindump89308mo ago

Don't forget that theoretical peak performance is (probably) half the performance listed on the nvidia datasheet because they used the "with sparsity" numbers! I've seen this bite folks who miss the * on the figure or aren't used to reading those spec sheets.

BrendanLongOP8mo ago

Yeah, the obvious thing with processors is to do something similar:

(1) Measure MIPS with perf (2) Compare that to max MIPS for your processor

Unfortunately, MIPS is too vague since the amount of work done depends on the instruction, and there's no good way to measure max MIPS for most processors. (╯°□°)╯︵ ┻━┻
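For reference, the arithmetic is simple once you have the counters, for example from `perf stat -e instructions,cycles`. The sketch below uses made-up counter values; it also shows why MIPS alone is ambiguous, since the same instruction rate says nothing about how much work each instruction does.

```python
def mips(instructions: int, seconds: float) -> float:
    """Millions of instructions retired per second."""
    return instructions / seconds / 1e6


def ipc(instructions: int, cycles: int) -> float:
    """Instructions retired per clock cycle."""
    return instructions / cycles


if __name__ == "__main__":
    # Hypothetical counter values, as if copied from a perf stat run.
    instructions, cycles, elapsed = 3_000_000_000, 2_000_000_000, 1.5
    print(f"{mips(instructions, elapsed):.0f} MIPS, IPC {ipc(instructions, cycles):.2f}")
```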

saagarjha8mo ago

If your workload is compute bound, of course. Sometimes you want to look at bandwidth instead.

pama8mo ago

Of course. Lots of useful metrics exist to help tweak code performance without always needing to go into detailed profiler traces. GPU utilization is a particularly poor metric in helping much, except for making sure the code made it to the GPU somehow :-)

PathOfEclipse8mo ago

I think it was always a mistake to pretend hyperthreading doubles your core count. I always assumed it was just due to laziness; the operating system treats a hyperthreaded core as two "virtual cores" and schedules as two cores, so then every other piece of tooling sees double the number of actual cores. There's no good reason I know of that a CPU utilization tool shouldn't use real cores when calculating percentages. But, maybe that's hard to do given how the OS implements hyperthreading.
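On Linux at least, the information needed to count real cores is available: each logical CPU lists its hyperthread siblings in sysfs. A rough sketch, assuming the standard `/sys/devices/system/cpu/cpu*/topology/thread_siblings_list` files; the function names are made up for this example.

```python
import glob


def parse_cpu_list(text: str) -> frozenset:
    """Parse a sysfs CPU list such as '0,12' or '0-1' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return frozenset(cpus)


def count_physical_cores(sibling_lists) -> int:
    """Each distinct sibling set corresponds to one physical core."""
    return len({parse_cpu_list(s) for s in sibling_lists})


def physical_cores_from_sysfs() -> int:
    lists = []
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    for path in glob.glob(pattern):
        with open(path) as f:
            lists.append(f.read())
    return count_physical_cores(lists)
```

Dividing busy time by this count instead of the logical CPU count would make a fully loaded SMT machine read above 100%, which is arguably no less confusing.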

fluoridation8mo ago

>There's no good reason I know of that a CPU utilization tool shouldn't use real cores when calculating percentages

On AMD, threads may as well be cores. If you take a Ryzen and disable SMT, you're basically halving its parallelism, at least for some tasks. On Intel you're just turning off an extra 10-20%.

PathOfEclipse8mo ago

Can you provide some links for this? A quick web search turns this up near the top, from 2024:

https://www.techpowerup.com/review/amd-ryzen-9-9700x-perform...

The benchmarks show a 10% drop in "application" performance when SMT is disabled, but an overall 1-3% increase in performance for games.

From a hardware perspective, I can't imagine how it could be physically possible to double performance by enabling SMT.

throwmeaway2228mo ago

Yeah, this is what we all talked about when hyperthreading was first invented in the 2000 era.

1gn158mo ago

Love that this website is public domain. Thank you, Brendan!

bdhcuidbebe8mo ago

That's some strong words about not RTFM.

kunley8mo ago

tl;dr: guy vibecodes a thing to measure something he doesn't fully understand and then realizes his methodology is wrong. Ends up with a catchy "X is a lie" title, which itself can be considered a lie.

timzaman8mo ago

What's become of Hacker News that this is the #2 post? This is basic knowledge any programmer gets in their first few years.

therealdrag08mo ago

It’s a big industry with a wide range of knowledge levels.

saagarjha8mo ago

I encounter very few programmers who learn this.
