> Modern CPUs measure their temperature and clock down if they get too hot, don't they?
Yes. It's rather complex now and it involves the motherboard vendor's firmware. When (not if) they get that wrong CPUs burn up. You're going to need some expertise to analyze this.
That framing doesn't do him and the team justice. There is (or rather, was) a 3.5-hour story about NVIDIA GPUs illegally finding their way from the US to China, which got taken down by a malicious DMCA claim from Bloomberg. It is quite interesting to watch (it can be found on archive.org).
GN is one of the last pro-consumer outlets that keeps digging and shaking the tree the big companies are sitting on.
Not everywhere:
https://archive.org/details/the-nvidia-ai-gpu-black-market-i...
Despite this, the over-temperature protection of the CPUs should have kicked in and prevented any kind of damage like this.
Besides the system that continuously varies the clock frequency to keep the CPU within its current and power consumption limits, there is a second protection that temporarily stops the clock when a temperature threshold is exceeded. However, the internal temperature sensors of the CPUs are not accurate, so the over-temperature protection may begin to act only at a temperature that is already too high.
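The two protections can be pictured as a simple control loop. This is a toy model only, not AMD's actual algorithm; all names and thresholds are made up for illustration:

```python
# Toy model of the two CPU thermal protections described above:
# 1) continuous frequency scaling to stay inside the power limit,
# 2) a hard clock stop once a temperature threshold is exceeded.
# All numbers and names are illustrative, not AMD's real behavior.

POWER_LIMIT_W = 200.0   # sustained socket power limit (assumed)
TJ_MAX_C = 95.0         # hard over-temperature threshold (assumed)
F_MIN_GHZ, F_MAX_GHZ = 0.5, 5.7

def next_frequency(freq_ghz, power_w, temp_c):
    """Return the clock frequency for the next control interval."""
    if temp_c >= TJ_MAX_C:
        return 0.0  # protection #2: stop the clock temporarily
    if power_w > POWER_LIMIT_W:
        return max(F_MIN_GHZ, freq_ghz * 0.95)  # protection #1: back off
    return min(F_MAX_GHZ, freq_ghz * 1.02)      # headroom left: boost a bit

print(next_frequency(5.0, 230.0, 80.0))  # over power limit: backs off 5%
print(next_frequency(5.0, 150.0, 96.0))  # over Tjmax: clock stopped (0.0)
```

The comment's point maps onto the model directly: if the `temp_c` reading is lower than the real die temperature, protection #2 fires too late.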
So these failures appear to have been caused by a combination of factors: not using coolers appropriate for a 200 W CPU; AMD advertising a 200 W CPU as a 170 W CPU, fooling naive customers into believing that smaller coolers are acceptable; and either some kind of malfunction of the over-temperature protection in these CPUs, or a degradation problem that happens even within the nominal temperature range, but at its upper end.
Noctua's CPU compatibility page lists the NH-U9s as "medium turbo/overclocking headroom" for the 9950X [0]. I don't think it's fair to suggest their cooler choice is the problem here.
GN is unique in paying for silicon-level analysis of failures.
der8auer also contributes a lot to these stories.
I tend to wait for all 3 of their analyses, because each adds a different "hard-won" perspective.
This was Gordon's style, and Steve is continuing it. He has the courage to hit Bloomberg offices with a cameraman, so I don't think his words ring hollow.
We need that kind of in-your-face, no-punches-pulled reporting as a counterweight to the "measured professionals".
AMD is somewhat worse than Intel here, as their DDR5 memory bus is very "twitchy", making it hard to reach the highest DDR5 speeds, especially with multiple DIMMs per channel.
I got 2x32GB sticks of RAM with the plan to throw in another two sticks later. I had no idea that was now a bad plan. I wish manufacturers would have just put 2 DIMM slots on motherboards as a “warning.”
> We use a Noctua cooling solution for both systems. For the 1st system, we mounted the heat sink centred. For the 2nd system, we followed Noctua's advice of mounting things offset towards what they claim to be the hotter side of the CPU. Below is a picture of the 2nd system without the heat sink which shows that offset. Note the brackets and their pins, those pins are where the heat sink's pressure gets centred. Also note how the thermal paste has been squeezed away from that part, but is quite thick towards the left.
Probably there's less paste remaining on the south end of the CPU because that's where the mounting force is greatest.
If anything, there's too much paste remaining on the center/north end of the CPU. Paste exists simply to bridge the roughness of the two metal surfaces, too much paste is a bad sign.
My guess is that the MB was oriented vertically and that big heavy heat sink with the large lever arm pulled it away from the center and north side of the CPU.
IMO, the CPU is still responsible for managing its power usage to live a long life. The only effect of an imperfect thermal solution ought to be proportionally reduced performance.
TDP numbers are completely made up. They don’t correspond to watts of heat, or of anything at all! They’re just a marketing number. You can't use them to choose the right cooling system at all.
https://gamersnexus.net/guides/3525-amd-ryzen-tdp-explained-...
> The thermal solution bundled with the CPUs is not designed to handle the thermal output when all the cores are utilized 100%. For that kind of load, a different thermal solution is strongly recommended (paraphrased).
I never used the stock cooler bundled with the processor, but what kind of dark joke is this?
You don't even need to change the actual cooler since for AMD CPUs you can pretty much customize the TDP whatever way you want, and by default they run well above their efficiency curve. For example, my 7600X has a default TDP of 105W but I run it in Eco Mode (65W) with undervolt and I barely lose any performance. Even if I did no undervolt, running the CPU in Eco Mode is generally preferable since the performance loss is still negligible (~5%).
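Using the figures from the comment above (105 W default vs. 65 W Eco Mode, ~5% performance loss), the efficiency gain is easy to quantify. A back-of-the-envelope sketch; the numbers come from that comment, not from a benchmark:

```python
# Perf-per-watt comparison using the figures quoted above:
# a 7600X at its 105 W default vs. 65 W Eco Mode with ~5% perf loss.
default_tdp_w = 105.0
eco_tdp_w = 65.0
perf_loss = 0.05  # ~5% slower in Eco Mode (commenter's estimate)

perf_default = 1.0
perf_eco = perf_default * (1 - perf_loss)

ppw_default = perf_default / default_tdp_w
ppw_eco = perf_eco / eco_tdp_w

print(f"perf/W gain in Eco Mode: {ppw_eco / ppw_default:.2f}x")
```

Roughly a 1.5x efficiency improvement for a ~5% performance hit, which is why running above the efficiency curve by default looks so questionable.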
The Conroe Intel era was amazing for the time.
https://hwbusters.com/cpu/amd-ryzen-9-9950x-cpu-review-perfo...
Couldn't this count as false/misleading advertising, though?
But yeah, TDP means nothing. If you stick on plenty of cooling and run the right motherboard revision, your "TDP" can be whatever you want it to be until the thing melts.
For what, exactly? TDP stands for "thermal design power" - nothing in that means peak power or most power. It stopped being meaningful when CPUs learned to vary clock speeds and turbo boost - what is the thermal design target at that point, exactly? Sustained power virus load?
Inside Linden, about 90% of meetings were held in-world, so we constantly had the Second Life viewer up. About three months later our ThinkPads started failing. Apparently they thought people who would a) buy a ThinkPad and b) use it to play video games wouldn't be playing video games 12 hours per day (though as many have pointed out, does one "play" Second Life, especially if you're using it for work?).
After 3 months of use, the Second Life client had caused sufficient heating cycles so as to delaminate the PCB under the GPU.
I'm sort of proud of this. Our software was dangerous.
Here's a link to Philip's avatar, which he intentionally kept basic for quite some time: https://community.secondlife.com/forums/topic/517881-help-me...
And here's a shot of mine: https://www.flickr.com/photos/opensourceobscure/2476204733/i...
And about in-world meetings. One thing Second Life did VERY WELL was it was always clear who was speaking. If someone was talking, there were green arrows sort of exploding out of their avatar's head. You couldn't miss it. Web-Ex at the time was HORRIBLE in this regard. Teams, Google Meet and Zoom are a little better than Web-Ex, but when meeting in Second Life, you could adjust your camera to get a good view of everyone in the meeting which also helped out.
These big x86 CPUs in stock configuration can throttle down to speeds where they can function with entirely passive cooling, so even if the cooler was improperly mounted, they'd only throttle.
All that to say, if GMP is causing the CPU to fry itself, something went very wrong, and it is not user error or the room being too hot.
As in... what, AMD K6 / early Pentium 4 days was the last time I remember hearing about cpu cooler failing and frying a cpu?
I once worked on a piece of equipment that was running awfully slow. The CPU just wasn't budging from its base clock of 700 MHz. As I was removing the stock Intel cooler, I noticed it wasn't seated fully. Once I removed it, I saw a perfectly clean CPU with no residue; on the HSF, the original thermal paste was in pristine condition.
I remounted the HSF and it worked great. It ran 100% throttled for seven years before I touched it.
Built-in thermal sensing came later.
If it can, then the hardware is to blame.
The Asus Prime B650M motherboards they are using aren't exactly high end.
"According to new details from Tech Yes City, the problem stems from the amperage (current) supplied to the processor under AMD's PBO technology. Precision Boost Overdrive employs an algorithm that dynamically adjusts clock speeds for peak performance, based on factors like temperature, power, current, and workload. The issue is reportedly confined to ASRock's high-end and mid-range boards, as they were tuned far too aggressively for Ryzen 9000 CPUs."
https://www.tomshardware.com/pc-components/cpus/asrock-attri...
The 9950X's TDP (Thermal Design Power) is 170 W, its default socket power is 200 W [2], and with PBO (Precision Boost Overdrive) enabled it's been reported to hit 235 W [3].
[1] https://www.overclockersclub.com/reviews/noctua_nh_u9s_cpu_c...
[2] https://hwbusters.com/cpu/amd-ryzen-9-9950x-cpu-review-perfo...
[3] https://www.tomshardware.com/pc-components/cpus/amd-ryzen-9-...
Reviewers and sellers do, though. Here are a few more: [1][2][3][4]
The highest rating is from [1], which says
> You cannot access the TDP guide from here, but we will tell you that it displays 140W TDP; however, it also says you can overclock that to closer to 160W or 180W TDP overall.
AVADirect advertises it as good for 115W [4].
It also beggars belief that a single 92mm fan would suffice to cool a 9950X, when the best 120 and 140 mm air coolers just barely reach 240W [5]. The only Noctua in that review, the 140mm NH-D15S, gets to 233W.
> They say it's fine, with "medium turbo/overclocking headroom"
Hopefully just an innocent mistake...
[1] https://www.tweaktown.com/reviews/7038/noctua-nh-u9s-cpu-coo...
[2] https://www.hardwareslave.com/reviews/cooling/noctua-nh-u9s-...
[3] https://www.frostytech.com/articles/2781/index.html
[4] https://www.avadirect.com/NH-U9S-chromax-black-125mm-Height-...
[5] https://www.tomshardware.com/pc-components/air-cooling/therm...
It sounds like the user likely did the opposite of the "offset seating" of the heatsink that Noctua recommended.
[0] https://upload.wikimedia.org/wikipedia/commons/2/2d/Socket_A...
I feel like if this was heat related, the overall CPU temperature should still somewhat slowly creep up, thereby giving everything enough time for thermal throttling. But their discoloration sure looks like a thermal issue, so I wonder why the safety features of the CPU didn't catch this...
My best understanding of the AVX-512 "power license" debacle on Intel CPUs was that the processor was actually watching the instruction stream and computing heuristics to lower the core frequency before executing AVX-512 or dense-AVX2 instructions. I guessed they knew, or worried, that even a short large-vector stint would fry stuff...
Apparently voltage and thermal sensors have vastly improved, and the crazy swings in NVIDIA GPUs' clocks seem to agree with this :-)
It doesn't strike me as odd that running an extremely power-heavy load for months continuously on such configurations eventually failed.
You overclock as a teen so that as an adult you know to verify your CPU's voltage, clock speeds, and temperature at a minimum when you build your own system.
They made no mention of monitoring CPU temperature, ECC corrected/detected errors, or throttling. They then ran CPU benchmark loads on the system for several months.
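On Linux, the temperature half of that monitoring takes only a few lines via the standard hwmon sysfs interface (values are in millidegrees Celsius). A minimal sketch; the 90 C warning threshold is an arbitrary choice, not an AMD specification:

```python
# Minimal Linux temperature check via the hwmon sysfs ABI --
# the kind of monitoring the comment above says was missing.
import glob
import os

def read_temps(base="/sys/class/hwmon"):
    """Return {sensor_label: temp_celsius} for all hwmon temperature inputs."""
    temps = {}
    for path in glob.glob(os.path.join(base, "hwmon*", "temp*_input")):
        label_path = path.replace("_input", "_label")
        try:
            with open(path) as f:
                millideg = int(f.read().strip())
            if os.path.exists(label_path):
                with open(label_path) as f:
                    name = f.read().strip()
            else:
                name = os.path.basename(path)
            temps[name] = millideg / 1000.0
        except (OSError, ValueError):
            continue  # sensor vanished or returned garbage; skip it
    return temps

if __name__ == "__main__":
    for name, celsius in sorted(read_temps().items()):
        flag = "  <-- hot!" if celsius >= 90.0 else ""
        print(f"{name}: {celsius:.1f} C{flag}")
```

Logging this once a minute during a months-long burn-in would have shown sustained throttling temperatures long before any hardware died.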
"The so-called TDP of the Ryzen 9950X is 170W. The used heat sinks are specified to dissipate 165W, so that seems tight."
Yikes. You need a heatsink rated much higher. These CPUs were overheated for months.
At stock settings CPUs will boost depending on many factors. Once one of several different limits is hit the CPU will not boost as high, trying to find a steady state where it stays below Tjmax.
Note the following: the 9950x does not come with a stock cooler. AMD recommends water cooling for the 9950x. Transistor lifetime decreases exponentially with temperature.
I'd expect that a 9950x under sustained load paired with a "165W" cooler would not only not boost, but would throttle to below base clocks.
In the case of CPU cooling, I don't agree that relying on the CPU's thermal safety nets to continuously regulate the system and avoid damage is good practice. Additional cooling that ensures the CPU never reaches Tjmax also yields better performance, a tangible benefit.
Had the author monitored his systems, he would have observed high temperature and throttling. Yes, in 2025 it's arguable that a CPU's safety net should be reliable wrt temperature, even if you run with no heatsink at all. I also agree that TDP specifications are unclear.
But the bottom line is that you should pay the extra $100 or so to cool your CPU properly. It will be faster and more reliable.
Please take care of your equipment; do not take it for granted.
That's when I discovered the genuinely ancient term "power virus". Anyway, after talking to different people I dismissed this weird behavior and moved on.
Reading this makes me worry I actually burned the mobo in that testing.
IIRC the FFT step uses AVX, and on Zen 5 that'll be AVX-512. It should keep 100% of the required data in the L1 caches, so you're keeping the AVX units busy literally 100% of the time if things are working right. The rest of the core will be cold/inactive, so you're dumping an entire core's worth of power into a teeny tiny ALU, which is going to result in high temps. Most (all?) processors downclock under heavy AVX load, sometimes by as much as 1 GHz (compared to max boost), because a) the crazy high temperatures result in more instability at higher frequencies, and b) if the clocks were kept high, temperatures would get even higher.
I've heard some really wild noises coming out of my zen4 machine when I've had all cores loaded up with what is best described as "choppy" workloads where we are repeatedly doing something like a parallel.foreach into a single threaded hot path of equal or less duration as fast as possible. I've never had the machine survive this kind of workload for more than 48 hours without some kind of BSOD. I've not actually killed a cpu yet though.
Then you shouldn't trust the results of your work either, as that's indicative of a CPU that's producing incorrect results. I suggest lowering the frequency or even undervolting if necessary until you get a stable system.
...and yes, wildly fluctuating power consumption is even more challenging than steady-state high power, since the VRMs have to react precisely and not overshoot or undershoot, or even worse, hit a resonance point. LINPACK, one of the most demanding stress tests and benchmarks, is known for causing crashes on unstable systems not when it starts each round, but when it stops.
Randomly flipped genome bits could even be beneficial for escaping local minima and broken RNG in evolutionary algorithms. One bad evaluation won't throw the whole thing off. It's gotta be bad constantly.
1. Evaluate population of candidates in parallel
2. Perform ranking, mutation, crossover, and objective selection in serial
3. Go to 1.
I can very accurately control the frequency of the audible PWM noise by adjusting the population size.
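The three steps above can be sketched as a minimal loop. The objective function and operators here are placeholders, and `multiprocessing.Pool` stands in for whatever parallel evaluation the commenter uses; the point is the load pattern, a parallel burst followed by a serial phase, whose period the population size controls:

```python
# Minimal sketch of the evaluate -> rank/mutate -> repeat loop from
# the steps above. The alternation of a parallel evaluation burst and
# a serial phase produces the periodic load swings described in the
# comment; fitness and mutate are placeholder stand-ins.
import random
from multiprocessing import Pool

def fitness(candidate):
    # Placeholder objective: bring the genome's sum toward zero.
    return -abs(sum(candidate))

def mutate(candidate, rate=0.1):
    return [g + random.gauss(0, 1) if random.random() < rate else g
            for g in candidate]

def evolve(pop_size=64, genome_len=8, generations=20, seed=0):
    random.seed(seed)
    pop = [[random.gauss(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    with Pool() as pool:
        for _ in range(generations):
            scores = pool.map(fitness, pop)  # 1. evaluate in parallel
            ranked = [c for _, c in sorted(zip(scores, pop),
                                           key=lambda t: -t[0])]
            parents = ranked[:pop_size // 2]              # 2. serial ranking,
            pop = parents + [mutate(p) for p in parents]  #    selection, mutation
    return max(pop, key=fitness)                          # 3. loop back to 1.

if __name__ == "__main__":
    best = evolve()
    print(f"best fitness: {fitness(best):.4f}")
```

A larger `pop_size` lengthens step 1 relative to step 2, which is exactly the knob that would shift the frequency of the audible VRM/PWM noise.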
But doesn't the hardware "overclock" and "overvolt" automatically these days?
This reminds me of the Intel CPUs with similar problems a year ago, and AFAIK it was caused by excessive voltage: https://news.ycombinator.com/item?id=41039708
If it's done by the manufacturer it's within spec of course. As designed.
The overclock game was all about running stuff out of spec and getting more performance out of it than it was supposed to create.
They do, but the thermal sensors are spread out a bit. It could be that there's a sudden spot heating happening that's not noticed by one of the sensors in time.
Overall, there is a continuing challenge with CPU temperatures that requires much tighter tolerances in the thermal solution. The torque specs need to be followed, and it needs to be verified in manufacturing that they were met.
Having read all I can on the issue, it's largely been ignored by AMD.
If it's some kind of thermal-runaway issue, that would not surprise me.
I only realized this had happened because every time I upgraded the firmware, I always had to go back and set the XMP settings, one or two other things, and the CPU virtualization option.
Ironically, if these failures are due to excessive automatic overvolting, like what happened with Intel's CPUs a year ago, worse cooling would cause the CPU to hit thermal limits and slow down before harmful voltages are reached. Conversely, giving the CPU great cooling makes it think it can go faster (and with more voltage) since it's still not at the limit, and it ends up going too far and killing itself.
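That counterintuitive feedback is easy to illustrate with a toy boost model: a loop that steps voltage up while thermal headroom remains will settle at a higher voltage on a better cooler. Every number and the linear thermal model here are invented for illustration:

```python
# Toy illustration of the feedback described above: a boost algorithm
# that raises voltage while thermal headroom remains pushes a
# well-cooled chip to higher (potentially harmful) voltages than a
# poorly-cooled one. The thermal model and numbers are made up.
TJ_MAX_C = 95.0

def boost_voltage(ambient_c, cooler_c_per_w, power_w=200.0,
                  v_base=1.10, v_step=0.005):
    """Step voltage up until the modelled die temperature reaches Tjmax."""
    v = v_base
    # Crude model: die temp scales with cooler resistance and voltage.
    while ambient_c + power_w * cooler_c_per_w * (v / v_base) < TJ_MAX_C:
        v += v_step
    return round(v, 3)

good_cooler = boost_voltage(25.0, cooler_c_per_w=0.30)
weak_cooler = boost_voltage(25.0, cooler_c_per_w=0.34)
print(good_cooler, weak_cooler)  # good cooler ends at the higher voltage
```

The weaker cooler hits the thermal limit sooner, so the loop stops raising voltage earlier; the chip with the big cooler keeps stepping up.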
I have no idea why they do it this way. Everything they list must be multiplied by about 1.35.
For example, a 170W TDP CPU requires 230W of dissipation. Not 170W.
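The commenter's rule of thumb, expressed as a one-liner (the 1.35 factor is their estimate, not an AMD specification):

```python
# The commenter's rule of thumb: actual socket power ~= 1.35 x the
# advertised TDP. The factor comes from the comment above, not from AMD.
def required_cooling_w(tdp_w, factor=1.35):
    return tdp_w * factor

print(required_cooling_w(170))  # ~230 W for a "170 W TDP" part
```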
AMD fans don't give a damn about that though. "It's just fine" (tm).
A rule of thumb I use for cooling: you can rarely have too much. You should over-engineer that aspect of your systems, along with the power supply.
I have a 7950x, with a water block capable of sinking up to 300W. Under heavy load, I hear the radiator fans spinning up, and I see the cpu temp hover around 90-93 C. That is ok, though cooler would be better. My next build (this one is 2 years old) will also use a water block, but with a higher flow rate, and a better radiator system. I like silent systems, though I don't like the magic smoke being released from components.
Take the AlphaServer DS25. It has wires going from the power supply harness to the motherboard that are thick enough to jump-start a car. The traces on the motherboard are so thick that the light reflecting off them looks nothing like a modern motherboard. The two CPUs take 64 watts each.
Now we have AMD CPUs that can take 170 watts? That's high, but if that's what the motherboards are supposed to be able to deliver, then the pins, socket and pads should have no problem with that.
Where's AMD's testing? Have they learned nothing watching Intel (almost literally) melt down?
I am not involved in power VRM design for modern motherboards, but I can imagine they do some smart stuff like compensating for transport losses by increasing the voltage somewhat at the VRM so that the designed voltage still arrives at the CPU. Of course this will cause some heating in the motherboard, but it's probably easily controlled.
In the days of the Alpha that kind of thing would have been science fiction, so they had no alternative but to minimise losses. You can't use a static overvoltage, because when the load drops the voltage coming out will be too high (transport loss depends on current).
Also, in those days copper cost a fraction of what it costs now, so for any problem just adding "moar copper" was an easy solution, especially on server hardware like the Alpha with its big markup.
And server hardware is always overengineered of course. Precisely to prevent long-term load problems like this.
Also, take a look at a delidded 9950X: the two CPU chiplets are to one side, the I/O chiplet is in the middle, and the other side is a handful of passives. Offsetting the heatsink moves the center of the heatsink 7 mm towards the chiplets (the socket is 40 mm x 40 mm), but there's still plenty of heatsink over the top of the I/O chiplet.
This article has some decent pictures of delidded processors https://www.tomshardware.com/pc-components/overclocking/deli...
Everything is offset towards one side and the two CPU core clusters sit well towards the edge, so offset cooling makes sense regardless of usage.
> What is GMP?
> The GNU Multiple Precision Arithmetic Library
> GMP is a free library for arbitrary precision arithmetic, operating on signed integers, rational numbers, and floating-point numbers. There is no practical limit to the precision except the ones implied by the available memory in the machine GMP runs on. GMP has a rich set of functions, and the functions have a regular interface.
Many languages use it to implement long integers. Under the hood, they just call GMP.
IIUC the problem is related to the test suite, which is probably very handy if you ever want to fry an egg on top of your micro.
I just wanted to find out what GMP is.