> Modern CPUs measure their temperature and clock down if they get too hot, don't they?
Yes. It's rather complex now and it involves the motherboard vendor's firmware. When (not if) they get that wrong CPUs burn up. You're going to need some expertise to analyze this.
That framing doesn't do him and the team justice. There is (or rather, was) a 3.5-hour story about NVIDIA GPUs illegally finding their way from the US to China, which got taken down by a malicious DMCA claim from Bloomberg. It is quite interesting to watch (it can be found on archive.org).
GN is one of the last pro-consumer outlets that keeps digging and shaking the tree the big companies are sitting on.
Not everywhere:
https://archive.org/details/the-nvidia-ai-gpu-black-market-i...
Despite this, the over-temperature protection of the CPUs should have kicked in and prevented any kind of damage like this.
Besides the system that continuously varies the clock frequency to keep the CPU within its current and power consumption limits, there is a second protection that temporarily stops the clock when a temperature threshold is exceeded. However, the internal temperature sensors of the CPUs are not accurate, so the over-temperature protection may begin to act only at a temperature that is already too high.
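The two protections can be pictured as a simple control loop. This is a toy model only, not AMD's actual algorithm; all names and thresholds are made up for illustration:

```python
# Toy model of the two CPU thermal protections described above:
# 1) continuous frequency scaling to stay inside the power limit,
# 2) a hard clock stop once a temperature threshold is exceeded.
# All numbers and names are illustrative, not AMD's real behavior.

POWER_LIMIT_W = 200.0   # sustained socket power limit (assumed)
TJ_MAX_C = 95.0         # hard over-temperature threshold (assumed)
F_MIN_GHZ, F_MAX_GHZ = 0.5, 5.7

def next_frequency(freq_ghz, power_w, temp_c):
    """Return the clock frequency for the next control interval."""
    if temp_c >= TJ_MAX_C:
        return 0.0  # protection #2: stop the clock temporarily
    if power_w > POWER_LIMIT_W:
        return max(F_MIN_GHZ, freq_ghz * 0.95)  # protection #1: back off
    return min(F_MAX_GHZ, freq_ghz * 1.02)      # headroom left: boost a bit

print(next_frequency(5.0, 230.0, 80.0))  # over power limit: backs off 5%
print(next_frequency(5.0, 150.0, 96.0))  # over Tjmax: clock stopped (0.0)
```

The comment's point maps onto the model directly: if the `temp_c` reading is lower than the real die temperature, protection #2 fires too late.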
So these failures appear to have been caused by a combination of factors: not using coolers appropriate for a 200 W CPU; AMD advertising a 200 W CPU as a 170 W CPU, fooling naive customers into believing that smaller coolers are acceptable; and either some kind of malfunction of the over-temperature protection in these CPUs, or a degradation problem that happens even within the nominal temperature range, but at its upper end.
Noctua's CPU compatibility page lists the NH-U9s as "medium turbo/overclocking headroom" for the 9950X [0]. I don't think it's fair to suggest their cooler choice is the problem here.
GN is unique in paying for silicon-level analysis of failures.
der8auer also contributes a lot to these stories.
I tend to wait for all 3 of their analyses, because each adds a different "hard-won" perspective.
This was Gordon's style, and Steve is continuing it. He has the courage to hit Bloomberg offices with a cameraman, so I don't think his words ring hollow.
We need that kind of in-your-face, no-punches-pulled reporting as a counterweight to the "measured professionals".
AMD is somewhat worse than Intel here, as their DDR5 memory bus is very "twitchy", making it hard to reach the highest DDR5 speeds, especially with multiple DIMMs per channel.
I got 2x32GB sticks of RAM with the plan to throw in another two sticks later. I had no idea that was now a bad plan. I wish manufacturers would have just put 2 DIMM slots on motherboards as a “warning.”
> We use a Noctua cooling solution for both systems. For the 1st system, we mounted the heat sink centred. For the 2nd system, we followed Noctua's advice of mounting things offset towards what they claim to be the hotter side of the CPU. Below is a picture of the 2nd system without the heat sink which shows that offset. Note the brackets and their pins, those pins are where the heat sink's pressure gets centred. Also note how the thermal paste has been squeezed away from that part, but is quite thick towards the left.
Probably there's less paste remaining on the south end of the CPU because that's where the mounting force is greatest.
If anything, there's too much paste remaining on the center/north end of the CPU. Paste exists simply to bridge the roughness of the two metal surfaces, too much paste is a bad sign.
My guess is that the MB was oriented vertically and that big heavy heat sink with the large lever arm pulled it away from the center and north side of the CPU.
IMO, the CPU is still responsible for managing its power usage to live a long life. The only effect of an imperfect thermal solution ought to be proportionally reduced performance.
TDP numbers are completely made up. They don’t correspond to watts of heat, or of anything at all! They’re just a marketing number. You can't use them to choose the right cooling system at all.
https://gamersnexus.net/guides/3525-amd-ryzen-tdp-explained-...
> The thermal solution bundled with the CPUs is not designed to handle the thermal output when all the cores are utilized 100%. For that kind of load, a different thermal solution is strongly recommended (paraphrased).
I never used the stock cooler bundled with the processor, but what kind of dark joke is this?
You don't even need to change the actual cooler since for AMD CPUs you can pretty much customize the TDP whatever way you want, and by default they run well above their efficiency curve. For example, my 7600X has a default TDP of 105W but I run it in Eco Mode (65W) with undervolt and I barely lose any performance. Even if I did no undervolt, running the CPU in Eco Mode is generally preferable since the performance loss is still negligible (~5%).
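Using the figures from the comment above (105 W default vs. 65 W Eco Mode, ~5% performance loss), the efficiency gain is easy to quantify. A back-of-the-envelope sketch; the numbers come from that comment, not from a benchmark:

```python
# Perf-per-watt comparison using the figures quoted above:
# a 7600X at its 105 W default vs. 65 W Eco Mode with ~5% perf loss.
default_tdp_w = 105.0
eco_tdp_w = 65.0
perf_loss = 0.05  # ~5% slower in Eco Mode (commenter's estimate)

perf_default = 1.0
perf_eco = perf_default * (1 - perf_loss)

ppw_default = perf_default / default_tdp_w
ppw_eco = perf_eco / eco_tdp_w

print(f"perf/W gain in Eco Mode: {ppw_eco / ppw_default:.2f}x")
```

Roughly a 1.5x efficiency improvement for a ~5% performance hit, which is why running above the efficiency curve by default looks so questionable.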
The Conroe Intel era was amazing for the time.
https://hwbusters.com/cpu/amd-ryzen-9-9950x-cpu-review-perfo...
Couldn't this count as false/misleading advertising, though?
But yeah, TDP means nothing. If you stick on plenty of cooling and run the right motherboard revision, your "TDP" can be whatever you want it to be until the thing melts.
For what, exactly? TDP stands for "thermal design power" - nothing in that means peak power or most power. It stopped being meaningful when CPUs learned to vary clock speeds and turbo boost - what is the thermal design target at that point, exactly? Sustained power virus load?
Inside Linden, about 90% of meetings were held in-world, so we constantly had the Second Life viewer up. About three months later our ThinkPads started failing. Apparently they thought people who would a) buy a ThinkPad and b) use it to play video games wouldn't be playing video games 12 hours per day (though as many have pointed out, does one "play" Second Life, especially if you're using it for work?).
After 3 months of use, the Second Life client had caused sufficient heating cycles so as to delaminate the PCB under the GPU.
I'm sort of proud of this. Our software was dangerous.
Here's a link to Philip's avatar, which he intentionally kept basic for quite some time: https://community.secondlife.com/forums/topic/517881-help-me...
And here's a shot of mine: https://www.flickr.com/photos/opensourceobscure/2476204733/i...
And about in-world meetings. One thing Second Life did VERY WELL was it was always clear who was speaking. If someone was talking, there were green arrows sort of exploding out of their avatar's head. You couldn't miss it. Web-Ex at the time was HORRIBLE in this regard. Teams, Google Meet and Zoom are a little better than Web-Ex, but when meeting in Second Life, you could adjust your camera to get a good view of everyone in the meeting which also helped out.
These big x86 CPUs in stock configuration can throttle down to speeds where they can function with entirely passive cooling, so even if the cooler was improperly mounted, they'd only throttle.
All that to say, if GMP is causing the CPU to fry itself, something went very wrong, and it is not user error or the room being too hot.
As in... what, AMD K6 / early Pentium 4 days was the last time I remember hearing about cpu cooler failing and frying a cpu?
I once worked on a piece of equipment that was running awfully slow. The CPU just wasn't budging from its base clock of 700 MHz. As I was removing the stock Intel cooler, I noticed it wasn't seated fully. Once I removed it, I saw a perfectly clean CPU with no residue; on the HSF, the original thermal paste was in pristine condition.
I remounted the HSF and it worked great. It ran 100% throttled for seven years before I touched it.
Built-in thermal sensing came later.
If it can, then the hardware is to blame.
The Asus Prime B650M motherboards they are using aren't exactly high end.
"According to new details from Tech Yes City, the problem stems from the amperage (current) supplied to the processor under AMD's PBO technology. Precision Boost Overdrive employs an algorithm that dynamically adjusts clock speeds for peak performance, based on factors like temperature, power, current, and workload. The issue is reportedly confined to ASRock's high-end and mid-range boards, as they were tuned far too aggressively for Ryzen 9000 CPUs."
https://www.tomshardware.com/pc-components/cpus/asrock-attri...
The 9950X's TDP (Thermal Design Power) is 170 W, its default socket power is 200 W [2], and with PBO (Precision Boost Overdrive) enabled it's been reported to hit 235 W [3].
[1] https://www.overclockersclub.com/reviews/noctua_nh_u9s_cpu_c...
[2] https://hwbusters.com/cpu/amd-ryzen-9-9950x-cpu-review-perfo...
[3] https://www.tomshardware.com/pc-components/cpus/amd-ryzen-9-...
Reviewers and sellers do, though. Here are a few more: [1][2][3][4]
The highest rating is from [1], which says
> You cannot access the TDP guide from here, but we will tell you that it displays 140W TDP; however, it also says you can overclock that to closer to 160W or 180W TDP overall.
AVADirect advertises it as good for 115W [4].
It also beggars belief that a single 92mm fan would suffice to cool a 9950X, when the best 120 and 140 mm air coolers just barely reach 240W [5]. The only Noctua in that review, the 140mm NH-D15S, gets to 233W.
> They say it's fine, with "medium turbo/overclocking headroom"
Hopefully just an innocent mistake...
[1] https://www.tweaktown.com/reviews/7038/noctua-nh-u9s-cpu-coo...
[2] https://www.hardwareslave.com/reviews/cooling/noctua-nh-u9s-...
[3] https://www.frostytech.com/articles/2781/index.html
[4] https://www.avadirect.com/NH-U9S-chromax-black-125mm-Height-...
[5] https://www.tomshardware.com/pc-components/air-cooling/therm...
It sounds like the user likely did the opposite of the "offset seating" of the heatsink that Noctua recommended.
[0] https://upload.wikimedia.org/wikipedia/commons/2/2d/Socket_A...
I feel like if this was heat related, the overall CPU temperature should still somewhat slowly creep up, thereby giving everything enough time for thermal throttling. But their discoloration sure looks like a thermal issue, so I wonder why the safety features of the CPU didn't catch this...
My best understanding of the AVX-512 "power license" debacle on Intel CPUs was that the processor was actually watching the instruction stream and computing heuristics to lower the core frequency before executing AVX-512 or dense-AVX2 instructions. I guessed they knew, or worried, that even a short large-vector stint would fry stuff...
Apparently voltage and thermal sensors have vastly improved, and the crazy swings in NVIDIA GPUs' clocks seem to agree with this :-)
It doesn't strike me as odd that running an extremely power-heavy load for months continuously on such configurations eventually failed.
You overclock as a teen so that as an adult you know to verify your CPU's voltage, clock speeds, and temperature at a minimum when you build your own system.
They made no mention of monitoring CPU temperature, ECC corrected/detected errors, or throttling. They then ran CPU benchmark loads on the system for several months.
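On Linux, the temperature half of that monitoring takes only a few lines via the standard hwmon sysfs interface (values are in millidegrees Celsius). A minimal sketch; the 90 C warning threshold is an arbitrary choice, not an AMD specification:

```python
# Minimal Linux temperature check via the hwmon sysfs ABI --
# the kind of monitoring the comment above says was missing.
import glob
import os

def read_temps(base="/sys/class/hwmon"):
    """Return {sensor_label: temp_celsius} for all hwmon temperature inputs."""
    temps = {}
    for path in glob.glob(os.path.join(base, "hwmon*", "temp*_input")):
        label_path = path.replace("_input", "_label")
        try:
            with open(path) as f:
                millideg = int(f.read().strip())
            if os.path.exists(label_path):
                with open(label_path) as f:
                    name = f.read().strip()
            else:
                name = os.path.basename(path)
            temps[name] = millideg / 1000.0
        except (OSError, ValueError):
            continue  # sensor vanished or returned garbage; skip it
    return temps

if __name__ == "__main__":
    for name, celsius in sorted(read_temps().items()):
        flag = "  <-- hot!" if celsius >= 90.0 else ""
        print(f"{name}: {celsius:.1f} C{flag}")
```

Logging this once a minute during a months-long burn-in would have shown sustained throttling temperatures long before any hardware died.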
"The so-called TDP of the Ryzen 9950X is 170W. The used heat sinks are specified to dissipate 165W, so that seems tight."
Yikes. You need a heatsink rated much higher. These CPUs were overheated for months.
At stock settings CPUs will boost depending on many factors. Once one of several different limits is hit the CPU will not boost as high, trying to find a steady state where it stays below Tjmax.
Note the following: the 9950x does not come with a stock cooler. AMD recommends water cooling for the 9950x. Transistor lifetime decreases exponentially with temperature.
I'd expect that a 9950x under sustained load paired with a "165W" cooler would not only not boost, but would throttle to below base clocks.
In the case of CPU cooling, I don't agree that relying on the CPU's thermal safety nets to continuously regulate the system and avoid damage is good practice. Additional cooling that ensures the CPU never reaches Tjmax also yields better performance, a tangible benefit.
Had the author monitored his systems, he would have observed high temperature and throttling. Yes, in 2025 it's arguable that a CPU's safety net should be reliable wrt temperature, even if you run with no heatsink at all. I also agree that TDP specifications are unclear.
But the bottom line is that you should pay the extra $100 or so to cool your CPU properly. It will be faster and more reliable.
Please take care of your equipment; do not take it for granted.
That's when I discovered the genuinely ancient term "power virus". Anyway, after talking to different people I dismissed this weird behavior and moved on.
Reading this makes me worry I actually burned the mobo in that testing.
IIRC the FFT step uses AVX, and on Zen 5 that'll be AVX-512. It should keep 100% of the required data in the L1 caches, so you're keeping the AVX units busy literally 100% of the time if things are working right. The rest of the core will be cold/inactive, so you're dumping an entire core's worth of power into a teeny tiny ALU, which is going to result in high temps. Most (all?) processors downclock under heavy AVX load, sometimes by as much as 1 GHz (compared to max boost), because a) the crazy high temperatures result in more instability at higher frequencies, and b) if the clocks were kept high, temperatures would get even higher.
I've heard some really wild noises coming out of my zen4 machine when I've had all cores loaded up with what is best described as "choppy" workloads where we are repeatedly doing something like a parallel.foreach into a single threaded hot path of equal or less duration as fast as possible. I've never had the machine survive this kind of workload for more than 48 hours without some kind of BSOD. I've not actually killed a cpu yet though.
Then you shouldn't trust the results of your work either, as that's indicative of a CPU that's producing incorrect results. I suggest lowering the frequency or even undervolting if necessary until you get a stable system.
...and yes, wildly fluctuating power consumption is even more challenging than steady-state high power, since the VRMs have to react precisely and not overshoot or undershoot, or even worse, hit a resonance point. LINPACK, one of the most demanding stress tests and benchmarks, is known for causing crashes on unstable systems not when it starts each round, but when it stops.
Randomly flipped genome bits could even be beneficial for escaping local minima and broken RNG in evolutionary algorithms. One bad evaluation won't throw the whole thing off. It's gotta be bad constantly.
1. Evaluate population of candidates in parallel
2. Perform ranking, mutation, crossover, and objective selection in serial
3. Go to 1.
I can very accurately control the frequency of the audible PWM noise by adjusting the population size.
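The three steps above can be sketched as a minimal loop. The objective function and operators here are placeholders, and `multiprocessing.Pool` stands in for whatever parallel evaluation the commenter uses; the point is the load pattern, a parallel burst followed by a serial phase, whose period the population size controls:

```python
# Minimal sketch of the evaluate -> rank/mutate -> repeat loop from
# the steps above. The alternation of a parallel evaluation burst and
# a serial phase produces the periodic load swings described in the
# comment; fitness and mutate are placeholder stand-ins.
import random
from multiprocessing import Pool

def fitness(candidate):
    # Placeholder objective: bring the genome's sum toward zero.
    return -abs(sum(candidate))

def mutate(candidate, rate=0.1):
    return [g + random.gauss(0, 1) if random.random() < rate else g
            for g in candidate]

def evolve(pop_size=64, genome_len=8, generations=20, seed=0):
    random.seed(seed)
    pop = [[random.gauss(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    with Pool() as pool:
        for _ in range(generations):
            scores = pool.map(fitness, pop)  # 1. evaluate in parallel
            ranked = [c for _, c in sorted(zip(scores, pop),
                                           key=lambda t: -t[0])]
            parents = ranked[:pop_size // 2]              # 2. serial ranking,
            pop = parents + [mutate(p) for p in parents]  #    selection, mutation
    return max(pop, key=fitness)                          # 3. loop back to 1.

if __name__ == "__main__":
    best = evolve()
    print(f"best fitness: {fitness(best):.4f}")
```

A larger `pop_size` lengthens step 1 relative to step 2, which is exactly the knob that would shift the frequency of the audible VRM/PWM noise.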
But doesn't the hardware "overclock" and "overvolt" automatically these days?
This reminds me of the Intel CPUs with similar problems a year ago, and AFAIK it was caused by excessive voltage: https://news.ycombinator.com/item?id=41039708
If it's done by the manufacturer it's within spec of course. As designed.
The overclock game was all about running stuff out of spec and getting more performance out of it than it was supposed to create.
They do, but the thermal sensors are spread out a bit. It could be that there's a sudden spot heating happening that's not noticed by one of the sensors in time.
Overall, there is a continuing challenge with CPU temperatures that requires much tighter tolerances in the thermal solution. The torque specs need to be followed, and it needs to be verified in manufacturing that they were met.
Having read all I can on the issue, it's largely been ignored by AMD.
If it's some kind of thermal-runaway issue, that would not surprise me.
I only realized this had happened because every time I upgraded the firmware, I always had to go back and set the XMP settings, one or two other things, and the CPU virtualization option.
Ironically, if these failures are due to excessive automatic overvolting, like what happened with Intel's CPUs a year ago, worse cooling would cause the CPU to hit thermal limits and slow down before harmful voltages are reached. Conversely, giving the CPU great cooling makes it think it can go faster (and with more voltage) since it's still not at the limit, and it ends up going too far and killing itself.
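That counterintuitive feedback is easy to illustrate with a toy boost model: a loop that steps voltage up while thermal headroom remains will settle at a higher voltage on a better cooler. Every number and the linear thermal model here are invented for illustration:

```python
# Toy illustration of the feedback described above: a boost algorithm
# that raises voltage while thermal headroom remains pushes a
# well-cooled chip to higher (potentially harmful) voltages than a
# poorly-cooled one. The thermal model and numbers are made up.
TJ_MAX_C = 95.0

def boost_voltage(ambient_c, cooler_c_per_w, power_w=200.0,
                  v_base=1.10, v_step=0.005):
    """Step voltage up until the modelled die temperature reaches Tjmax."""
    v = v_base
    # Crude model: die temp scales with cooler resistance and voltage.
    while ambient_c + power_w * cooler_c_per_w * (v / v_base) < TJ_MAX_C:
        v += v_step
    return round(v, 3)

good_cooler = boost_voltage(25.0, cooler_c_per_w=0.30)
weak_cooler = boost_voltage(25.0, cooler_c_per_w=0.34)
print(good_cooler, weak_cooler)  # good cooler ends at the higher voltage
```

The weaker cooler hits the thermal limit sooner, so the loop stops raising voltage earlier; the chip with the big cooler keeps stepping up.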
I have no idea why they do it this way. Everything they list must be multiplied by about 1.35.
For example, a 170W TDP CPU requires 230W of dissipation. Not 170W.
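The commenter's rule of thumb, expressed as a one-liner (the 1.35 factor is their estimate, not an AMD specification):

```python
# The commenter's rule of thumb: actual socket power ~= 1.35 x the
# advertised TDP. The factor comes from the comment above, not from AMD.
def required_cooling_w(tdp_w, factor=1.35):
    return tdp_w * factor

print(required_cooling_w(170))  # ~230 W for a "170 W TDP" part
```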
AMD fans don't give a damn about that though. "It's just fine" (tm).
A rule of thumb I use for cooling: you can rarely have too much. You should over-engineer that aspect of your systems, along with the power supply.
I have a 7950x, with a water block capable of sinking up to 300W. Under heavy load, I hear the radiator fans spinning up, and I see the cpu temp hover around 90-93 C. That is ok, though cooler would be better. My next build (this one is 2 years old) will also use a water block, but with a higher flow rate, and a better radiator system. I like silent systems, though I don't like the magic smoke being released from components.
Take the AlphaServer DS25. It has wires going from the power supply harness to the motherboard that are thick enough to jump-start a car. The traces on the motherboard are so thick that the light reflecting off them looks nothing like a modern motherboard. The two CPUs take 64 watts each.
Now we have AMD CPUs that can take 170 watts? That's high, but if that's what the motherboards are supposed to be able to deliver, then the pins, socket and pads should have no problem with that.
Where's AMD's testing? Have they learned nothing watching Intel (almost literally) melt down?
I am not involved in power VRM design for modern motherboards, but I can imagine they do some smart stuff like compensating for transport losses by increasing the voltage somewhat at the VRM so that the designed voltage still arrives at the CPU. Of course this will cause some heating in the motherboard, but it's probably easily controlled.
In the days of the Alpha that kind of thing would have been science fiction, so they had no alternative but to minimise losses. You can't use a static overvoltage, because when the load drops the voltage coming out will be too high (transport loss depends on current).
Also, in those days copper cost a fraction of what it costs now, so for any problem just adding "moar copper" was an easy solution, especially on server hardware like the Alpha with its big markup.
And server hardware is always overengineered of course. Precisely to prevent long-term load problems like this.
Also, take a look at a delidded 9950X: the two CPU chiplets are to one side, the I/O chiplet is in the middle, and the other side is a handful of passives. Offsetting the heatsink moves the center of the heatsink 7 mm towards the chiplets (the socket is 40 mm x 40 mm), but there's still plenty of heatsink over the top of the I/O chiplet.
This article has some decent pictures of delidded processors https://www.tomshardware.com/pc-components/overclocking/deli...
Everything is offset towards one side and the two CPU core clusters sit well towards the edge, so offset cooling makes sense regardless of usage.
> What is GMP?
> The GNU Multiple Precision Arithmetic Library
> GMP is a free library for arbitrary precision arithmetic, operating on signed integers, rational numbers, and floating-point numbers. There is no practical limit to the precision except the ones implied by the available memory in the machine GMP runs on. GMP has a rich set of functions, and the functions have a regular interface.
Many languages use it to implement long integers. Under the hood, they just call GMP.
IIUC the problem is related to the test suite, which is probably very handy if you ever want to fry an egg on top of your micro.
I just wanted to find out what GMP is.