I've chased elusive but very annoying stability problems (some, of course, due to overclocking during my younger years, when it still had a tangible payoff) often enough on systems I had built that taking this one BIG potential cause out of the equation is worth the few dozen extra bucks I have to spend on ECC-capable gear many times over.
Trying to validate an ECC-less platform's stability is surprisingly hard, because memtest and friends just aren't very reliable at detecting the more subtle problems. Prime95, y-cruncher and Linpack (in increasing order of effectiveness) are better than specialized memory-testing software in my experience, but they are not perfect, either.
Most AMD CPUs (but not their APUs with potent iGPUs - there, you will have to buy the "PRO" variants) these days have full support for ECC UDIMMs. If your mainboard vendor also plays ball - annoyingly, only a minority of them enables ECC support in their firmware, so always check for that before buying! - there's not much that can prevent you from having that stability enhancement and reassuring peace of mind.
Quoth DJB (around the very start of this millennium): https://cr.yp.to/hardware/ecc.html :)
This is the annoying part.
That AMD permits ECC is a truly fantastic situation, but whether it's supported by the motherboard is often a coin flip, and worse: it's not advertised even when it is available.
I have an ASUS PRIME TRX40 PRO, and the tech specs say that it can run ECC and non-ECC memory, but not whether ECC will be available to the operating system - merely that the DIMMs will work.
It's much more hit and miss in reality than it should be, though this motherboard was a pricey one: one can't use price as a proxy for features.
EDAC MC0: Giving out device to module amd64_edac
is a pretty reliable indication that ECC is working. See my blog post about it (it made the top of HN): https://sunshowers.io/posts/am5-ryzen-7000-ecc-ram/
I would expect your particular motherboard to operate with proper SECDED-or-better ECC if you have capable, compatible DIMMs, enable ECC mode in the firmware, and boot an OS kernel that can make sense of it all.
I am writing this message on such an ASUS MB with a Ryzen CPU and working ECC memory. You must check that your OS is recent enough to recognize your Threadripper CPU, and that you have installed any software packages required for this (e.g. on Linux, "edac-utils" or a similarly named package).
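For a quick sanity check on Linux, the EDAC counters under sysfs tell you whether the kernel is actually seeing (and counting errors from) an ECC-capable memory controller. A minimal sketch - the paths are the standard Linux EDAC sysfs layout, and the script just prints a verdict per controller:

```python
#!/usr/bin/env python3
"""Check whether the Linux EDAC subsystem sees an ECC memory controller.

If /sys/devices/system/edac/mc is absent, either ECC is not active or the
edac driver (e.g. amd64_edac) isn't loaded.
"""
from pathlib import Path

EDAC = Path("/sys/devices/system/edac/mc")

def summarize(mc_name: str, ce: int, ue: int) -> str:
    """Turn raw EDAC error counters into a one-line verdict."""
    verdict = "OK" if ue == 0 else "UNCORRECTED ERRORS -- replace this DIMM"
    return f"{mc_name}: corrected={ce} uncorrected={ue} [{verdict}]"

def main() -> None:
    if not EDAC.is_dir():
        print("No EDAC memory controllers found -- ECC likely not active.")
        return
    for mc in sorted(EDAC.glob("mc*")):
        name = (mc / "mc_name").read_text().strip()
        ce = int((mc / "ce_count").read_text())
        ue = int((mc / "ue_count").read_text())
        print(summarize(name, ce, ue))

if __name__ == "__main__":
    main()
```

A nonzero corrected count is not alarming by itself; it's actually evidence that ECC is doing its job.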
Some businesses (and governments) try to unify their purchasing, but this seems to make things worse, with the purchasing department both not understanding the technology and being outwitted by vendors.
I've been building my own gaming and productivity rigs for 20 years and I don't think memory has ever been a problem. Maybe survivorship bias, but surely even budget parts aren't THIS bad.
Also: DDR5 comes with some misleading ECC marketing, because the memory standard has an error correction scheme built in (on-die only). Don't fall for it.
A computer with 64 GB of memory is 4 times more likely to encounter memory errors than one with 16 GB of memory.
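That scaling is just linearity in the number of bits: for any fixed per-bit error rate, 4x the capacity means 4x the expected errors. A toy calculation (the FIT rate below is a placeholder, not a measured value):

```python
# Expected error count is linear in the number of bits: for any fixed
# per-bit rate, 4x the capacity means 4x the errors. The FIT rate below
# is a placeholder, not a measured value.

def expected_errors_per_year(capacity_gb: float, fit_per_mbit: float) -> float:
    """FIT = failures per 10^9 hours, here taken per Mbit of DRAM."""
    mbits = capacity_gb * 1024 * 8   # GB -> Mbit
    hours = 24 * 365
    return mbits * fit_per_mbit * hours / 1e9

FIT = 1.0  # hypothetical per-Mbit FIT rate, only the ratio matters here
print(expected_errors_per_year(64, FIT) / expected_errors_per_year(16, FIT))  # 4.0
```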
When DIMMs are new, at the usual amounts of memory for desktops, you will see at most a few errors per year, sometimes only a single error over several years. With old DIMMs, some modules will start to have frequent errors (such modules presumably had borderline fabrication quality and have now become worn out, e.g. due to increased leakage leading to a lower amount of charge stored on the memory cell capacitors).
For such bad DIMMs, the frequency of errors will increase, and it may reach several errors per day, or even per hour.
For me, a very important advantage of ECC has been the ability to detect such bad memory modules (in computers that have been used for 5 years or more) and replace them before corrupting any precious data.
I also had a case with an HP laptop with ECC, where memory errors had become frequent after it was stored for a long time (more than a year) in a rather humid place, which might have caused some oxidation of the SODIMM socket contacts - removing the SODIMMs, scrubbing the sockets and reinserting them made the errors disappear.
94 2025-08-26 01:49:40 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci CECC, memory_channel=1,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x68aea758, cpuid=0x00a50f00, bank=0x00000012
95 2025-09-01 09:41:50 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci CECC, memory_channel=1,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x68b80667, cpuid=0x00a50f00, bank=0x00000012
(this is `sudo ras-mc-ctl --errors` output.) It's always the same address, and always a Corrected Error (obviously, otherwise my kernel would panic). However, operating my system's memory at this clock and latency boosts x265 encoding performance (just one of the benchmarks I picked when trying to figure out how to handle this particular tradeoff) by about 12%. That is an improvement I am willing to stomach the extra risk of effectively overclocking the memory module beyond its comfort zone for, given that I can fully mitigate it by virtue of properly working ECC.
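Since the telltale sign of a weak cell is one physical address recurring in the corrected-error log, a few lines of scripting can separate "one flaky bit" from "random scattered upsets". A sketch that pulls the `addr=` field out of `ras-mc-ctl --errors`-style records (the sample lines are abbreviated versions of the output above):

```python
import re
from collections import Counter

def repeated_addresses(log_lines):
    """Count corrected-error records per physical address. One address
    recurring is the classic signature of a single weak cell, as opposed
    to random (e.g. cosmic-ray) upsets scattered across memory."""
    addrs = Counter()
    for line in log_lines:
        m = re.search(r"addr=(0x[0-9a-fA-F]+)", line)
        if m and "Corrected error" in line:
            addrs[m.group(1)] += 1
    return addrs

# Abbreviated records in the shape of the ras-mc-ctl output above.
sample = [
    "94 ... error: Corrected error, no action required., ... addr=0x36e701dc0, ...",
    "95 ... error: Corrected error, no action required., ... addr=0x36e701dc0, ...",
]
print(repeated_addresses(sample).most_common(1))  # [('0x36e701dc0', 2)]
```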
This was running at like, 1866 or something. It's a pretty barebones 8th gen i3 with a beefier chipset, but ECC still came in clutch. I won't buy hardware for server purposes without it.
Edit: it's probably because I switched it to "energy efficiency mode" instead of "performance mode" because it would occasionally lock up in performance mode. Presumably with the same root cause.
Last winter I was helping someone put together a new gaming machine... it was so frustrating running into the fake ECC marketing for DDR5 that you mention. The motherboard situation - whether a board supports it or not, or whether a BIOS update added support, then removed it, then added it back - was also really sad. And even worse, IMO, is that you can't actually max out 4 slots on the top-tier mobos unless you're willing to accept a huge drop in RAM speed. That leads to ugly 48 GB sticks and limiting yourself to two of them... In the end we didn't go with ECC for that someone, but I was pretty disappointed about it. I'm hoping the next gen will be better; for my own setup running ZFS and such I'm not going to give up ECC.
Some vendors use Hamming codes with “holes” in them, and you need the CPU to also run ECC (or at least error detection) between RAM and the cache hierarchy.
Those things are optional in the spec, because we can’t have nice things.
I wish AMD would make ECC a properly advertised feature with clear motherboard support. At least DDR5 has some level of ECC.
That is mostly to assist manufacturers in selling marginal chips with a few bad bits scattered around. It's really a step backwards in reliability.
Does anyone maintain a list with de-facto ECC support of AMD chips and mainboards? That part-list site only shows official support IIRC, so it won't give you any results.
However, in the past there existed a few CPU and motherboard models that supported either kind of DIMM; today this has become completely impossible, as the mechanical and electrical differences between the types have increased.
In any case, today, like also 20 years ago, when searching for ECC DIMMs you must always search only the correct type, e.g. unbuffered ECC DIMMs for desktop CPUs.
In general, registered ECC DIMMs are easier to find, because wherever "server memory" is advertised, that is what is meant. For desktop ECC memory, you must be careful to see both "ECC" and "unbuffered" mentioned in the module description.
In my experience, it's generally unwise to push the platform you're on to the outermost of its spec'd limits. At work, we bought several 5950X-based Zen3 workstations with 128GB of 3200MT/s ECC UDIMM, and two of these boxes will only ever POST when you manually downclock memory to 3000MT/s. Past a certain point, it's silicon lottery deciding if you can make reality live up to the datasheets' promises.
edit: Looks like a lot of Asus motherboards work, and the thing to look for is "unbuffered" ECC. Kingston has some, I see 32GB module for $190 on Newegg.
I have followed his blog for years and hold him in high respect, so I am surprised he did that and expected stability at 100C, regardless of what Intel claims is okay.
Not to mention that you rapidly hit diminishing returns past 200W with current-gen Intel CPUs, although he mentions caring about idle power usage. Why go from 150W to 300W for a 20% performance increase?
Given the motherboard and RAM will also generate quite some heat, if the case fan profile was conservative (he does mention he likes low noise), could be the insides got quite toasty.
Back when I got my 2080 Ti, I had this issue when gaming. The internal temps would get so hot due to the blanket effect of the padding I couldn't touch the components after a gaming session. Had to significantly tweak my fan profiles. His CPU at peak would generate about the same amount of heat as my 2080 Ti + CPU I had then, and I had the non-Compact case with two case fans.
[1]: https://michael.stapelberg.ch/posts/2025-05-15-my-2025-high-...
I also have a Fractal Define case with anti-noise padding material and dust filters, but my temperatures are great and the computer is almost inaudible, even when compiling code for hours with -j $(nproc). And my fans and cooler are much cheaper than his.
That should of course be sound padding...
Intel specifies a max operating temperature of 105°C for the 285K [1]. Also modern CPUs aren't supposed to die when run with inadequate cooling, but instead clock down to stay within their thermal envelope.
[1]: https://www.intel.com/content/www/us/en/products/sku/241060/...
Because CPUs can get much hotter in specific spots at specific pins, no? Just because you're reading 100 doesn't mean there aren't spots that are way hotter.
My understanding is that modern Intel CPUs have a temp sensor per core + one at package level, but which one is being reported?
Anyway, OP's cooler should be able to cool down 250W CPUs below 100C. He must have done something wrong for this to not happen. That's my point -- the motherboard likely overclocked the CPU and he failed to properly cool it down or set a power limit (PL1/PL2). He could have easily avoided all this trouble.
And yeah, having Arrow Lake running at its defaults is just a waste of energy. Even halving your TDP just loses you roughly 15% performance in highly MT scenarios...
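As rough perf-per-watt arithmetic on that trade-off (the 250W default and the 15% loss are illustrative figures, not measurements):

```python
# Perf-per-watt for "halve the power limit, keep ~85% of MT performance".
# The 250 W default and the 15% loss are illustrative numbers.

def perf_per_watt(perf: float, watts: float) -> float:
    return perf / watts

stock = perf_per_watt(1.00, 250)
capped = perf_per_watt(0.85, 125)
print(round(capped / stock, 2))  # 1.7 -> ~70% more work per joule
```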
I did not overclock this CPU. I pay attention to what I change in the BIOS/UEFI firmware, and I never select any overclocking options.
Also, I have applied thermal paste properly: Noctua-supplied paste, following Noctua’s instructions for this CPU socket.
https://www.techpowerup.com/review/intel-core-ultra-9-285k/2... lists maximum temperature as 88.2C with the previous gen NH-D15 cooler.
When you do not have a bunch of components ready to swap out it is also really hard to debug these issues. Sometimes it’s something completely different like the PSU. After the last issues, I decided to buy a prebuilt (ThinkStation) with on-site service. The cooling is a bit worse, etc., but if issues come up, I don’t have to spend a lot of time debugging them.
Random other comment: when comparing CPUs, a sad observation was that even a passively cooled M4 is faster than a lot of desktop CPUs (typically single-threaded, sometimes also multi-threaded).
And if we are talking about a passively cooled M4 (MacBook Air basically) it will quite heavily throttle relatively quickly, you lose at the very least 30%.
So, let's not misrepresent things: Apple CPUs are very power efficient, but they are not magic - if you hit them hard, they still need good cooling. Plenty of people have had that experience with their M4 Max, discovering that if they actually use the laptop as a workstation, it will generate a good amount of fan noise; there's no way around it.
Apple stuff is good because most people actually have bursty workload (especially graphic design, video editing and some audio stuff) but if you hammer it for hours on end, it's not that good and the power efficiency point becomes a bit moot.
I think a lot of it boils down to load profile and power delivery. My 2500VA double conversion UPS seems to have difficulty keeping up with the volatility in load when running that console app. I can tell because its fans ramp up and my lights on the same circuit begin to flicker very perceptibly. It also creates audible PWM noise in the PC which is crazy to me because up til recently I've only ever heard that from a heavily loaded GPU.
For a long time, my Achilles' heel was my bride's vacuum. Her Dyson pulled enough amps that the UPS would start singing and trigger the auto-shutdown sequence for the half rack. Took way too long to figure out, as I was usually not around when she did it.
You said the right words but with the wrong meaning! On Gigabyte mobo you want to increase the "CPU Vcore Loadline Calibration" and the "PWM Phase Control" settings, [see screenshot here](https://forum.level1techs.com/t/ddr4-ram-load-line-calibrati...).
When I first got my Ryzen 3900X cpu and X570 mobo in 2019, I had many issues for a long time (freezes at idle, not waking from sleep, bios loops, etc). Eventually I found that bumping up those settings to ~High (maybe even Extreme) was what was required, and things worked for 2 years or so until I got a 5950X on clearance last year.
I slotted that in to the same mobo and it worked fine, but when I was looking at HWMon etc, I noticed some strange things with the power/voltage. After some mucking about and theorising with ChatGPT (it's way quicker than googling for uncommon problems), it became apparent that the ~High LLC/power settings I was still using were no good. ChatGPT explained that my 3900X was probably a bit "crude" in relative quality, and so it needed the "stronger" power settings to keep itself in order. Then when I've swapped to 5950X, it happens to be more "refined" and thus doesn't need to be "manhandled" — and in fact, didn't like being manhandled at all!
But if your UPS (or just the electrical outlet you're plugged into) can't cope - dunno if I'd describe that as cratering your CPU.
Yea, but unfortunately it comes attached to a Mac.
An issue I've encountered often with motherboards is that they have brain-damaged default settings that run CPUs out of spec. You really have to go through it all with a fine-toothed comb and make sure everything is set to conservative, stock, manufacturer-recommended settings. And my stupid MSI board resets everything (every single BIOS setting) to MSI defaults when you upgrade its BIOS.
It looks completely bonkers to me. I overclocked my system to ~95% of what it is able to do with almost default voltages, using bumps of 1-3% over stock, which (AFAIK) is within acceptable tolerances, but it requires hours and hours of tinkering and stability testing.
Most users just set automatic overclocking, have their motherboards push voltages to insane levels, and then act surprised when their CPUs start bugging out within a couple of years.
Shocking!
Yeah. If Asahi worked on newer Macs and Apple Silicon Macs supported eGPU (yes I know, big ifs), the choice would be simple. I had NixOS on my Mac Studio M1 Ultra for a while and it was pretty glorious.
I had the same issue with my MSI board; the next one won't be an MSI.
My modern CPU problems are DDR5 and the pre-boot memory training never completing. So a 9700X build that WAS supposed to be located remotely from me has to sit in my office and have its hand held through every reboot, because you never quite know when it's going to decide it needs to retrain and randomly never come back. That requires pulling the plug from the back, waiting a few minutes, powering back on, then waiting 30 minutes for 64 GB of DDR5 to do its training thing.
My system would randomly freeze for ~5 seconds, usually while gaming with a video running in the browser at the same time. Then it started happening reliably in Titanfall 2, and I noticed there were always AHCI errors in the Windows logs at the same time, so I switched to an NVMe drive.
The system would also shut down occasionally (~ once every few hours) in certain games only. Then, I managed to reproduce it 100% of the time by casting lightning magic in Oblivion Remastered. I had to switch out my PSU, the old one probably couldn't handle some transient load spike, even though it was a Seasonic Prime Ultra Titanium.
I have an M1 Max, a few revisions old, and the only thing I can do to spin up the fans is run local LLMs or play Minecraft with the kids on a giant ultra wide monitor at full frame rate. Giant Rust builds and similar will barely turn on the fan. Normal stuff like browsing and using apps doesn’t even get it warm.
I’ve read people here and there arguing that instruction sets don’t matter, that it’s all the same past the decoder anyway. I don’t buy it. The superior energy efficiency of ARM chips is so obvious I find it impossible to believe it’s not due to the ISA since not much else is that different and now they’re often made on the same TSMC fabs.
This isn't really true. On the same process node the difference is negligible. It's just that Intel's process in particular has efficiency problems and Apple buys out the early capacity for TSMC's new process nodes. Then when you compare e.g. the first chips to use 3nm to existing chips which are still using 4 or 5nm, the newer process has somewhat better efficiency. But even then the difference isn't very large.
And the processors made on the same node often make for inconvenient comparisons, e.g. the M4 uses TSMC N3E but the only x86 processor currently using that is Epyc. And then you're obviously not comparing like with like, but as a ballpark estimate, the M4 Pro has a TDP of ~3.2W/core whereas Epyc 9845 is ~2.4W/core. The M4 can mitigate this by having somewhat better performance per core but this is nothing like an unambiguous victory for Apple; it's basically a tie.
> I have an M1 Max, a few revisions old, and the only thing I can do to spin up the fans is run local LLMs or play Minecraft with the kids on a giant ultra wide monitor at full frame rate. Giant Rust builds and similar will barely turn on the fan. Normal stuff like browsing and using apps doesn’t even get it warm.
One of the reasons for this is that Apple has always been willing to run components right up to their temperature spec before turning on the fan. And then even though that's technically in spec, it's right on the line, which is bad for longevity.
In consumer devices it usually doesn't matter because most people rarely put any real load on their machines anyway, but it's something to be aware of if you actually intend to, e.g. there used to be a Mac Mini Server product and then people would put significant load on them and then they would eat the internal hard drives because the fan controller was tuned for acoustics over operating temperature.
This anecdote perfectly describes my few-generations-old Intel laptop too. The fans turn on maybe once a month. I don't think it's as power efficient as an M-series Apple CPU, but total system power is definitely under 10W during normal usage (including screen, wifi, etc).
One of the many reasons the Snapdragon Windows laptops failed was that both AMD and Intel (Lunar Lake) were able to reach the claimed efficiency of those chips. I still think modern x86 can match ARM in efficiency if someone bothered to tune the OS and scheduler for the most common activities. The M series was based on Apple's phone chips, which were designed from the ground up to run on a battery all these years. AMD and Intel just don't see an incentive to do that, nor does Microsoft.
What metric ought I to use when buying a CPU these days? Should I care about reviews? I'm fine with a mid-range CPU, for what it's worth, and I thought of the AMD Ryzen 7 5700 or the Ryzen 5 5600GT, or anything with a similar price tag. They might even be lower-end by now?
Intel is just bad at the moment and not even worth touching.
Definitely not that one if you plan to pair it with a dedicated GPU! The 5700X has twice the L3 cache. All Ryzen 5000 parts with an iGPU have only 16MB; the 5700 just has the iGPU deactivated.
I also have this issue.
A common approach is to go into the BIOS/UEFI settings and check that c6 is disabled. To verify and/or temporarily turn c6 off, see https://github.com/r4m0n/ZenStates-Linux
I have always run B series because I've never needed the overclocking or additional peripherals. In my server builds I usually disable peripherals in the UEFI like Bluetooth and audio as well.
Twice the memory bandwidth, twice the CPU core count... It's really wacky how they've decided to name things
It is cheaper and more stable, and the performance difference doesn't matter that much anyway.
On desktop PCs, thermal throttling is often set up as "just a safety feature" to this very day. Which means: the system does NOT expect to stay at the edge of its thermal limit. I would not trust thermal throttling with keeping a system running safely at a continuous 100C on die.
100C is already a "danger zone", with elevated error rates and faster circuit degradation - and a die only has so many thermal sensors. Some under-sensored hotspots may be running a few degrees higher than that. Which may not be enough to kill the die outright - but more than enough to put those hotspots into a "fuck around" zone of increased instability and massively accelerated degradation.
If you're relying on thermal throttling to balance your system's performance, as laptops and smartphones often do, then you seriously need to dial in better temperature thresholds. 100C is way too spicy.
If nothing else, it very clearly indicates that you can boost your performance significantly by sorting out your cooling because your cpu will be stuck permanently emergency throttling.
Smartphones have no active cooling and are fully dependent on thermal throttling for survival, but they can start throttling at as low as 50C easily. Laptops with underspecced cooling systems generally try their best to avoid crossing into triple digits - a lot of them max out at 85C to 95C, even under extreme loads.
I had an 8th-gen i7 sitting at the thermal limit (~100C) in a laptop for half a decade 24/7 with no problem. As sibling comments have noted, modern CPUs are designed to run "flat-out against the governor".
Voltage-dependent electromigration is the biggest problem and what led to the failures in Intel CPUs not long ago, perhaps ironically caused by cooling that was "too good" --- the CPU finds that there's still plenty of thermal headroom, so it boosts frequency and the accompanying voltage to reach the limit, and goes too far with the voltage. If it had hit the thermal limit it would've backed off on the voltage and frequency.
No. High performance gaming laptops will routinely do this for hours on end for years.
If it can't take it, it shouldn't allow it.
Intel's basic 285K spec's - https://www.intel.com/content/www/us/en/products/sku/241060/... - say "Max Operating Temperature 105 °C".
So, yes - running the CPU that close to its maximum is really not asking for stability, nor longevity.
No reason to doubt your assertion about gaming laptops - but chip binning is a thing, and the manufacturers of those laptops have every reason to pay Intel a premium for CPUs which test to better values of X, Y, and Z.
I've never overclocked anything and I've never felt I've missed out in any way. I really can't imagine spending even one minute trying to squeeze 5% or whatnot tweaking voltages and dealing with plumbing and roaring fans. I want to use the machine, not hotrod it.
I would rather Intel et al. leave a few percent "on the table" and sell things that work, for years on end without failure and without a lot of care and feeding. Lately it looks like a crapshoot trying to identify components that don't kill themselves.
- cheap ULV chips like N100, N150, N300
- ultrabook ULV chips (I hope Lunar Lake is not a fluke)
- workstation chips that aren't too powerful (mainstream Core CPUs)
- inexpensive GPUs (a surprising niche, but excruciatingly small)
AMD has been dominating them in all other submarkets. Without a mainstream halo product Intel has been forced to compete on price, which is not something they can afford. They have to make a product that leapfrogs either AMD or Nvidia and successfully (and meaningfully) iterate on it. The last time they tried something like that was in 2021 with the launch of Alder Lake, but AMD overtook them with 3D V-Cache in 2022.
But I just can't bring myself to upgrade this year. I dabble in local AI, where it's clear fast memory is important, but the PC approach is just not keeping up without going to "workstation" or "server" parts that cost too much.
There are glimmers of hope with MR-DIMMs, CU-DIMMs, and other approaches, but really boards and CPUs need to support more memory channels. Intel has a small advantage over AMD, but it's nothing compared to the memory speed of a Mac Pro or higher. "Strix Halo" offers some hope with four-memory-channel support, but it's meant for notebooks so isn't really expandable (which would enable à la carte hybrid AI: fast GPUs with reasonably fast shared system RAM).
I wish I could fast forward to a better time, but it's likely fully integrated systems will dominate if the size and relatively weak performance for some tasks makes the parts industry pointless. It is a glaring deficiency in the x86 parts concept and will result in PC parts being more and more niche, exotic and inaccessible.
That being said, for AI, HEDT is the obvious answer. Back in the day, it was much more affordable with my 9980XE only costing $2,000.
I just built a Threadripper 9980 system with 192GB of RAM and good lord it was expensive. I will actually benefit from it though and the company paid for it.
That being said, there is a glaring gap between "consumer" hardware meant for gaming and "workstation" hardware meant for real performance.
Have you looked into a 9960 Threadripper build? The CPU isn't TOO expensive, although the memory will be. But you'll get a significantly faster and better machine than something like a 9950X.
I also think besides the new Threadripper chips, there isn't much new out this year anyways to warrant upgrading.
Competitors to NVidia really need to figure things out, even for gaming with AI being used more I think a high end APU would be compelling with fast shared memory.
It seems like large, unchallenged organizations like Intel (or NASA or Google) collect all the top talent out of school. But changing budgets, changing business objectives, frozen product strategies make it difficult for emerging talent to really work on next-generation technology (those projects have already been assigned to mid-career people who "paid their dues").
Then someone like Apple Silicon with M-chip or SpaceX with Falcon-9 comes along and poaches the people most likely to work "hardcore" (not optimizing for work/life balance) while also giving the new product a high degree of risk tolerance and autonomy. Within a few years, the smaller upstart organization has opened up in un-closeable performance gap with behemoth incumbent.
Has anyone written about this pattern (beyond Innovator's Dilemma)? Does anyone have other good examples of this?
I gather it's very difficult and expensive to make a board that supports more channels of RAM, so that seems worth targeting at the platform level. Eight channel RAM using common RAM DIMMs would transform PCs for many tasks, however for now gamers are a main force and they don't really care about memory speed.
How do you sell your systems when their time comes?
I use Arch, btw ;)
https://www.theregister.com/2025/08/29/amd_ryzen_twice_fails...
Sufficient cooler, with sufficient airflow is always needed.
The 13900k draws more than 200W initially and thermal throttles after a minute at most, even in an air conditioned room.
I don't think that thermal problems should be pushed to end user to this degree.
So if your CPU is drawing "more than 200W" you're pretty much at the limits of your cooler.
But I agree this should not be a problem in the first place.
For both the cooler and the motherboard, AMD have too much control to look the other way. The chip can measure its own temperature and the conceit of undermining partners by moving things on chip and controlling more of the ecosystem is that things perform better. They should at least perform.
I also find that, as tolerances get tighter throughout the system in pursuit of performance improvements, the set of 'things that can screw up your build' grows bigger.
The problem is, it's a huge effort to get there. You really have to tune PBO curves for each core individually, as they can vary so much between cores.
Now the test itself is mostly automatic with tools like OCCT, but of course you have to change the settings in the BIOS between each test and you cannot use the computer during that time, so there's a huge opportunity cost. I'm talking about weeks, not days.
To cut a long story short, I sold the system and just bought a M4 Max Mac Studio now. Apple Silicon might not have the top performance of AMD or Intel, but it comes with much less headaches and opportunity cost. Which in the end probably equalizes the difference in purchase cost.
If anyone thinks competition isn't good for the market or that also-rans don't have enough of an effect, just take note. Intel is a cautionary tale. I do agree we would have gotten where we are faster with more viable competitors.
M4 is neat. I won't be shocked if x86 finally gives up the ghost as Intel decides playing in the RISC-V or ARM space is their only hope to get back into an up-cycle. AMD has wanted to do heterogeneous stuff for years. RISC-V might be the way.
One thing I'm finding is that compilers are actually leaving a ton on the table for AMD chips, so I think this is an area where AMD and all of the users, from SMEs on down, can benefit tremendously from cooperatively financing the necessary software to make it happen.
Secondly, what BIOS settings should I be using to run safely? Is XMP (or EXPO, the AMD equivalent) safe? If I don't run XMP, my RAM runs at default speeds way below the stick's spec.
Anyone know of a good guide for this stuff?
Maybe the situation is better on DDR5 platforms.
Yet I also use a 7840U in a gaming handheld running Windows, and haven't had any issues there at all. So I think this is related to AMD Linux drivers and/or Wayland. In contrast, my old laptop with an NVIDIA GPU and Xorg has given me zero issues for about a decade now.
So I've decided to just avoid AMD on Linux on my next machine. Intel's upcoming Panther Lake and Nova Lake CPUs seem promising, and their integrated graphics have consistently been improving. I don't think AMD's dominance will continue for much longer.
Make sure it matches the min of the actual spec of the ram that you bought and what the CPU can do.
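In other words, the platform trains to the lowest common denominator among the stick, the CPU, and the board. A trivial sketch of that rule (the numbers are made up):

```python
def effective_speed(dimm_spec_mts: int, cpu_max_mts: int, board_max_mts: int) -> int:
    """The platform trains to the lowest of the three limits; enabling
    XMP/EXPO beyond the CPU's rated speed is, strictly speaking, memory
    overclocking."""
    return min(dimm_spec_mts, cpu_max_mts, board_max_mts)

# Made-up numbers: a 6000 MT/s kit on a CPU rated for 5600 MT/s.
print(effective_speed(6000, 5600, 6400))  # 5600
```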
I used to get crashes like you are describing on a similar machine. The crashes are in the GPU firmware, making debugging a bit of a crap shoot. If you can run windows with the crashing workload on it, you’ll probably find it crashes the same ways as Linux.
For me, it was a bios bug that underclocked the ram. Memory tests, etc passed.
I suspect there are hard performance deadlines in the GPU stack, and the underclocked memory was causing it to miss them, and assume a hang.
If the ram frequency looks OK, check all the hardware configuration knobs you can think of. Something probably auto-detected wrong.
Don't know about transcoding though.
Threadripper is built for this. But I am talking about the consumer options if you are on a budget. Intel has significantly more memory bandwidth than AMD in the consumer end. I don't have the numbers on hand, but someone at /r/localllama did a comparison a while ago.
> After switching my PC from Intel to AMD, I end up at 10-11 kWh per day.
It's kind of impressive to increase household electricity consumption by 10% by just switching one CPU.
For a time I ran it 24/7 without suspend. It's a big system: lots of disks, expansion cards, etc. If it doesn't suspend and isn't doing anything remarkable, it uses about 5 kWh per day. Needless to say, it suspends after 60 minutes now (my daily energy usage went from ~9 to ~4 kWh).
[1]: https://en.wikipedia.org/wiki/European_countries_by_electric...
I had differences of like 20 or more between different cores... i.e. one core might work fine at -20, the other maybe only at +5.
And while all-core CO might not be optimal, based on personal experience and what I've seen across multiple enthusiast communities, more often than not you can get a worthwhile improvement in temps/perf with an all-core CO.
That being said, there are certainly ways to find and set the best CO values per core, but it will certainly take more effort, stress testing and time.
Pass -fuse-ld=mold to the compiler driver when building.
I recently hit this testing pre-release kernels on my gaming PC, a 9900X3D: https://lore.kernel.org/lkml/20250623083408.jTiJiC6_@linutro...
A pile of older Skylake machines was never able to reproduce that bug one single time in 100+ hours of running the same workload. The fast new AMD chips would almost always hit it in a few hours.
> I get the general impression that the AMD CPU has higher power consumption in all regards: the baseline is higher, the spikes are higher (peak consumption) and it spikes more often / for longer.
> Looking at my energy meter statistics, I usually ended up at about 9.x kWh per day for a two-person household, cooking with induction.
> After switching my PC from Intel to AMD, I end up at 10-11 kWh per day.
It's been the bane of desktop AMD CPUs since Zen 1. Hopefully AMD will address this in Zen 6 but I don't have too much hope.
Zen APUs have no such issue.
My 7840HS idles at 3W when plugged in and around 0.5W when running on battery power.
https://www.reddit.com/r/Amd/comments/1brs42g/amd_please_tac...
I don't bloody care that AMD CPUs seem to be more power efficient than Intel's. For most people their CPUs are completely idle most of the time and Zen CPUs on average idle at 25W or MORE.
Many Zen 4 and Zen 5 owners report that their desktop CPUs idle at 40W or more even without the 3D cache.
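The idle-power figures quoted in these comments are easy to sanity-check, since a constant draw in watts maps directly to kWh per day. A quick back-of-the-envelope in Python (the wattages come from the comments above, not from measurement):

```python
def daily_kwh(watts: float, hours: float = 24.0) -> float:
    """Energy in kWh for a device drawing `watts` continuously for `hours`."""
    return watts * hours / 1000.0

# A 40 W idle draw, running around the clock:
print(daily_kwh(40))        # 0.96 kWh/day -- close to the ~1 kWh/day jump reported above

# Conversely, a ~1 kWh/day increase implies roughly 42 W of extra average draw:
print(1.0 / 24.0 * 1000.0)  # ~41.7 W
```

So a desktop that idles 15 W higher and never suspends is enough on its own to show up visibly on a household energy meter.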
A big surprise for me, having owned both a Ryzen gen 1 & 3 previously, was that this time my system posted without me needing to flash my BIOS or play around with various RAM configurations. Felt like magic.
An ideal ambient (room) temperature for running a computer is 15-25 Celsius (60-77 Fahrenheit).
Source: https://www.techtarget.com/searchdatacenter/definition/ambie...
It is actually 2.9999, precisely.
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +40.0°C (high = +80.0°C, crit = +100.0°C)
Core 0: +38.0°C (high = +80.0°C, crit = +100.0°C)
Core 1: +39.0°C (high = +80.0°C, crit = +100.0°C)
Are they saying this is bad? This Intel CPU has been at it for over a decade. There was a fan issue for half a year, during which it would go up to 80 C. It still works perfectly fine, but it is outdated: it lacks instruction sets that I need, and it has only two cores with one thread each. Maybe today's CPUs would not be able to handle that, I am not sure. One would expect these things to only improve, but it seems that is not the case.
Edit: I misread it, oops! Disregard this comment.
I'd say that even crashing at max temperatures is still completely unreasonable! You should be able to run at 100C, or whatever the max temperature is, for a week non-stop if you damn well please. If you can't, then the value has been chosen wrong by the manufacturer. If the CPU can't handle that, the clock rates should just be dialed back accordingly to maintain stability.
It's odd to hear about Core Ultra CPUs failing like that, though - I thought that they were supposed to be more power efficient than the 13th and 14th gen, all while not having their stability issues.
That said, I currently have a Ryzen 7 5800X, OCed with PBO to hit 5 GHz with negative CO offsets per core set. There's also an AIO with two fans and the side panel is off because the case I have is horrible. While gaming the temps usually don't reach past like 82C but Prime95 or anything else that's computationally intensive can make the CPU hit and flatten out at 90C. So odd to have modern desktop class CPUs still bump into thermal limits like that. That's with a pretty decent ambient temperature between 21C to 26C (summer).
Chips are happy to run at high temperatures, that's not an issue. It's just a tradeoff of expense and performance.
Servers and running things at scale are way different from consumer use cases and the cooling solutions you'll find in the typical desktop tower, esp. considering the average budget and tolerance for noise. Regardless, on a desktop chip, even if you hit tJMax, it shouldn't lead to instability as in the post above, nor should the chips fail.
If they do, then that value was chosen wrong by the manufacturer. The chips should also be clocking back to maintain safe operating temps. Essentially, squeeze out whatever performance is available with a given cooling solution: be it passive (I have some low TDP AM4 chips with passive Alpine radiator blocks), air coolers or AIOs or a custom liquid loop.
> What Intel is doing and what they are recommending is the act of a desperate corporation incapable of designing energy-efficient CPUs, incapable of progressing their performance in MIPS per Watt of power.
I don't disagree with this entirely, but the story is increasingly similar with AMD as well - most consumer chip manufacturers are pushing the chips harder and harder out of the factory, so they can compete on benchmarks. That's why you hear about people limiting the power envelope to 80-90% of stock and dropping close to 10 degrees C in temperatures, similarly you hear about the difficulties of pushing chips all that far past stock in overclocking, because they're already pushed harder than the prior generations.
To sum up: Intel should be less delusional in how far they can push the silicon, take the L and compete against AMD on the pricing, instead of charging an arm and a leg for chips that will burn up. What they were doing with the Arc GPUs compared to the competitors was actually a step in the right direction.
TSMC (AMD's fab) is based in Taiwan, which has its own implications for long-term sustainability and monopoly.
With only two real choices for x86, and the complexity of the global supply chain, it hardly seems like a fair comparison.
I got an i5-13600KF from Amazon last Black Friday (a roughly two-week haul to Hong Kong), initially with a budget motherboard I thought would be fine. It turned out the system would keep shutting off and rebooting with a huge drop in voltage (it was about 10 months later that I learned this is a brownout).
It was for my company computer, but I bought it personally, so it's still mine. I then bought a new SF750 PSU at home and swapped in a 13100 salvaged from a computer someone donated, so the 13600KF could become my personal gaming rig.
I made sure it got a platform that sustains enough power with appropriate thermal headroom, and it was all fine until 6 months ago, when it started to BSOD all over the place: while gaming, programming, or even just resuming from suspend. I had to request refunds for two games because of this; one was accepted and the other wasn't. I also moved development to a cloud machine, because a BSOD in the middle of debugging is really nasty.
So I decided to say "fuck it, I'm going back to AMD". I was actually still using my 3700X rig until a year ago, but I figured the 5-year-old system was becoming an old dog: it just can't run most modern games at even 80 FPS. So I had swapped to the 13600KF as an intermediate replacement, until it glitched out and left me needing another replacement again.
Coincidentally, I had bought a 7945HX engineering-sample ITX motherboard, originally intended for running a Kubernetes homelab (now that I think about it, a big waste of money indeed, yikes). Then I had a eureka moment: why not just use that 7945HX plus the 96GB of DDR5 I had already bought?
So after a painful assemble-reassemble process, I'm back on AMD once again. It has been almost perfect: it scores almost exactly the same as a 5950X, but at only around 100W total package TDP, with almost double the CPU cache. And since it isn't the Zen 5/Zen 5c design that complicates CPU scheduling, I've been able to solve the gaming-versus-productivity dilemma in one machine. The MoDT motherboard itself was just shy of ~1800 HKD in total, which is less than the 5950X CPU alone, and I have huge TDP headroom for the 9070XT I also purchased in June. Almost completely silent with Noctua coolers, too.
The original 13600KF went back to my company with a new 800W PSU, a new case specifically bought to fit the wood aesthetic, and another AMD GPU I salvaged from my NUC (a single-fan 6600XT Challenger). This time it runs surprisingly fine: no kernel panics or PSU brownouts just yet.
After all this in a short span of 10 months, I guess I just reached my own "metastability" now -- Intel CPU for office work, AMD for gaming and workstation.
The old 3700X system is being repurposed again as a cheap Kubernetes homelab, and I guess this time it has found the right place. I don't think I'll need a new purchase for the coming few years, hopefully.
The only problem is that I'm using an engineering sample rather than the retail version of the 7945HX: the retail one can boost up to 5.4GHz while mine only reaches 5.2GHz. For a 600 HKD difference, I would say upgrading to the retail version is not worth it, no?
Besides AMD CPUs of the early 2000s going up in smoke without working cooling, CPUs all throttle before they become temporarily or permanently unstable. Otherwise they are defective.
I've never had a desktop part fail due to max temperatures, but I don't think I've owned one that advertises, or allows itself, to reach or remain at 100°C or higher.
If someone sells a CPU that's specified to work at 100 or 110 degrees and it doesn't then it's either defective or fraudulent, no excuses.
Max Operating Temperature: 105 °C
14900k: https://www.intel.com/content/www/us/en/products/sku/236773/...
Max Operating Temperature: 100 °C
Different CPUs, different specs.
And any CPU from the last decade will just throttle down if it gets too hot. That's how the entire "Turbo" thing works: go as fast as we can until it gets too hot, after which it throttles down.
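That feedback loop can be sketched as a toy simulation. All numbers here are made up for illustration; real CPUs implement this in firmware with far more inputs (power, current, per-core temperatures, boost tables):

```python
def turbo_step(freq_mhz: float, temp_c: float,
               t_max: float = 100.0, f_base: float = 3000.0,
               f_max: float = 5000.0) -> float:
    """One step of a toy turbo governor: creep the clock up while there is
    thermal headroom, back off toward base clock once the limit is hit."""
    if temp_c >= t_max:
        # Throttle: step down toward the base clock to shed heat.
        return max(f_base, freq_mhz - 200.0)
    # Headroom available: step up toward the max boost clock.
    return min(f_max, freq_mhz + 100.0)

freq = 3000.0
for temp in [60, 70, 85, 101, 102, 90]:  # simulated package temperatures (°C)
    freq = turbo_step(freq, temp)
    print(f"{temp}°C -> {freq:.0f} MHz")
```

The key property is that the governor never lets the chip sit above the limit at full boost: once the limit is crossed, frequency falls until temperature recovers, which is why a correctly specced chip should throttle rather than crash.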
ah if only they had incremented that number by one… a new 286 even just in name would be sooo funny… not as funny as bringing back the number 8088 of course
Theoretically that’s likely true. But is there any empirical evidence?
Even underclocked Intel desktop chips are massively faster.