EPYC 7002 CPUs may hang after 1042 days of uptime (opens in new tab)

(old.reddit.com)

159 pointsgfv2y ago104 comments

104 comments

I feel like some of the comments here are missing the point. Yes it's only likely to effect a small number of users, so did the intel fdiv bug, both are defective products.

Back then intel were pressured into a recall, today we seem too willing to put up with being sold broken stuff.

Waterluvian2y ago

It feels a little bit different.

One creates uncertainty in all floating point results, given you don’t know when it happens. The other requires you to reboot maybe every ~3 years and you know exactly when it happens.

I’m not saying we should tolerate a defect, but it doesn’t feel nearly as problematic.

arp2422y ago

It also has a fairly easy solution: disable the CC6 sleep state. The practical effects from that will most likely be minimal or non-existent for most users of these CPUs.

bioemerl2y ago

> disable the CC6 sleep state.

This is now the second time AMD has screwed up the C6 state. Ryzen first gen would hang daily for me when due to a similar bug.

2 more replies

touisteur2y ago

I can already see the pain of the myriads of compliance (to all energy reduction directives, at least in EU) people getting strangely obtuse notes from their sw/hw/platform teams, saying in essence, errrrr we need to amend our already thick justification folder, to disable a specific sleep state. I feel a migraine (or a kind of sketch) coming. 'oh and BTW we're field upgrading the whole fleet'.

I guess fighting tooth and nail to disable any and all of these sleep states from the get go is worth it...

1 more reply

gchamonlive2y ago

It means anyone launching amd powered virtual machines on cloud providers can experience this now, at any point, and you don't know when it will happen, given this type of CPU could have been bought, booted or rebooted anytime in the past three years.

Seems comparably problematic to me.

PragmaticPulp2y ago

This depends on the C6 sleep state being enabled and the server not having been restarted for 3 years. It’s extremely unlikely that your cloud provider servers are going to meet this criteria and ignore this errata for a part they’ve bought thousands of.

So no, it’s not going to start randomly hitting people.

> Seems comparably problematic to me.

Not even close. The FDIV bug hit common operations that could be issued millions of times per second. This bug only applies to specific configurations that haven’t been rebooted for 3 years and has a clear workaround.

They’re not even close to comparable in impact and ability to work around. Literally many orders of magnitude different.

1 more reply

tpetry2y ago

The cloud providers now know of this bug. They will live migrate you to a different machine or shutdown reboot. Only on-premise will have this issue.

2 more replies

Waterluvian2y ago

I’d prefer a bug that crashes a program than one that quietly inserts wrong data and keeps going.

2 more replies

viraptor2y ago

> amd powered virtual machines on cloud providers

Cloud providers are very unlikely to use sleep states. I mean, is possible... but I'd bet against it.

1 more reply

callamdelaney2y ago

I think 30 seconds of downtime over 3 years probably isn't that much of an issue for anybody. Floating point calculations being wrong though.. that's a bigger problem.

1 more reply

Sakos2y ago

Google already does preemptive VMs where an instance can go down if it's needed elsewhere. It's something you can design your services to handle easily, if you aren't already doing so.

Why wouldn't cloud providers be aware of how long a specific CPU has been up and plan around it? Also, do cloud providers generally never reboot their systems?

2 more replies

0xr0kk3r2y ago

To be fair, it was possible to tell what operations would be off in the FDIV bug and by how much. It was 100% deterministic. Problem was, checking all the operands in SW before performing the computation to make adjustments completely defeated the purpose of having an FPU.

KptMarchewa2y ago

Especially compared to something like this: https://www.theregister.com/2020/04/02/boeing_787_power_cycl...

gfvOP2y ago

There are always bugs in silicon, just like there are bugs in software. They mostly show up under "a highly specific and detailed set of internal timing conditions". There are 40 documented erratas on EPYC 7002s alone; there are 35 in the 13gen Intel CPUs, including, curiously, RPL038, "Processor Exiting Package C6 or C8 May Hang". Mobile ARM chip manufacturers are notoriously bad at documenting their bugs, so who knows how many they have.

This one is interesting because its preconditions are so trivial, and it will affect many more people than usual.

PragmaticPulp2y ago

> Back then intel were pressured into a recall, today we seem too willing to put up with being sold broken stuff.

This bug only applies to servers that haven’t been rebooted for 3 years and have the CC6 sleep state enabled. It can be worked around by disabling CC6 sleep state or rebooting once every 3 years.

If you think operators of these servers can’t be bothered to update and reboot their machines once in 3 years or change a single BIOS setting, what makes you think they’d be interested in tearing down their servers, physically replacing the CPU, and reassembling all of them with the associated downtime and inevitable accidental damage to some units? Nothing about that makes sense from a business perspective.

xp842y ago

I’m picturing a long 50’ aisle filled with racks and a guy with a huge box marked “replacement CPUs” and a screwdriver.

Good lord, can you imagine how long just a few of those would take in a data center?

acomjean2y ago

I remember coming to work one morning and having staff at two tables with boxes of RSA keys, and swapping everyones...

(they replaced 40 million of those things..)

https://arstechnica.com/information-technology/2011/06/rsa-f...

lanstin2y ago

The old every CPU is sacred idea lives on.

dspillett2y ago

A key difference between then and now is how much easier it is to distribute software/firmware workarounds or fixes. From an end users perspective replacing the CPU might be seen as far easier than updating their software. A software fix would affect performance, so of course it isn't as simple as that, but this difference is part of the dynamic.

Also, as a direct user of the CPU, if the fdiv bug would impact you it would affect you often rather than once every three years which is the impact frequency of this fault.

Another matter that affected the fdiv bug is that the Pentium line was the first time a CPU had been aggressively marketed directly at the general public in quite the way it was. Prior to that only manufacturers and techies would have known about it and they were used to errata for hardware components. The public more generally had an impression that hardware (at least undamaged hardware) was reliable and only software had bugs, and the fdiv bug invalidated that view of reality causing a bit of a panic.

jeroenhd2y ago

These types of bugs have been in hardware forever. Nobody is going to replace hundreds of EPYC servers even if they could get a free replacement from AMD.

There are definitely cases where hardware should be exchanged with fixed chips, particularly the small business/consumer/hobbyist range where exchanging CPUs is worth the time and effort. The RDRAND problem with Ryzen chips was much worse because it actually happened all the time and there is still no microcode fix available for some motherboards (though AMD already makes the fix available so it's more of an issue about a lack of motherboard support than broken hardware).

znpy2y ago

> today we seem too willing to put up with being sold broken stuff.

i remember reading that when hard disks just came into the mass market they were so expensive that having some bad sectors was not such a big deal... and so hard disk would usually come with a sheet of paper listing the known broken sectors (detected at QA stage, i guess).

maybe someone older than me (i guess somebody in their 50ies or 60ies) could confirm that.

bitwize2y ago

I'm not that old, but I remember seeing bad sector lists as stickers on some hard disks.

I'm not sure if that ever went away, though... I think the IDE firmware in more modern hard disks knew how to redirect bad sectors to good sectors, so the end user never even noticed.

sidewndr462y ago

I'm way too young to remember it clearly but from what I was told it was nothing of the sort. Intel announced that they had identified a bug and would review on a case by case basis to see who was affected and would determine if you were worthy of getting a CPU that was fixed.

Again, this is secondhand but from people who worked directly in the industry at the time.

atmavatar2y ago

Don't divide, Intel inside!

Neil442y ago

It seems a C6 state is an individual core sleeping. The intersection of people who don't reboot for 3 years and people who have sleep states enabled must be pretty small. It's an interesting bug though!

vegardx2y ago

I had a very similar issue with some AMD-based servers (bulldozer, I think) about ten years ago. There was a bug where Xen-based virtual machines could set a C-state on cores it was assigned, but for whatever reason it wasn't able to wake them up. It was fun trying to figure out what the heck was going on.

icybox2y ago

I have C-states already disabled because of old linux kernel bug where the kernel hang on Zen3 architecture. So not much to see here :)

nicolaslem2y ago

Do you mean a bug in an old version of linux that is now fixed? Because I have been using Zen3 and Zen3+ on linux since their release and never had to mess with C-states.

lUserAMD2y ago

EPYCs have many cores, and most applications (including those with long uptime requirements) use only a subset of the cores continuously. So it is totally normal for some of the cores to go to deep sleep C6 during phases of lower load. It will cause server operators headaches, when those cores don't come back eventually. Reboots help, disabling C6 in the (already running) OS also helps.

Please note, that we are not talking about a core sleeping for three years. We are talking about a core going to deep sleep, when the system has been up for three years or longer.

neilv2y ago

Reminds me of the Intel Atom C2000 series brickings, circa 2017.

https://www.anandtech.com/show/11110/semi-critical-intel-ato...

https://www.servethehome.com/intel-atom-c2000-series-bug-qui...

nh22y ago

I filed a kernel bug 'System thrashes with "AMD-Vi: Completion-Wait loop timed out" after 247 days of uptime'

for AMD Ryzen 7 3700X

https://bugzilla.kernel.org/show_bug.cgi?id=217257

Might this be potentially related?

eqvinox2y ago

MSR-poking Tool for Zen1 Ryzen CPUs to disable C6: https://github.com/r4m0n/ZenStates-Linux/blob/master/zenstat...

Not sure if this is applicable to EPYC CPUs, probably not. But I would expect that it's possible to disable C6 in some similar way on EPYC CPUs without rebooting the system. (If you are actually at risk of running into this issue, you likely don't want to reboot the system…)

tedunangst2y ago

The good news is now I know why my server crashed last month, and it wasn't some other defect.

SpaghettiCthulu2y ago

You've had a Ryzen 7000 series CPU running for nearly 3 years already?

mattpallissard2y ago

This happened with a higher end Cisco switch (the model escapes me) we used in our core many moons ago. Stopped passing traffic completely after a number of days.

At least Cisco told us about it themselves. We just fail-over rebooted until they fixed it.

msla2y ago

Previously:

https://news.ycombinator.com/item?id=28340101 Watch Windows 95 crash live as it exceeds 49.7 days uptime [video]

bushbaba2y ago

In general it is good practice to have machines hard-restart every now and then. Otherwise you run into some weird edge-cases and rely too much on things being up and running 24x7x365

dale_glass2y ago

A machine staying up for almost 3 years is irresponsible in this day and age.

Yeah, I remember people having uptime competitions on Slashdot and the like some decades back, but you only need to look at the ssh logs of a 5 minutes old machine to realize this is a terrible idea in modern times.

hardware2win2y ago

I dont understand opinons like this

Just because it would be dangerous for your nodejs web_app.exe running on ubuntu behind apache fully exposed on the internet

then there are billion other ways to use computers, like even air gapped systems.

So, dont try to justify obvious flaw

dale_glass2y ago

I mean, hardware is cheap enough that any server of importance should be individually disposable.

Yeah, you can do stuff to maximize uptime but if it needs to stay up that badly you have to consider the case of the hardware needing to be turned off at some point.

> So, dont try to justify obvious flaw

I'm not, it's a bug and should be fixed. But I think if anything is powered for 3 years straight it's a bit concerning.

Otherwise you're liable to find things like that somebody started something by hand 2 years ago, and at a critical moment nobody quite remember what the command was.

PedroBatista2y ago

You live in your own World with other people. Please just keep in mind there are many other Worlds with other people and laws of the Universe.

I don't know if you're young or don't know much about history but what you describe is a fairly recent way of looking at things, it's not the only one and I guarantee you it will become "out of fashion".

1 more reply

edude032y ago

As an additional data point -

I have ~1000 7002 cores in my home DC (8 dual socket R7525s with 48-64 cores each) that run kubernetes but are connected to a battery backup and use kexec to perform upgrades. So, while I am very bought into the cattle not pets philosophy, it's rare that any of these machines need to be turned off and I could see them being on for three years continuously without problem otherwise.

defrost2y ago

> But I think if anything is powered for 3 years straight it's a bit concerning.

Pretty much why Pawsey has an Annual High Voltage inspection shutdown [1]

> Otherwise you're liable to find things like [..]

TBH that's not really been an issue of note at any of the big iron farms I've been around since the 1980s .. generally there's a disciplined approach to maintaining 24/7/365 operation (that includes scheduled downtime for equipment checks) part of which is process documentation and justification and soft means of freezing | migrating processes+data etc.

[1] https://status.pawsey.org.au/incidents/tk5n5y965r5j

gfvOP2y ago

Individually disposable, yes. But if you have a cluster of those, and you powered them on at the same time -- as it often happens -- you're in for an exciting ride when your servers start rebooting almost simultaneously, give or take a few minutes.

ly3xqhl8g92y ago

3 years is irresponsible? To quote Logan Roy, you, software developers, "are not serious people" [1]. Just out of curiosity looked for a list of longest running electrical devices [2]:

    1840 - The Oxford Electric Bell
    1871 – Souter Lighthouse in South Shields, UK
    1896 – The Isle of Man’s Manx Electric Railway
    1902 – The Centennial Bulb

Apparently, "The Centennial Bulb has seen just two interruptions: for a week in 1937 when the Firehouse was refurbished, and in May 2013 when it was off for nine and a half hours due to a failed power supply."

[1] https://www.youtube.com/watch?v=LZTaXjt2Ggk

[2] https://www.drax.com/electrification/4-of-the-longest-runnin...

dathinab2y ago

yes 3 years without hardware reset of a component not designed for long term high reliably use is irresponsible (the are servers fir very high reliability, they are just WAY more expensive)

BUT this doesn't mean you need to have downtime, in the same way a train unit in a railway system going through maintenance doesn't mean your railway system has downtime.

Redundancy is a must have feature for reliable systems and that means you system must be able to cope with random hardware failure or rebooting a server unit.

And both planned and unplanned maintenance of components are important normal business which in a well desingned reliable system should not lead to downtime.

Similar testing failure cases is important and should be done.

so either you don't run a high reliably system (and likely don't run into this bug ever), or you run a proper reliable system (and it's not a big deal), or you run a badly desingned or operated system pretending to be high reliably but but really being that... which is irresponsible (if you are aware)

dale_glass2y ago

Those are completely trivial complexity-wise compared to a modern server, and many don't have a real function, and mostly are artificially maintained as a curiosity.

I mean, the centennial bulb barely glows, that's why it still works. The hotter the filament gets the faster it evaporates, so a light bulb that barely makes any light can stay working forever.

ly3xqhl8g92y ago

Sure, was looking for electrical devices, a better example of what great engineering can achieve I suppose it's the Pons Fabricius [1], bridge built 2,085 years ago, still in use.

The problem is, if we can't expect software to run essentially forever, to update without 'restarts', and so forth, how are we ever going to achieve neural chip implants, artificial organs, synthetic agents mining ore in outer space, and so on? Software is not a gear mechanism, a rack and pinion, there is absolutely no reason to restart an 'operating system' or to ever lose state, however we became accustomed and we commit these sort of crimes daily, restarts and refreshes.

[1] https://en.wikipedia.org/wiki/Pons_Fabricius

1 more reply

oynqr2y ago

You don't need full system resets to get security updates. Kexec, live patching, userspace reboot.

cesarb2y ago

> A machine staying up for almost 3 years is irresponsible in this day and age. [...] but you only need to look at the ssh logs of a 5 minutes old machine to realize this is a terrible idea in modern times.

You don't need to reboot a machine to update ssh.

You only need to reboot the machine to update the kernel; for everything else, you just have to restart the corresponding user-space processes (and even PID1 can re-exec itself). Most kernel vulnerabilities are not remotely exploitable, so as long as you can trust your user-space processes (and keep them updated), it should be safe enough.

kjs32y ago

As I recall, machines made by Tandem Computers, among other highly fault tolerant machines that have regrettably fallen out of fashion, didn't have to reboot even to replace the kernel. They didn't run Linux, tho.

sokoloff2y ago

Air gapped machines and kernel live patching both exist.

cpach2y ago

And how many people use that? Most servers today are not air-gapped.

agentgumshoe2y ago

How many examples will you need before you say "oh ok, I can see some valid concerns."?

I've worked in places where expensive Lab equipment is running off outdated PCs/servers because updates aren't available and they will absolutely stay on for as long as possible.

We're not all silicon valley, things can be expensive and difficult to replace...

BasedAnon2y ago

I have kernel live patching on my mother's computer because it means she has to know how to do less

j16sdiz2y ago

Most server don't do that, but those that do are not crazy

kuratkull2y ago

Are you perhaps a Windows user? In the Linux world updates don't necessarily require reboots.

dale_glass2y ago

Actually as of late, Linux has been moving towards rebooting for update.

Yeah, you technically can replace on-disk files while services are running.

In practice this can cause trouble if an application wants to read an updated file at the wrong time, and library dependencies can require restarting a lot of stuff.

For ages people would install an update containing a security fix in glibc or libz or something, and keep on running the vulnerable version of the services that use them.

At that point you might as well reboot.

Modern Fedora has a very Windows-like mechanism where you reboot to update. You reboot, the system installs updates, then reboots again.

viraptor2y ago

While Fedora did move towards that, it's not the only way. A lot of systems which require high reliability are built to reload correctly.

At a generic system level, for example upgrading Nixos will pull new packages and put them next to the current ones, then reexec where possible. Nginx can replace its master process (SIGUSR2). Telephony software can often reexec and keep connecting open. Etc.

Outside of desktops it's not that uncommon to do seamless live reloads of the whole system.

shrimp_emoji2y ago

I reboot after update just superstitiously.

Also out of superstition, I avoid hibernate -- when I walk away, it's either on and locked or shutdown. (I also did this on Windows; a mixed state just seemed off-puttingly and worryingly complex to me.)

Given what you said, and because I hear hibernation is notoriously buggy on Linux, both superstitions have rewarded me. :D

justinclift2y ago

> Actually as of late, Linux has been moving ...

That's a pretty broad generalisation. Which distro's are you meaning?

1 more reply

arbitrandomuser2y ago

On Arch Linux atleast any external hardware device not already loaded by the kernel will fail to load after a kernel update

gsich2y ago

They do. Kernel and libs require it, unless you want to be unsure if your system is still reboot-safe

shpx2y ago

1042 days ought to be enough for anybody

mhuffman2y ago

Spoken like a true AWS user!

ant6n2y ago

> 1042 days ought to be enough for anybody

"640K ought to be enough for anybody."

Tajnymag2y ago

Not for a server

mihaic2y ago

Not sure if he was sarcastically referencing "640kb should be enough for anything"

mnw21cam2y ago

Kernel bugs are rare. Most (almost every single) vulnerability can be patched without rebooting.

0x02y ago

They're not that rare. Also, there are a lot of other updates that in practice should be followed up with a reboot. For example, any library consumed by systemd (such as openssl) usually requires pid1 to relaunch. For example, debian released an openssl update just yesterday. You can run "checkrestart -v" to try to figure out how to restart every affected app but you'll quickly run into systemd's init process running with the old vulnerable library loaded, and then you might as well just reboot to get a clean "checkrestart -v". Even just relaunching non-pid-1 applications like dbus can quickly create a mess where sshd logins get a delay if you're not careful to also reload everything that depends on it.

Denvercoder92y ago

> For example, any library consumed by systemd (such as openssl) usually requires pid1 to relaunch.

That does not require a reboot, `systemctl daemon-reexec` is enough.

1 more reply

j / k navigate · click thread line to collapse

104 comments

RobotToaster2y ago

I feel like some of the comments here are missing the point. Yes it's only likely to effect a small number of users, so did the intel fdiv bug, both are defective products.

Back then intel were pressured into a recall, today we seem too willing to put up with being sold broken stuff.

Waterluvian2y ago

It feels a little bit different.

One creates uncertainty in all floating point results, given you don’t know when it happens. The other requires you to reboot maybe every ~3 years and you know exactly when it happens.

I’m not saying we should tolerate a defect, but it doesn’t feel nearly as problematic.

arp2422y ago

It also has a fairly easy solution: disable the CC6 sleep state. The practical effects from that will most likely be minimal or non-existent for most users of these CPUs.

bioemerl2y ago

> disable the CC6 sleep state.

This is now the second time AMD has screwed up the C6 state. Ryzen first gen would hang daily for me when due to a similar bug.

2 more replies

touisteur2y ago

I guess fighting tooth and nail to disable any and all of these sleep states from the get go is worth it...

1 more reply

gchamonlive2y ago

Seems comparably problematic to me.

PragmaticPulp2y ago

So no, it’s not going to start randomly hitting people.

> Seems comparably problematic to me.

They’re not even close to comparable in impact and ability to work around. Literally many orders of magnitude different.

1 more reply

tpetry2y ago

The cloud providers now know of this bug. They will live migrate you to a different machine or shutdown reboot. Only on-premise will have this issue.

2 more replies

Waterluvian2y ago

I’d prefer a bug that crashes a program than one that quietly inserts wrong data and keeps going.

2 more replies

viraptor2y ago

> amd powered virtual machines on cloud providers

Cloud providers are very unlikely to use sleep states. I mean, is possible... but I'd bet against it.

1 more reply

callamdelaney2y ago

I think 30 seconds of downtime over 3 years probably isn't that much of an issue for anybody. Floating point calculations being wrong though.. that's a bigger problem.

1 more reply

Sakos2y ago

Google already does preemptive VMs where an instance can go down if it's needed elsewhere. It's something you can design your services to handle easily, if you aren't already doing so.

Why wouldn't cloud providers be aware of how long a specific CPU has been up and plan around it? Also, do cloud providers generally never reboot their systems?

2 more replies

0xr0kk3r2y ago

KptMarchewa2y ago

Especially compared to something like this: https://www.theregister.com/2020/04/02/boeing_787_power_cycl...

gfvOP2y ago

This one is interesting because its preconditions are so trivial, and it will affect many more people than usual.

PragmaticPulp2y ago

> Back then intel were pressured into a recall, today we seem too willing to put up with being sold broken stuff.

This bug only applies to servers that haven’t been rebooted for 3 years and have the CC6 sleep state enabled. It can be worked around by disabling CC6 sleep state or rebooting once every 3 years.

xp842y ago

I’m picturing a long 50’ aisle filled with racks and a guy with a huge box marked “replacement CPUs” and a screwdriver.

Good lord, can you imagine how long just a few of those would take in a data center?

acomjean2y ago

I remember coming to work one morning and having staff at two tables with boxes of RSA keys, and swapping everyones...

(they replaced 40 million of those things..)

https://arstechnica.com/information-technology/2011/06/rsa-f...

lanstin2y ago

The old every CPU is sacred idea lives on.

dspillett2y ago

Also, as a direct user of the CPU, if the fdiv bug would impact you it would affect you often rather than once every three years which is the impact frequency of this fault.

jeroenhd2y ago

These types of bugs have been in hardware forever. Nobody is going to replace hundreds of EPYC servers even if they could get a free replacement from AMD.

znpy2y ago

> today we seem too willing to put up with being sold broken stuff.

maybe someone older than me (i guess somebody in their 50ies or 60ies) could confirm that.

bitwize2y ago

I'm not that old, but I remember seeing bad sector lists as stickers on some hard disks.

I'm not sure if that ever went away, though... I think the IDE firmware in more modern hard disks knew how to redirect bad sectors to good sectors, so the end user never even noticed.

sidewndr462y ago

Again, this is secondhand but from people who worked directly in the industry at the time.

atmavatar2y ago

Don't divide, Intel inside!

Neil442y ago

vegardx2y ago

icybox2y ago

I have C-states already disabled because of old linux kernel bug where the kernel hang on Zen3 architecture. So not much to see here :)

nicolaslem2y ago

Do you mean a bug in an old version of linux that is now fixed? Because I have been using Zen3 and Zen3+ on linux since their release and never had to mess with C-states.

lUserAMD2y ago

Please note, that we are not talking about a core sleeping for three years. We are talking about a core going to deep sleep, when the system has been up for three years or longer.

neilv2y ago

Reminds me of the Intel Atom C2000 series brickings, circa 2017.

https://www.anandtech.com/show/11110/semi-critical-intel-ato...

https://www.servethehome.com/intel-atom-c2000-series-bug-qui...

nh22y ago

I filed a kernel bug 'System thrashes with "AMD-Vi: Completion-Wait loop timed out" after 247 days of uptime'

for AMD Ryzen 7 3700X

https://bugzilla.kernel.org/show_bug.cgi?id=217257

Might this be potentially related?

eqvinox2y ago

MSR-poking Tool for Zen1 Ryzen CPUs to disable C6: https://github.com/r4m0n/ZenStates-Linux/blob/master/zenstat...

tedunangst2y ago

The good news is now I know why my server crashed last month, and it wasn't some other defect.

SpaghettiCthulu2y ago

You've had a Ryzen 7000 series CPU running for nearly 3 years already?

mattpallissard2y ago

This happened with a higher end Cisco switch (the model escapes me) we used in our core many moons ago. Stopped passing traffic completely after a number of days.

At least Cisco told us about it themselves. We just fail-over rebooted until they fixed it.

msla2y ago

Previously:

https://news.ycombinator.com/item?id=28340101 Watch Windows 95 crash live as it exceeds 49.7 days uptime [video]

bushbaba2y ago

In general it is good practice to have machines hard-restart every now and then. Otherwise you run into some weird edge-cases and rely too much on things being up and running 24x7x365

dale_glass2y ago

A machine staying up for almost 3 years is irresponsible in this day and age.

hardware2win2y ago

I dont understand opinons like this

Just because it would be dangerous for your nodejs web_app.exe running on ubuntu behind apache fully exposed on the internet

then there are billion other ways to use computers, like even air gapped systems.

So, dont try to justify obvious flaw

dale_glass2y ago

I mean, hardware is cheap enough that any server of importance should be individually disposable.

Yeah, you can do stuff to maximize uptime but if it needs to stay up that badly you have to consider the case of the hardware needing to be turned off at some point.

> So, dont try to justify obvious flaw

I'm not, it's a bug and should be fixed. But I think if anything is powered for 3 years straight it's a bit concerning.

Otherwise you're liable to find things like that somebody started something by hand 2 years ago, and at a critical moment nobody quite remember what the command was.

PedroBatista2y ago

You live in your own World with other people. Please just keep in mind there are many other Worlds with other people and laws of the Universe.

1 more reply

edude032y ago

As an additional data point -

defrost2y ago

> But I think if anything is powered for 3 years straight it's a bit concerning.

Pretty much why Pawsey has an Annual High Voltage inspection shutdown [1]

> Otherwise you're liable to find things like [..]

[1] https://status.pawsey.org.au/incidents/tk5n5y965r5j

gfvOP2y ago

ly3xqhl8g92y ago

3 years is irresponsible? To quote Logan Roy, you, software developers, "are not serious people" [1]. Just out of curiosity looked for a list of longest running electrical devices [2]:

    1840 - The Oxford Electric Bell
    1871 – Souter Lighthouse in South Shields, UK
    1896 – The Isle of Man’s Manx Electric Railway
    1902 – The Centennial Bulb

[1] https://www.youtube.com/watch?v=LZTaXjt2Ggk

[2] https://www.drax.com/electrification/4-of-the-longest-runnin...

dathinab2y ago

yes 3 years without hardware reset of a component not designed for long term high reliably use is irresponsible (the are servers fir very high reliability, they are just WAY more expensive)

BUT this doesn't mean you need to have downtime, in the same way a train unit in a railway system going through maintenance doesn't mean your railway system has downtime.

Redundancy is a must have feature for reliable systems and that means you system must be able to cope with random hardware failure or rebooting a server unit.

And both planned and unplanned maintenance of components are important normal business which in a well desingned reliable system should not lead to downtime.

Similar testing failure cases is important and should be done.

dale_glass2y ago

Those are completely trivial complexity-wise compared to a modern server, and many don't have a real function, and mostly are artificially maintained as a curiosity.

I mean, the centennial bulb barely glows, that's why it still works. The hotter the filament gets the faster it evaporates, so a light bulb that barely makes any light can stay working forever.

ly3xqhl8g92y ago

Sure, was looking for electrical devices, a better example of what great engineering can achieve I suppose it's the Pons Fabricius [1], bridge built 2,085 years ago, still in use.

[1] https://en.wikipedia.org/wiki/Pons_Fabricius

1 more reply

oynqr2y ago

You don't need full system resets to get security updates. Kexec, live patching, userspace reboot.

cesarb2y ago

You don't need to reboot a machine to update ssh.

kjs32y ago

sokoloff2y ago

Air gapped machines and kernel live patching both exist.

cpach2y ago

And how many people use that? Most servers today are not air-gapped.

agentgumshoe2y ago

How many examples will you need before you say "oh ok, I can see some valid concerns."?

I've worked in places where expensive Lab equipment is running off outdated PCs/servers because updates aren't available and they will absolutely stay on for as long as possible.

We're not all silicon valley, things can be expensive and difficult to replace...

BasedAnon2y ago

I have kernel live patching on my mother's computer because it means she has to know how to do less

j16sdiz2y ago

Most server don't do that, but those that do are not crazy

kuratkull2y ago

Are you perhaps a Windows user? In the Linux world updates don't necessarily require reboots.

dale_glass2y ago

Actually as of late, Linux has been moving towards rebooting for update.

Yeah, you technically can replace on-disk files while services are running.

In practice this can cause trouble if an application wants to read an updated file at the wrong time, and library dependencies can require restarting a lot of stuff.

For ages people would install an update containing a security fix in glibc or libz or something, and keep on running the vulnerable version of the services that use them.

At that point you might as well reboot.

Modern Fedora has a very Windows-like mechanism where you reboot to update. You reboot, the system installs updates, then reboots again.

viraptor2y ago

While Fedora did move towards that, it's not the only way. A lot of systems which require high reliability are built to reload correctly.

Outside of desktops it's not that uncommon to do seamless live reloads of the whole system.

shrimp_emoji2y ago

I reboot after update just superstitiously.

Given what you said, and because I hear hibernation is notoriously buggy on Linux, both superstitions have rewarded me. :D

justinclift2y ago

> Actually as of late, Linux has been moving ...

That's a pretty broad generalisation. Which distro's are you meaning?

1 more reply

arbitrandomuser2y ago

On Arch Linux atleast any external hardware device not already loaded by the kernel will fail to load after a kernel update

gsich2y ago

They do. Kernel and libs require it, unless you want to be unsure if your system is still reboot-safe

shpx2y ago

1042 days ought to be enough for anybody

mhuffman2y ago

Spoken like a true AWS user!

ant6n2y ago

> 1042 days ought to be enough for anybody

"640K ought to be enough for anybody."

Tajnymag2y ago

Not for a server

mihaic2y ago

Not sure if he was sarcastically referencing "640kb should be enough for anything"

mnw21cam2y ago

Kernel bugs are rare. Most (almost every single) vulnerability can be patched without rebooting.

0x02y ago

Denvercoder92y ago

> For example, any library consumed by systemd (such as openssl) usually requires pid1 to relaunch.

That does not require a reboot, `systemctl daemon-reexec` is enough.

1 more reply

j / k navigate · click thread line to collapse