Debugging Hetzner: Uncovering failures with powerstat, sensors, and dmidecode

(ubicloud.com)

345 points · ngalstyan4 · 1y ago · 108 comments

nik736 1y ago

Most other AX models (AX42, AX52 and AX102) also have serious reliability issues, where they will fail after some months. They are based on a faulty motherboard. Hetzner has to replace most, if not all, motherboards for servers built before a certain date over the next 12 months [0].

[0] https://docs.hetzner.com/robot/dedicated-server/general-info...

babuskov 1y ago

I have two AX42s. One has been stable since I got it during the Eurocup discount period. The other has been replaced twice so far, but it looks like the latest replacement is holding up. So it's something like a 50% failure rate based on my small sample. I guess only Hetzner and ASRock know the real numbers.

jonatron 1y ago

At a previous company, devops would regularly find CPU fan failures on Hetzner. That's in addition to the usual expected HD/SSD failures. You've got to do your own monitoring; it's one of the reasons why unmanaged servers are cheaper than cloud instances.
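
A minimal sketch of the kind of do-it-yourself fan check this implies, assuming a Linux host that exposes fan tachometers through the hwmon sysfs interface (`fan*_input`, in RPM). The function name, the `root` parameter, and the 500 RPM threshold are illustrative assumptions, not anything Hetzner provides.

```python
from pathlib import Path


def find_slow_fans(root="/sys/class/hwmon", min_rpm=500):
    """Return {sensor_path: rpm} for fans spinning below min_rpm.

    min_rpm is an arbitrary example threshold; pick one that suits the hardware.
    """
    slow = {}
    for sensor in sorted(Path(root).glob("hwmon*/fan*_input")):
        try:
            rpm = int(sensor.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor unreadable or not reporting a number; skip it
        if rpm < min_rpm:
            slow[str(sensor)] = rpm
    return slow
```

Calling `find_slow_fans()` from a cron job or a monitoring agent and alerting on a non-empty result would be one way to wire it up. Not every board exposes its fans this way, so an empty result is not proof that the fans are healthy.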

jeffbee 1y ago

I regularly find broken thermal solutions in Azure, and when I worked at Google it was also a low-level but constant irritant. When I joined Dropbox I said to my team on my first day that I could find a machine in their fleet running at 400MHz, and I was right: a bogus redundant PSU controller was asserting PROCHOT. These things happen whenever you have a lot of machines.

radicality 1y ago

The term PROCHOT just brought me back to vivid memories of debugging exactly that at Facebook a while ago.

It was very non-obvious to debug since most emitted metrics, apart from mysterious errors/timeouts to our service, looked reasonable. Even the CPU usage and CPU temperature graphs looked normal, since it was a bogus PROCHOT and not real thermal throttling.
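
One metric that can expose this class of problem is the actual clock speed. A rough sketch, assuming the Linux cpufreq sysfs files `scaling_cur_freq` and `cpuinfo_max_freq` (both in kHz) are present; the function name and the 0.5 ratio are hypothetical choices for illustration.

```python
from pathlib import Path


def cores_running_slow(root="/sys/devices/system/cpu", max_ratio=0.5):
    """Return {cpu_name: (cur_khz, max_khz)} for cores below max_ratio of their max clock."""
    slow = {}
    for cpufreq in sorted(Path(root).glob("cpu[0-9]*/cpufreq")):
        try:
            cur = int((cpufreq / "scaling_cur_freq").read_text())
            top = int((cpufreq / "cpuinfo_max_freq").read_text())
        except (OSError, ValueError):
            continue  # core offline or file missing; skip it
        if top > 0 and cur < top * max_ratio:
            slow[cpufreq.parent.name] = (cur, top)
    return slow
```

Idle cores clock down legitimately, so a check like this only means something when it is sampled while the machine is known to be busy.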

bityard 1y ago

A laptop that I had would assert PROCHOT if it didn't like the power supply you plugged into it. It actually took an embarrassing amount of time for me to notice that this is what was causing Slack to be inexplicably slower at my desk than when I was out working in a common area in the building.

tryauuum 1y ago

in my (limited) experience this only happened with GIGABYTE servers

very weird behavior, I'd prefer my servers to crash instead of lowering frequency to 400MHz.

KennyBlanken 1y ago

No? Maybe you cloud kids don't know how this stuff works, but unmanaged just means you get silicon-level access and remote KVM.

It's still the hosting company's responsibility to competently own, maintain, and repair the physical hardware. That includes monitoring. In the old days you had to run a script or install a package to hook into their monitoring, but with IPMI et al. being standard they don't need anything from you to do their job.

The only time a hosting company should be hands-off is when they're just providing rack space, power, and data. Anything beyond that is between you and them in a contract/agreement.

Every time I hear Hetzner come up in the last few years it's been a story about them being incompetent. If they're not detecting things like CPU fan failures of their own hardware and they deployed new systems without properly testing them first, then that's just further evidence they're still slipping.

marcusb 1y ago

> No? Maybe you cloud kids don't know how this stuff works, but unmanaged just means you get silicon-level access and remote KVM.

That's one way it can work. There are a great many hosted server options out there from fully managed to fully unmanaged with price points to match. Selling a cheap server under the conditions "call us when it breaks" is a perfectly reasonable offering.

cuu508 1y ago

Alright, let's say the hosting company has an out-of-band mechanism for detecting reboots. How do they know if the reboots are abnormal (like in this case) or normal, customer-ordered reboots after software upgrades?

babuskov 1y ago

Do Hetzner servers even run IPMI?

For dedicated servers, you have to schedule KVM access in advance, so I assume they need to move some hardware and plug it into your server.

This would mean that IPMI is most likely not available or disabled.

TZubiri 1y ago

I'm heavily against both relying on free dependencies and going for the cheapest option.

If you can't put yourself in their shoes for a second when evaluating a purchase and you just braindead try to make cost go lower and income go higher, you're ngmi except in shady sales businesses.

Server hardware is incredibly cheap. If you are a somewhat competent programmer you can handle most programs on a single server or even a virtual machine. Just give them a little bit of margin and pay $50/mo instead of $25/mo. It's not even enough to guarantee they won't go broke or make you a valuable customer; you'll still be banking on whales to make the whole thing profitable.

Also, if your business is in the US, find a US host ffs.

V__ 1y ago

> Looking back, waiting six months could have helped us avoid many issues. Early adopters usually find problems that get fixed later.

This is really good advice and what I'm following for all systems which need to be stable. If there aren't any security issues, I either wait a few months or keep one or two versions behind.
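
The "wait a few months" rule can be written down as picking the newest release that has already been public for some minimum time. A small sketch under stated assumptions: the release list, the function name, and the 90-day window are all hypothetical, and security fixes would need to bypass the delay.

```python
from datetime import date, timedelta


def newest_settled_release(releases, today, min_age_days=90):
    """Pick the most recent release published at least min_age_days ago.

    releases: iterable of (version, published_date) tuples.
    Returns the version string, or None if nothing is old enough yet.
    """
    cutoff = today - timedelta(days=min_age_days)
    settled = [(published, version) for version, published in releases
               if published <= cutoff]
    if not settled:
        return None
    # max() compares by publication date first, so this is the newest settled one.
    return max(settled)[1]
```

For example, given releases from January, June, and late September, a check run on 1 October with the default window would skip the September release and return the June one.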

esafak 1y ago

GitHub is looking to add this feature to dependabot: https://github.com/dependabot/dependabot-core/issues/3651

TZubiri 1y ago

Being so deep into dependencies that you have to find more dependencies and features to make your dependency less of a clusterfuck is sad.

h1fra 1y ago

In theory that works; in practice, nope. You get a random update with a possible bug inside that is only fixed by a new version that you won't get until later. The other strategy is to wait for a package to be fully stable (no updates), and in that case, packages that receive daily/weekly updates are never updated.

InDubioProRubio 1y ago

This is a wildly successful pattern in nature: the old using the young and inexperienced as enthusiastic test units.

In the wild, for example in the forest, old boars give safety squeaks to send the younglings ahead into a clearing they do not trust. The equivalent of that would be to write a tech blog entry that hypes up a technology that is not yet production ready.

Tzela 1y ago

Just out of curiosity: do you have a source?

pwmtr 1y ago

Author of the blog post here.

Yeah, this is generally a good practice. The silver lining is that our suffering helped uncover the underlying issue faster. :)

This isn’t part of the blog post, but for future purchases we also considered getting the servers and keeping them idle, without actual customer workload, for about a month. This would be more expensive, but it could help identify potential issues without impacting our users. In our case, the crashes started three weeks after we deployed our first AX162 server, so we need at least a month (or maybe even longer) as a buffer period.

ThePowerOfFuet 1y ago

> The silver lining is that our suffering helped uncover the underlying issue faster.

Did you actually uncover the true root cause? Or did they finally uncap the power consumption without telling you, just as they neither confirmed nor denied having limited it?

axus 1y ago

Customers are the best QA. And they pay you too, instead of the reverse!

crishoj 1y ago

Were you able to identify the manufacturer and model/revision of the failing motherboards? This would be extremely helpful when shopping for second-hand servers.
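
For anyone wanting to record this on their own machines: on Linux, board identity of the kind `dmidecode` reports is also exposed as plain files under `/sys/class/dmi/id`. A minimal sketch, assuming those files are present and readable; the function name and the choice of fields are illustrative.

```python
from pathlib import Path

# DMI fields that usually identify the board and firmware revision.
BOARD_FIELDS = ("board_vendor", "board_name", "board_version", "bios_version")


def read_board_info(root="/sys/class/dmi/id"):
    """Return a dict of DMI board fields; missing or unreadable fields map to None."""
    info = {}
    for field in BOARD_FIELDS:
        try:
            info[field] = (Path(root) / field).read_text().strip()
        except OSError:
            info[field] = None
    return info
```

Logging the result of `read_board_info()` alongside crash reports would make it easier to correlate failures with a particular board model or revision.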

fdr 1y ago

It varies by system. As the legendary (to some) Kelly Johnson of the Skunk Works had as one of his main rules:

> The inspection system as currently used by the Skunk Works, which has been approved by both the Air Force and the Navy, meets the intent of existing military requirements and should be used on new projects. Push more basic inspection responsibility back to the subcontractors and vendors. Don't duplicate so much inspection.

But this will be the only and last time Ubicloud does not burn in a new model, or even tranches of purchases (I also work there...and am a founder).

bayindirh 1y ago

Dell has this problem sometimes. I remember getting the first batch of one of their older server models when it was new. For some time we had to replace the motherboards' I/O (rear) sections because the servers lost some devices on that part (e.g. Ethernet controllers, iDRAC, sometimes BIOS). After shaking out these problems, they ran for almost a decade.

We recently retired them because we had worn down everything on these servers, from RAID cards to power regulators. Rebooting a perfectly running server due to a configuration change and losing the RAID card forever because electromigration eroded a trace inside the RAID processor is a sobering experience.

merb 1y ago

Dell has tons of issues. A faulty mini board for the front LED can actually stop the server from booting/running at all (even DRAC will be dead).

bayindirh 1y ago

Interesting. From my experience, Dell is generally one of the least problematic brands when compared in large numbers.

Another surprising name is Huawei. Their servers just don't die.

andai 1y ago

> Hetzner didn’t confirm or deny the possibility of power limiting

What are the consequences of power limiting? The article says it can cause hardware to degrade more quickly. Why?

Hetzner's lack of response here (and Ubicloud's measurements) seems to suggest they are indeed limiting power, since if they weren't doing it, they'd say so, right?

radicality 1y ago

Related and perhaps useful: I’ve seen this in multiple cloud offerings already, where the CPU scaling governor is set to some eco-friendly value, to the benefit of the cloud provider, with zero benefit to you and much reduced peak CPU performance.

To check, run `cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`. It should be `performance`.

If it’s not, set it with `echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`. If your workload is CPU hungry this will help. It will revert on reboot, so you can make it stick with cron, systemd, or whichever you prefer.

Of course if you are the one paying for power or it’s your own hardware, make your own judgement for the scaling governor. But if it’s a rented bare metal server, you do want `performance`.
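
The same check can be done for every core at once and summarized, which is handy on machines with many cores. A minimal sketch, assuming the standard Linux cpufreq sysfs layout; the function name is an illustrative choice.

```python
from collections import Counter
from pathlib import Path


def governor_summary(root="/sys/devices/system/cpu"):
    """Count how many cores use each cpufreq scaling governor."""
    counts = Counter()
    for gov_file in Path(root).glob("cpu[0-9]*/cpufreq/scaling_governor"):
        try:
            counts[gov_file.read_text().strip()] += 1
        except OSError:
            continue  # core offline or cpufreq not available for it
    return dict(counts)
```

A healthy result under the advice above would be a single `performance` key whose count equals the number of cores; an empty dict suggests cpufreq is not exposed at all (for example, inside some virtual machines).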

Tijdreiziger 1y ago

However, eco-friendly power modes can reduce electricity usage, so they can be friendlier for our climate.

https://www.rvo.nl/onderwerpen/energie-besparen-de-industrie...

PinkSheep 1y ago

You can tune the ondemand (or any other) governor first to ramp up faster and clock down slower. "performance" should be seen as the nuclear option.

chpatrick 1y ago

Is there any downside to ondemand? If your servers aren't running at 100% then there's no point wasting watts, even if you aren't paying for them, right?

vitus 1y ago

> To increase the number of machines under power constraints, data center operators usually cap power use per machine. However, this can cause motherboards to degrade more quickly.

Can anyone elaborate on this point? This is counter to my intuition (and in fact, what I saw upon a cursory search), which is that power capping should prolong the useful lifetime of various components.

The only search results I found that claimed otherwise were indicating that if you're running into thermal throttling, then higher operating temperatures can cause components (e.g. capacitors) to degrade faster. But that's expressly not the case in the article, which looked at various temperature sensors.

pwmtr 1y ago

At the time of our investigation, we found a few articles suggesting that power caps could potentially cause hardware degradation, though I don't have the exact sources at hand. I see the child comment shared one example, and after some searching, I found a few more sources [1], [2].

That said, I'm not an electronics engineer, so my understanding might not be entirely accurate. It’s possible that the degradation was caused by power fluctuations rather than the power cap itself, or perhaps another factor was at play.

[1] https://electronics.stackexchange.com/questions/65837/can-el...

[2] https://superuser.com/questions/1202062/what-happens-when-ha...

immibis 1y ago

The power used by a computer isn't limited by giving it less voltage/current than it should have - if it was, the CPU would crash almost immediately. It's done by reducing the CPU's clock rate until the power it naturally consumes is less than the power limit.

nickcw 1y ago

Power = volts * amps

Volts is as supplied by the utility company.

Amps are monitored per rack and the usual data centre response to going over an amp limit is that a fuse blows or the data centre asks you for more money!

The only way you can decrease power used by a server is by throttling the CPUs.

The normal way of throttling CPUs is via the OS which requires cooperation.

I speculate this is possible via the lights-out baseboard management controller (which doesn't need the OS to be involved), but I'm pretty sure you'd see that in /sys if it was.
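
To make the arithmetic concrete, here is a small worked sketch of why an operator might want a per-server cap in the first place. The 230 V / 16 A circuit, the 80% usable-capacity margin, and the 400 W and 300 W per-server figures are made-up illustrative numbers, not anything known about Hetzner.

```python
def servers_per_circuit(volts, amps, watts_per_server, usable_percent=80):
    """How many servers fit on one circuit, using P = V * I.

    usable_percent is an assumed safety margin below the breaker rating.
    All inputs are integers, so the arithmetic stays exact.
    """
    usable_watts = volts * amps * usable_percent // 100
    return usable_watts // watts_per_server


# Hypothetical numbers: one 230 V, 16 A circuit.
uncapped = servers_per_circuit(230, 16, 400)  # servers allowed to draw 400 W
capped = servers_per_circuit(230, 16, 300)    # servers capped at 300 W
print(uncapped, capped)
```

With these assumed figures the circuit carries 7 uncapped servers or 9 capped ones, which is the density incentive behind capping.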

tecleandor 1y ago

Yep, that's weird. I've always read that high power/temp can degrade electronics way faster. Can any EE shed some light here?

avian 1y ago

As an electronics engineer I have no idea what the author is talking about here and was about to post the same question.

redleader55 1y ago

Every rack in a data center has a power budget, which is actually constrained by how much heat the HVAC system can pull out of the DC, rather than how much power is available. Nevertheless it is limited per rack to ensure a few high power servers don't bring down a larger portion of the DC.

I don't know for sure how the limiting is done, but a simple circuit breaker like the ones we have in our houses would be a simple solution for it. That causes the rack to lose power when the circuit breaks, which is not ideal because you lose the whole rack and affect multiple customers.

Another option would be a current/power limiter [0], which would cause more problems because P = U * I. That would make the voltage (U) drop and leave the whole system undervolted. Weird glitches happen here, and it's a common way to bypass various security measures in chips. For example, Raspberry Pi ran this challenge [1] to look for this kind of bug and test how well their chips can handle attacks, including voltage attacks.

[0] https://en.m.wikipedia.org/wiki/Current_limiting

[1] https://www.raspberrypi.com/news/security-through-transparen...

immibis 1y ago

Computers implement power limits by reducing their own speed until their power consumption falls under the limit. There's no risk of damage and it should actually extend the lifetime due to less heat, as well as increasing the efficiency (computation per watt).

No idea what the article is talking about with the damage. Computers like to run slow when possible. There's basically no downside except they take longer to do things.

cibyr 1y ago

One possibility is that at lower power settings, the CPUs don't get as hot, which means the fans don't spin up as much, which can mean that other components also get less airflow and then get hotter than they would otherwise. The fix for this is usually to monitor the temperature of those other components and include that as an input to the fan speed algorithm. No idea if that's what's actually going on here though.
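
A toy illustration of that fix: drive the fan from whichever sensor demands the most cooling, so a hot VRM can spin the fans up even when the CPU is cool. This is a sketch under stated assumptions; the sensor names, linear ramps, and duty limits are hypothetical and not how any particular BMC works.

```python
def fan_duty(temps_c, curves, floor=20, ceiling=100):
    """Return fan duty (percent) as the highest demand over all sensors.

    temps_c: {sensor: temperature in deg C}
    curves:  {sensor: (start_c, full_c)}; demand ramps linearly from floor
             at start_c up to ceiling at full_c.
    """
    demand = floor
    for sensor, temp in temps_c.items():
        if sensor not in curves:
            continue  # no curve configured for this sensor
        start, full = curves[sensor]
        if temp >= full:
            wanted = ceiling
        elif temp <= start:
            wanted = floor
        else:
            wanted = floor + (ceiling - floor) * (temp - start) / (full - start)
        demand = max(demand, wanted)
    return demand
```

With curves of `{"cpu": (40, 80), "vrm": (50, 90)}`, a cool CPU and a 90 °C VRM still yields full fan speed, whereas a CPU-only algorithm would leave the fans at the floor.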

wmf 1y ago

Expert in server power management here. Your intuition is right and the comments/links to the contrary are wrong. Undervolting is unreliable but let's be clear: no one is undervolting servers. I don't even know if it's possible. Power limiting (e.g. RAPL) is completely safe to use because it keeps voltage, frequency, temperature, fan speed, etc within safe bounds.
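
Where the kernel's powercap RAPL driver is loaded, the configured limits can be read from sysfs. A minimal sketch, assuming the `intel-rapl:*` zones under `/sys/class/powercap` with their `name`, `constraint_N_name`, and `constraint_N_power_limit_uw` files; the function name is illustrative, and a limit enforced by firmware or a BMC may not show up here at all.

```python
from pathlib import Path


def read_rapl_limits(root="/sys/class/powercap"):
    """Return {zone: {constraint_name: watts}} for each powercap RAPL zone."""
    limits = {}
    for zone in sorted(Path(root).glob("intel-rapl:*")):
        try:
            zone_name = (zone / "name").read_text().strip()
        except OSError:
            continue  # not a readable zone directory
        constraints = {}
        for name_file in sorted(zone.glob("constraint_*_name")):
            limit_file = zone / name_file.name.replace("_name", "_power_limit_uw")
            try:
                label = name_file.read_text().strip()
                # Limits are stored in microwatts; convert to watts.
                constraints[label] = int(limit_file.read_text()) / 1_000_000
            except (OSError, ValueError):
                continue
        limits[f"{zone.name} ({zone_name})"] = constraints
    return limits
```

Comparing these values with the CPU's rated TDP is one way to see whether a package-level cap has been configured below the advertised figure.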

OptionOfT 1y ago

The only place I could find an answer that sheds some light was Stack Exchange:

https://electronics.stackexchange.com/a/65827

> A mosfet needs a certain voltage at its gate to turn fully on. 8V is a typical value. A simple driver circuit could get this voltage directly from the power that also feeds the motor. When this voltage is too low to turn the mosfet fully on a dangerous situation (from the point of view of the mosfet) can arise: when it is half-on, both the current through it and the voltage across it can be substantial, resulting in a dissipation that can kill it. Death by undervoltage.

chronid 1y ago

We will never know, but I wonder if it could be a power/signaling or VRM issue - the CPU not getting hot doesn't mean something else on the board hasn't gone out of spec and into catastrophic failure.

Motherboard issues around power/signaling are a pain to diagnose. They will emerge as all sorts of problems apparently related to other components (RAM failing to initialize and random restarts are very common in my experience), and you end up swapping everything before actually replacing the MB...

rikafurude21 1y ago

A similar thing happened to an AX102 I currently use: something related to the network card caused crashes. Thankfully Hetzner support was helpful with replacement hardware. It caused quite some grief, but at least it was a good lesson in hardware troubleshooting. Worth it to me personally.

yread 1y ago

Yep same here. AX102 crashes with almost no load, nothing in the logs, won't come on. Hetzner looked at it multiple times and found either nothing or replaced cpu paste or a PSU connector. I migrated to AX162 and so far so good

jaigupta 1y ago

Same here. Hetzner found no issues with the hardware in diagnostics and insisted it was related to the OS/software side, but at my request they changed the hardware, which fixed the issue.

urbandw311er 1y ago

Would anybody with data center experience be able to hazard a guess at what type of commercial resolution Hetzner would have reached with the motherboard supplier here? Would we assume all mobos were replaced free of charge, plus compensation?

wmf 1y ago

When you buy name-brand servers you'll definitely get any faulty hardware replaced. Compensation would only happen if you negotiated for that and you'd have to pay extra. You're probably better off buying some kind of business interruption insurance instead of trying to get vendors to pay you for downtime (even if it is their fault).

Hetzner is not a normal customer though. As part of their extreme cost optimization they probably buy the cheapest components available and they might even negotiate lower prices in exchange for no warranty. In that case they would have to buy replacement motherboards.

babuskov 1y ago

I think they probably got a batch of these really cheap in the first place, because those servers were offered without the setup fee initially. It was during the soccer World Cup in Germany.

jauntywundrkind 1y ago

> To increase the number of machines under power constraints, data center operators usually cap power use per machine. However, this can cause motherboards to degrade more quickly.

This was something I hadn't heard before, & a surprise to me.

scottcha 1y ago

I’d like to see what CPU governor is running on those systems before assuming a power cap is in place. Lots of default installs of Linux ship with the `powersave` governor running, which is going to limit your max frequencies and through that the max power you can hit.

__m 1y ago

schedutil on mine, which is scheduled for mainboard replacement

trod1234 1y ago

It would have been nice if they linked to the power metrics for the new servers.

I think it would be amusing if it turns out they just raised the power limits for those servers not showing the problem up to the base that was originally advertised.

vednig 1y ago

as a CI/CD provider wouldn't it benefit if Ubicloud had their own servers?

immibis 1y ago

Depends how many they need and how much control. Do they want to be a server company or an adapting-servers-to-run-your-CI/CD company or both? You can extract value from both parts of the equation, but theoretical economics tells us you can get the most value for the least effort by doing more of what you're best at and paying someone else to do what they're best at, rather than doing everything mediocrely yourself.

Sometimes that other company isn't actually very good and you can increase value by insourcing their part of your operation. But you can't assume that is always the case. It wouldn't have solved this particular problem - I think we can safely guess that your chance of getting a batch of faulty motherboards is at least as high as Hetzner's chance.

eitland 1y ago

They are in the early stages.

I think the website said they recently raised 16 million euros (or dollars).

Making investments into data centers and hardware could burn through that really quick in addition to needing more engineers.

By using rented servers (and only renting them when a customer signs up) they avoid this problem.

vednig 1y ago

understood, would love to know about it from founders tho, and what went through in their decision

wink 1y ago

> One of the providers we like is Hetzner because of their affordable and reliable servers.

> In the days that followed, the crash frequency increased.

I don't find the article conclusive whether they would still call them reliable.

cbozeman 1y ago

Hetzner's reliable... until they aren't.

Since they don't do any sort of monitoring on their bare metal servers at all, at least insofar as I can tell having been a customer of theirs for ten years, you don't know there's a problem until there's a problem, or unless you've got your own monitoring solution in place.

wink 1y ago

Back in 2012 it regularly happened that we called them because the network was gone, because our monitoring seemed to be better. Or at least quicker than what they showed.

Back in 2006 my coworker claimed he was the person responsible for them adding an "exchange my dead HDD" menu item on the support site because he wrote one of those tickets per week.

When I got a physical server, the HDD died in the first 48h, so I've not exactly forgiven them, even if this was a tragic story over the last 18 or so years...

On the other hand, I've been recommending their cloud VPS for a couple of years because unlike with their HW, I've never had problems.

immibis 1y ago

Seems like this problem was unforeseeable, is isolated to a particular current generation/model of server motherboards (AX2), and doesn't usually happen. I had an AX41* previously with no such problem, so it's not all AXes, just all current-generation AXes (which is all of the AXes they give to new customers, so that's no consolation).

aduffy 1y ago

To their credit they actually fixed the problem. Good luck getting this level of support from any of the big 3 public cloud providers.

frenchtoast8 1y ago

For example, AWS's Mac machines frequently run into hardware failures. My current job runs a measly 5 mac1.metal hosts for internal testing, and we experience hardware failures on these machines a few times a year. Doesn't sound like a lot, but these machines are almost always completely idle, and we almost never get host failures for Linux hosts. To make matters worse, sometimes a brand new instance needs replacement before it even comes up for the first time, which is annoying because you are billed a minimum of 24 hours for these instances. People have been complaining about this for years and seemingly nothing is being done about it.

https://www.reddit.com/r/aws/comments/131v8md/beware_of_brok...

janc_ 1y ago

The main difference being that you talk with real humans who try to help you, not computer programs designed to give you an illusion…

dangoodmanUT 1y ago

Is there a provider that's like bare metal, but would detect these kinds of things mostly automatically? E.g. faulty or constantly crashing hardware.

greggyb 1y ago

Managed servers: https://www.hetzner.com/managed-server/

There are also others, but Hetzner is under discussion here.

Tijdreiziger 1y ago

Managed servers are quite a different product, closer to ‘old-school’ shared webhosting.

You don’t get root access, but you do get a preinstalled LAMP stack and a web UI for management.

gtirloni 1y ago

Anyone got experience with Ubicloud's OpenStack stack?

fdr 1y ago

Ubicloud does not have an OpenStack dependency.

gtirloni 1y ago

Thanks, I was under the impression it did but re-reading the posts I see it's not the case.

indulona 1y ago

i am so glad my sign up process with hetzner failed when i was so dumb that i wanted to give them a chance even with the internet full of horrific stories of bad experiences from their customers. lucky me.

cbozeman 1y ago

Hetzner is fine for what it is, you just need to know that it's all on you and only YOU.

YOU do the monitoring.

YOU do the troubleshooting.

YOU etc., etc.

If that doesn't appeal to you, or if you don't have the requisite knowledge, which I admit is fairly broad and encompassing, then it's not for you. For those of you that meet those checkboxes, they're a pretty amazing deal.

Where else could I get a 4c/8t CPU with 32 GB of RAM and four (4) 6TB disks for $38 a month? I really don't know of many places with that much hardware for that little cost. And yes, it's an Intel i7-3770, but I don't care. It's still a hell of a lot of hardware for not much price.

jaigupta 1y ago

We had been colocating servers for decades, but there is too much "YOU". Compared to that, we find Hetzner doing a lot for us (hardware inventory, replacement, remote hands, networking, etc.). We are slowly moving away from colocating to renting at Hetzner. It is so much better.

j / k navigate · click thread line to collapse

108 comments

nik7361y ago

[0] https://docs.hetzner.com/robot/dedicated-server/general-info...

babuskov1y ago

jonatron1y ago

jeffbee1y ago

radicality1y ago

The term PROCHOT just brought me back to vivid memories of debugging exactly that at Facebook a while ago.

1 more reply

bityard1y ago

tryauuum1y ago

in my (limited) experience this only happened with GIGABYTE servers

very weird behavior, I'd prefer my servers to crash instead of lowering frequency to 400MHz.

3 more replies

KennyBlanken1y ago

No? Maybe you cloud kids don't know how this stuff works, but unmanaged just means you get silicon-level access and remote KVM.

The only time a hosting company should be hands-off is when they're just providing rack space, power, and data. Anything beyond that is between you and them in a contract/agreement.

marcusb1y ago

> No? Maybe you cloud kids don't know how this stuff works, but unmanaged just means you get silicon-level access and remote KVM.

cuu5081y ago

1 more reply

babuskov1y ago

Do Hetzner servers even run IPMI?

For dedicated servers, you have to schedule KVM access in advance, so I assume they need to move some hardware and plug into to your server.

This would mean that IPMI is most likely not available or disabled.

1 more reply

TZubiri1y ago

I'm heavily against both relying on free dependencies and going for the cheapest option.

If you can't put yourself in the shoes for a second when evaluating a purchase and you just braindead try to make cost go lower and income go higher, your ngmi except in shady sales businesses.

Also, if your business is in the US, find a US host ffs.

V__1y ago

> Looking back, waiting six months could have helped us avoid many issues. Early adopters usually find problems that get fixed later.

This is really good advice and what I'm following for all systems which need to be stable. If there aren't any security issues, I either wait a few months or keep one or two versions behind.

esafak1y ago

GitHub is looking to add this feature to dependabot: https://github.com/dependabot/dependabot-core/issues/3651

TZubiri1y ago

Being so deep into dependencies that you have to find more dependencies and features to make your dependency less of a clusterfuck is sad.

1 more reply

h1fra1y ago

1 more reply

InDubioProRubio1y ago

This is a wildly successfully pattern in nature, the old using the young and inexperienced, as enthusiastic test units.

Tzela1y ago

Just for curiosity: do you have a source?

pwmtr1y ago

Author of the blog post here.

Yeah, this is generally a good practice. The silver lining is that our suffering helped uncover the underlying issue faster. :)

ThePowerOfFuet1y ago

>The silver lining is that our suffering helped uncover the underlying issue faster.

Did you actually uncover the true root cause? Or did they finally uncap the power consumption without telling you, just as they neither confirmed nor denied having limited it?

2 more replies

axus1y ago

Customers are the best QA. And they pay you too, instead of the reverse!

1 more reply

crishoj1y ago

Were you able to identify the manufacturer and model/revision of the failing motherboards? This would be extremely helpful when shopping for seconds hand servers.

1 more reply

fdr1y ago

It varies by system. As the legendary (to some) Kelly Johnson of the Skunk Works had as one of his main rules:

But this will be the only and last time Ubicloud does not burn in a new model, or even tranches of purchases (I also work there...and am a founder).

bayindirh1y ago

merb1y ago

Dell has tons of issues. A faulty mini board of the front led can actually stop the server from booting/running at all (even drac will be dead)

bayindirh1y ago

Interesting. From my experience, Dell is generally one of the least problematic brands when compared in large numbers.

Another surprising name is Huawei. Their servers just don't die.

1 more reply

andai1y ago

> Hetzner didn’t confirm or deny the possibility of power limiting

What are the consequences of power limiting? The article says it can cause hardware to degrade more quickly, why?

Hetzner's lack of response here (and UbiCloud's measurements) seems to suggest they are indeed limiting power, since if they weren't doing it, they'd say so, right?

radicality1y ago

To check, run `cat /sys/devices/system/cpu/cpu/cpufreq/scaling_governor`. It should be `performance`.

Of course if you are the one paying for power or it’s your own hardware, make your own judgement for the scaling governor. But if it’s a rented bare metal server, you do want `performance`.

Tijdreiziger1y ago

However, eco-friendly power modes can reduce electricity usage, so they can be friendlier for our climate.

https://www.rvo.nl/onderwerpen/energie-besparen-de-industrie...

2 more replies

PinkSheep1y ago

You can tune the ondemand (or any other) governor first to ramp up faster and clock down slower. "performance" should be seen as the nuclear option.

1 more reply

chpatrick1y ago

Is there any downside to ondemand? If your servers aren't running at 100% then there's no point wasting watts, even if you aren't paying for them, right?

1 more reply

vitus1y ago

> To increase the number of machines under power constraints, data center operators usually cap power use per machine. However, this can cause motherboards to degrade more quickly.

pwmtr1y ago

[1] https://electronics.stackexchange.com/questions/65837/can-el... [2] https://superuser.com/questions/1202062/what-happens-when-ha...

immibis1y ago

nickcw1y ago

Power = volts * amps

Volts is as supplied by the utility company.

Amps are monitored per rack and the usual data centre response to going over an amp limit is that a fuse blows or the data centre asks you for more money!

The only way you can decrease power used by a server is by throttling the CPUs.

The normal way of throttling CPUs is via the OS which requires cooperation.

I speculate this is possible via the lights out base band controller (which doesn't need the os to be involved), but I'm pretty sure you'd see that in /sys if it was.

tecleandor1y ago

Yep, that's weird, I've always read that high power/temp can degrade electronics way faster. Any EE can shed a light here?

avian1y ago

As an electronics engineer I have no idea what the author is talking about here and was about to post the same question.

redleader551y ago

[0] - https://en.m.wikipedia.org/wiki/Current_limiting [1] - https://www.raspberrypi.com/news/security-through-transparen...

immibis1y ago

No idea what the article is talking about with the damage. Computers like to run slow when possible. There's basically no downside except they take longer to do things.

cibyr1y ago

wmf1y ago

OptionOfT1y ago

The only place I could find an answer that sheds some light was Electronics Stack Exchange:

https://electronics.stackexchange.com/a/65827

chronid1y ago

We will never know, but I wonder if it could be a power/signaling or VRM issue - the CPU not getting hot doesn't mean something else on the board hasn't gone out of spec and into catastrophic failure.

rikafurude211y ago

yread1y ago

jaigupta1y ago

Same here. Hetzner found no hardware issues in diagnostics and insisted the problem was on the OS/software side, but at my request they replaced the hardware, which fixed the issue.

urbandw311er1y ago

wmf1y ago

babuskov1y ago

I think they probably got a batch of these really cheap in the first place, because those servers were offered without the setup fee initially. It was during the UEFA European Championship in Germany.

jauntywundrkind1y ago

> To increase the number of machines under power constraints, data center operators usually cap power use per machine. However, this can cause motherboards to degrade more quickly.

This was something I hadn't heard before, & a surprise to me.

scottcha1y ago

__m1y ago

schedutil on mine scheduled for mainboard replacement

trod12341y ago

It would have been nice if they linked to the power metrics for the new servers.

I think it would be amusing if it turns out they just raised the power limits for the servers not showing the problem, up to the base level that was originally advertised.

vednig1y ago

as a CI/CD provider wouldn't it benefit if Ubicloud had their own servers?

immibis1y ago

eitland1y ago

They are in the early stages.

I think the website said they recently raised 16 million euros (or dollars).

Making investments into data centers and hardware could burn through that really quick in addition to needing more engineers.

By using rented servers (and only renting them when a customer signs up) they avoid this problem.

vednig1y ago

understood, would love to know about it from founders tho, and what went through in their decision

wink1y ago

> One of the providers we like is Hetzner because of their affordable and reliable servers.

> In the days that followed, the crash frequency increased.

I don't find the article conclusive whether they would still call them reliable.

cbozeman1y ago

Hetzner's reliable... until they aren't.

wink1y ago

Back in 2012 it regularly happened that we called them because the network was gone, because our monitoring seemed to be better. Or at least quicker than what they showed.

Back in 2006 my coworker claimed he was the person responsible for them adding an "exchange my dead HDD" menu item on the support site, because he wrote one of those tickets per week.

When I got a physical server, the HDD died in the first 48h, so I've not exactly forgiven them, even if this was a tragic story over the last 18 or so years...

On the other hand, I've been recommending their cloud vps for a couple of years because unlike with their HW, I've never had problems.

immibis1y ago

aduffy1y ago

To their credit they actually fixed the problem. Good luck getting this level of support from any of the big 3 public cloud providers.

frenchtoast81y ago

https://www.reddit.com/r/aws/comments/131v8md/beware_of_brok...

janc_1y ago

The main difference being that you talk with real humans who try to help you, not computer programs designed to give you an illusion…

dangoodmanUT1y ago

Is there a provider that's like bare metal, but would detect these kinds of things mostly automatically? E.g. faulty or constantly crashing hardware.

greggyb1y ago

Managed servers: https://www.hetzner.com/managed-server/

There are also others, but Hetzner is under discussion here.

Tijdreiziger1y ago

Managed servers are quite a different product, closer to ‘old-school’ shared webhosting.

You don’t get root access, but you do get a preinstalled LAMP stack and a web UI for management.

gtirloni1y ago

Anyone got experience with Ubicloud's OpenStack stack?

fdr1y ago

Ubicloud does not have an OpenStack dependency.

gtirloni1y ago

Thanks, I was under the impression it did but re-reading the posts I see it's not the case.

indulona1y ago

cbozeman1y ago

Hetzner is fine for what it is, you just need to know that it's all on you and only YOU.

YOU do the monitoring.

YOU do the troubleshooting.

YOU etc., etc.

jaigupta1y ago
