[0] https://docs.hetzner.com/robot/dedicated-server/general-info...
It was far from obvious to debug, since almost all emitted metrics, apart from mysterious errors and timeouts in our service, looked reasonable. Even the CPU usage and CPU temperature graphs looked normal, since it was a bogus PROCHOT and not real thermal throttling.
It kept dropping to 400 MHz. I suspected throttling, so we had it cleaned, the thermal paste replaced, and all that.
Still throttled. We replaced Windows with Linux, since it was at least a bit more usable.
At the time I didn't know about PROCHOT. And my googling skills clearly weren't sufficient.
One fine day, during lunch at a place on campus, I realized I'd recently read about BD_PROCHOT. So I wrote a script using msrprobe, or whatever it was, and disabled it. That "extended" the lifespan of the thing.
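For anyone who hasn't met BD_PROCHOT before, here is a minimal sketch of what such a script might look like. It is not the original script. It assumes an Intel CPU where BD_PROCHOT is bit 0 of MSR 0x1FC (the commonly cited location; check the documentation for your CPU), a Linux host with the `msr` kernel module loaded, and root access. The function names are hypothetical.

```python
import os
import struct

# Assumptions (not from the thread): commonly cited location of the
# BD_PROCHOT enable bit on Intel CPUs. Verify for your own hardware.
MSR_POWER_CTL = 0x1FC
BD_PROCHOT_BIT = 1 << 0


def bd_prochot_enabled(value: int) -> bool:
    """True if the BD_PROCHOT bit is set in a raw MSR value."""
    return bool(value & BD_PROCHOT_BIT)


def without_bd_prochot(value: int) -> int:
    """Return the MSR value with only the BD_PROCHOT bit cleared."""
    return value & ~BD_PROCHOT_BIT


def read_msr(cpu: int, reg: int) -> int:
    """Read a 64-bit MSR through /dev/cpu/N/msr (root, `modprobe msr`)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The register number is used as the file offset.
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def write_msr(cpu: int, reg: int, value: int) -> None:
    """Write a 64-bit MSR through /dev/cpu/N/msr. Use with care."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), reg)
    finally:
        os.close(fd)


# Example use (as root, at your own risk):
#   current = read_msr(0, MSR_POWER_CTL)
#   if bd_prochot_enabled(current):
#       write_msr(0, MSR_POWER_CTL, without_bd_prochot(current))
```

Writing MSRs can destabilize a machine, and the setting typically does not survive a reboot, so treat this as a diagnostic aid and not a fix.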
Very weird behavior. I'd prefer my servers to crash instead of lowering their frequency to 400 MHz.
I have alerts on PSUs and frequency for this reason.
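A frequency alert can be as simple as polling the cpufreq files in sysfs. The sketch below is one possible shape, not anyone's production check: the function name and the 1 GHz threshold are made up for illustration, and it assumes a Linux host that exposes `scaling_cur_freq` (values in kHz).

```python
from pathlib import Path


def throttled_cpus(min_khz=1_000_000, sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: khz} for cores currently running below min_khz."""
    slow = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_cur_freq"
    for freq_file in sorted(Path(sysfs_root).glob(pattern)):
        khz = int(freq_file.read_text().strip())
        if khz < min_khz:
            # .../cpuN/cpufreq/scaling_cur_freq -> "cpuN"
            slow[freq_file.parent.parent.name] = khz
    return slow
```

With a governor other than `performance`, idle cores legitimately clock down, so a real alert would probably fire only when every core stays below the threshold for a while, or when the machine is under load at the same time.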
The servers are so cheap that overcommitting them by double is still significantly cheaper than using cloud hosting, which tends to have the same issue, only there it is harder to monitor. Most people using cloud seem happy not to know, though, and it has long been known that there's a 5x variation between instances of the same size on AWS: https://www.brendangregg.com/Slides/AWSreInvent2017_performa...
100% agreed. There is nothing worse than a slow server in your fleet. This behavior reeks of "pet" thinking.
It's still the hosting company's responsibility to competently own, maintain, and repair the physical hardware. That includes monitoring. In the old days you had to run a script or install a package to hook into their monitoring....but with IPMI et al being standard they don't need anything from you to do their job.
The only time a hosting company should be hands-off is when they're just providing rack space, power, and data. Anything beyond that is between you and them in a contract/agreement.
Every time I hear Hetzner come up in the last few years it's been a story about them being incompetent. If they're not detecting things like CPU fan failures of their own hardware and they deployed new systems without properly testing them first, then that's just further evidence they're still slipping.
That's one way it can work. There are a great many hosted server options out there from fully managed to fully unmanaged with price points to match. Selling a cheap server under the conditions "call us when it breaks" is a perfectly reasonable offering.
> In the old days you had to run a script or install a package to hook into their monitoring....but with IPMI et al being standard they don't need anything from you to do their job
For dedicated servers, you have to schedule KVM access in advance, so I assume they need to move some hardware and plug it into your server.
This would mean that IPMI is most likely not available or disabled.
If you can't put yourself in their shoes for a second when evaluating a purchase, and you just mindlessly try to push cost lower and income higher, you're ngmi, except in shady sales businesses.
Server hardware is incredibly cheap. If you are a somewhat competent programmer, you can handle most programs on a single server or even a virtual machine. Just give them a little bit of margin and pay $50/mo instead of $25/mo. It's not even enough to guarantee they won't go broke or to make you a valuable customer; you'll still be banking on whales to make the whole thing profitable.
Also, if your business is in the US, find a US host ffs.
This is really good advice and what I'm following for all systems which need to be stable. If there aren't any security issues, I either wait a few months or keep one or two versions behind.
Windows is a well-known example; people used to wait for a service pack or two before upgrading.
In the wild, for example in the forest, old boars give safety squeaks to send the younglings ahead into a clearing they do not trust. The equivalent would be to write a tech blog entry that hypes up a technology that is not yet production-ready.
Yeah, this is generally a good practice. The silver lining is that our suffering helped uncover the underlying issue faster. :)
This isn’t part of the blog post, but for future purchases we also considered getting the servers and keeping them idle, without actual customer workload, for about a month. This would be more expensive, but it could help identify potential issues without impacting our users. In our case, the crashes started three weeks after we deployed our first AX162 server, so we need at least a month (or maybe even longer) as a buffer period.
Did you actually uncover the true root cause? Or did they finally uncap the power consumption without telling you, just as they neither confirmed nor denied having limited it?
I don't believe they simply lifted a power cap (if there was one in the first place). I genuinely think the fix came after the motherboard replacements. We had 2 batches of motherboard replacements and after that, the issue disappeared.
If someone from Hetzner is here, maybe they can give extra information.
[1] https://status.hetzner.com/incident/7fae9cca-b38c-4154-8a27-...
> The inspection system as currently used by the Skunk Works, which has been approved by both the Air Force and the Navy, meets the intent of existing military requirements and should be used on new projects. Push more basic inspection responsibility back to the subcontractors and vendors. Don't duplicate so much inspection.
But this will be the only and last time Ubicloud does not burn in a new model, or even tranches of purchases (I also work there...and am a founder).
We recently retired them because we had worn out everything on these servers, from RAID cards to power regulators. Rebooting a perfectly running server due to a configuration change and losing the RAID card forever because electromigration eroded a trace inside the RAID processor is a sobering experience.
Another surprising name is Huawei. Their servers just don't die.
What are the consequences of power limiting? The article says it can cause hardware to degrade more quickly. Why?
Hetzner's lack of response here (and Ubicloud's measurements) seems to suggest they are indeed limiting power, since if they weren't doing it, they'd say so, right?
To check, run `cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`. It should be `performance`.
If it’s not, set it with `echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`. If your workload is CPU-hungry, this will help. The setting reverts on reboot, so make it stick with cron, systemd, or whichever you prefer.
Of course, if you are the one paying for power or it’s your own hardware, use your own judgement for the scaling governor. But if it’s a rented bare metal server, you do want `performance`.
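If you'd rather check this from a monitoring script than by eye, the same per-CPU `scaling_governor` files can be read programmatically. A minimal sketch, assuming a Linux host with cpufreq exposed in sysfs; the function name is hypothetical.

```python
from pathlib import Path


def non_performance_cpus(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for cores not using `performance`."""
    result = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(sysfs_root).glob(pattern)):
        governor = gov_file.read_text().strip()
        if governor != "performance":
            # .../cpuN/cpufreq/scaling_governor -> "cpuN"
            result[gov_file.parent.parent.name] = governor
    return result
```

An empty result means every core that exposes cpufreq is on `performance`. Running this at boot, or from the same cron or systemd unit that sets the governor, would catch the setting silently reverting.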
https://www.rvo.nl/onderwerpen/energie-besparen-de-industrie...
If I rent a server I want to be able to run it to the maximum capacity, since I'm paying for all of it. It's dishonest to make me pay for X and give me < X. Idle CPU is wasted money.
The flip side is that the provider should be also offering more climate friendly, lower power options. I'll still want to run them to the max, but the total energy consumed would be less than before.
Also not forgetting that code efficiency matters if we want to get the max results for the minimum carbon spend. Another reason why giant web frameworks and bloated OSes depress me a little.
It can be pretty annoying, because it means that systems can perform better under higher load and that you get drastically different latency depending on whether a request is scheduled on a core that just processed another request (already at high freq) or one that was idle.
And because the frequency control isn't fun enough, this behavior also exists with cpu idle states. Even at high frequency Linux can enter idle states...
I've debugged several cases where this set of issues has caused unintuitive behavior. E.g.
a) switching to more powerful servers drastically increased latency
b) optimized code resulting in higher latency / lower throughput, because the optimization left enough idle cycles for a deeper idle state between requests
c) slightly increased I/O latency leading to significantly worse overall performance, because the I/O took long enough for the CPU to clock down
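When chasing cases like these, it helps to see which idle states a core can enter and how expensive they are to leave. Linux exposes this under `cpuidle` in sysfs; the sketch below lists each state's name, exit latency in microseconds, and whether it is disabled. The function name is hypothetical, and it assumes a kernel with cpuidle enabled.

```python
from pathlib import Path


def idle_states(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return [(name, exit_latency_us, disabled)] for one CPU's idle states."""
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    # Sort numerically so state10 comes after state2.
    state_dirs = sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    states = []
    for state_dir in state_dirs:
        name = (state_dir / "name").read_text().strip()
        latency_us = int((state_dir / "latency").read_text().strip())
        disabled = (state_dir / "disable").read_text().strip() != "0"
        states.append((name, latency_us, disabled))
    return states
```

A deep state with an exit latency in the hundreds of microseconds is a plausible suspect when per-request latency gets worse as load goes down.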
Can anyone elaborate on this point? This is counter to my intuition (and in fact, what I saw upon a cursory search), which is that power capping should prolong the useful lifetime of various components.
The only search results I found that claimed otherwise were indicating that if you're running into thermal throttling, then higher operating temperatures can cause components (e.g. capacitors) to degrade faster. But that's expressly not the case in the article, which looked at various temperature sensors.
That said, I'm not an electronics engineer, so my understanding might not be entirely accurate. It’s possible that the degradation was caused by power fluctuations rather than the power cap itself, or perhaps another factor was at play.
[1] https://electronics.stackexchange.com/questions/65837/can-el... [2] https://superuser.com/questions/1202062/what-happens-when-ha...
Volts is as supplied by the utility company.
Amps are monitored per rack and the usual data centre response to going over an amp limit is that a fuse blows or the data centre asks you for more money!
The only way you can decrease power used by a server is by throttling the CPUs.
The normal way of throttling CPUs is via the OS which requires cooperation.
I speculate this is possible via the lights-out baseboard management controller (which doesn't need the OS to be involved), but I'm pretty sure you'd see that in /sys if it was.
I don't know for sure how the limiting is done, but a simple circuit breaker like the ones we have in our houses would be a simple solution for it. That causes the rack to lose power when the breaker trips, which is not ideal because you lose the whole rack and affect multiple customers.
Another option would be a current/power limiter[0], which would cause more problems because P = U * I. That would make the voltage (U) drop and leave the whole system undervolted. Weird glitches happen in that state, and it's a common way to bypass various security measures in chips. For example, Raspberry Pi ran this challenge [1] to look for this kind of bug and test how well their chips can handle attacks, including voltage attacks.
[0] - https://en.m.wikipedia.org/wiki/Current_limiting [1] - https://www.raspberrypi.com/news/security-through-transparen...
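To make the voltage-drop argument concrete, here is a toy model: an ideal current limiter feeding a purely resistive load. That is a big simplification (a real server sits behind a regulating PSU and is not a resistor), and the function name and numbers are invented for illustration only.

```python
def load_voltage(supply_v: float, load_ohms: float, current_limit_a: float) -> float:
    """Voltage across a resistive load behind an ideal current limiter."""
    wanted_current = supply_v / load_ohms  # Ohm's law: I = U / R
    if wanted_current <= current_limit_a:
        return supply_v  # limiter not active, full supply voltage
    # Limiter clamps the current, so the load only sees U = I_limit * R.
    return current_limit_a * load_ohms


# A 0.5 ohm load on 12 V wants 24 A; with a 16 A limit it sees a lower voltage.
sagged = load_voltage(12.0, 0.5, 16.0)
```

The point is only the direction of the effect: once the load wants more current than the limiter allows, the voltage it sees falls below the nominal supply.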
No idea what the article is talking about with the damage. Computers like to run slow when possible. There's basically no downside except they take longer to do things.
https://electronics.stackexchange.com/a/65827
> A mosfet needs a certain voltage at its gate to turn fully on. 8V is a typical value. A simple driver circuit could get this voltage directly from the power that also feeds the motor. When this voltage is too low to turn the mosfet fully on a dangerous situation (from the point of view of the mosfet) can arise: when it is half-on, both the current through it and the voltage across it can be substantial, resulting in a dissipation that can kill it. Death by undervoltage.
Motherboard issues around power/signaling are a pain to diagnose. They show up as all sorts of problems apparently related to other components (RAM failing to initialize and random restarts are very common in my experience), and you end up swapping everything before actually replacing the MB...
https://docs.hetzner.com/robot/dedicated-server/general-info...
Hetzner is not a normal customer though. As part of their extreme cost optimization they probably buy the cheapest components available and they might even negotiate lower prices in exchange for no warranty. In that case they would have to buy replacement motherboards.
This was something I hadn't heard before, & a surprise to me.
I think it would be amusing if it turns out they just raised the power limits, for the servers no longer showing the problem, up to the baseline that was originally advertised.
Sometimes that other company isn't actually very good and you can increase value by insourcing their part of your operation. But you can't assume that is always the case. It wouldn't have solved this particular problem - I think we can safely guess that your chance of getting a batch of faulty motherboards is at least as high as Hetzner's chance.
I think the website said they recently raised 16 million euros (or dollars).
Making investments into data centers and hardware could burn through that really quick in addition to needing more engineers.
By using rented servers (and only renting them when a customer signs up) they avoid this problem.
Building and owning an institution that finances, racks, services, networks, and disposes of servers both takes time and increases the commitment level. Hetzner is month to month, with a fixed overhead for fresh leasing of servers: the set-up fee.
This is a lot to administer when also building a software institution, and a business. It was not certain at the outset, for example, that the GitHub Actions Runner product would be as popular as it became. In its earliest form, it was partially an engineering test for our virtual machines, and we went around asking friendly contacts that we knew would report abnormalities to use it. There's another universe where it only went as far as an engineering test, and our utilization and revenue pattern (that is, utility to other people) is different.
> In the days that followed, the crash frequency increased.
I don't find the article conclusive whether they would still call them reliable.
Since they don't do any sort of monitoring on their bare metal servers at all, at least insofar as I can tell having been a customer of theirs for ten years, you don't know there's a problem until there's a problem, unless you've got your own monitoring solution in place.
Back in 2006 my coworker claimed he was the person responsible for them adding an "exchange my dead HDD" menu item on the support site, because he wrote one of those tickets per week.
When I got a physical server, the HDD died in the first 48h, so I've not exactly forgiven them, even if this was a tragic story over the last 18 or so years...
On the other hand, I've been recommending their cloud VPS for a couple of years because, unlike with their hardware, I've never had problems.
https://www.reddit.com/r/aws/comments/131v8md/beware_of_brok...
There are also others, but Hetzner is under discussion here.
You don’t get root access, but you do get a preinstalled LAMP stack and a web UI for management.
YOU do the monitoring.
YOU do the troubleshooting.
YOU etc., etc.
If that doesn't appeal to you, or if you don't have the requisite knowledge, which I admit is fairly broad and encompassing, then it's not for you. For those of you that meet those checkboxes, they're a pretty amazing deal.
Where else could I get a 4c/8t CPU with 32 GB of RAM and four (4) 6TB disks for $38 a month? I really don't know of many places with that much hardware for that little cost. And yes, it's an Intel i7-3770, but I don't care. It's still a hell of a lot of hardware for not much price.