[0] https://docs.hetzner.com/robot/dedicated-server/general-info...
It was very non-obvious to debug, since most emitted metrics, apart from mysterious errors/timeouts to our service, looked reasonable. Even the CPU usage and CPU temperature graphs looked normal, since it was a bogus PROCHOT and not actual thermal throttling.
Very weird behavior; I'd prefer my servers to crash instead of lowering frequency to 400 MHz.
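For what it's worth, a bogus PROCHOT like this does leave one OS-visible trace: cores pinned at their minimum frequency. A rough shell heuristic (a sketch assuming the Linux cpufreq sysfs layout, which is absent on many VMs and containers):

```shell
#!/bin/sh
# Rough heuristic: warn about cores sitting at their hardware minimum
# frequency, which is what a stuck/bogus PROCHOT looks like from the OS.
# (A core can legitimately idle at the minimum too, so treat this as a
# hint, not proof.) Paths assume the Linux cpufreq sysfs interface.
scan_min_freq() {
    for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
        cur="$cpu/cpufreq/scaling_cur_freq"
        min="$cpu/cpufreq/cpuinfo_min_freq"
        [ -r "$cur" ] && [ -r "$min" ] || continue
        if [ "$(cat "$cur")" -le "$(cat "$min")" ]; then
            echo "WARN: ${cpu##*/} at minimum frequency ($(cat "$cur") kHz)"
        fi
    done
    echo "scan complete"
}
scan_min_freq
```

On a healthy box under load this prints nothing but the final line; a machine throttled to 400 MHz would warn on every core.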
It's still the hosting company's responsibility to competently own, maintain, and repair the physical hardware. That includes monitoring. In the old days you had to run a script or install a package to hook into their monitoring, but with IPMI et al. being standard they don't need anything from you to do their job.
The only time a hosting company should be hands-off is when they're just providing rack space, power, and data. Anything beyond that is between you and them in a contract/agreement.
Every time I hear Hetzner come up in the last few years it's been a story about them being incompetent. If they're not detecting things like CPU fan failures of their own hardware and they deployed new systems without properly testing them first, then that's just further evidence they're still slipping.
That's one way it can work. There are a great many hosted server options out there from fully managed to fully unmanaged with price points to match. Selling a cheap server under the conditions "call us when it breaks" is a perfectly reasonable offering.
For dedicated servers, you have to schedule KVM access in advance, so I assume they need to move some hardware and plug into your server.
This would mean that IPMI is most likely not available or disabled.
If you can't put yourself in the customer's shoes for a second when evaluating a purchase and you just braindead try to make cost go lower and income go higher, you're NGMI except in shady sales businesses.
Server hardware is incredibly cheap; if you are a somewhat competent programmer you can run most workloads on a single server or even a virtual machine. Just give them a little bit of margin and pay $50/mo instead of $25/mo. It's not even enough to guarantee they won't go broke or to make you a valuable customer; you'll still be banking on whales to make the whole thing profitable.
Also, if your business is in the US, find a US host ffs.
This is really good advice and what I'm following for all systems which need to be stable. If there aren't any security issues, I either wait a few months or keep one or two versions behind.
In the wild, for example in the forest, old boars give safety squeaks to send the younglings ahead into a clearing they do not trust. The equivalent to that would be to write a tech-blog entry that hypes up a technology that is not yet production ready.
Yeah, this is generally a good practice. The silver lining is that our suffering helped uncover the underlying issue faster. :)
This isn’t part of the blog post, but we also considered, in the future, getting the servers and keeping them idle, without actual customer workload, for about a month. This would be more expensive, but it could help identify potential issues without impacting our users. In our case, the crashes started three weeks after we deployed our first AX162 server, so we'd need at least a month (or maybe even longer) as a buffer period.
Did you actually uncover the true root cause? Or did they finally uncap the power consumption without telling you, just as they neither confirmed nor denied having limited it?
> The inspection system as currently used by the Skunk Works, which has been approved by both the Air Force and the Navy, meets the intent of existing military requirements and should be used on new projects. Push more basic inspection responsibility back to the subcontractors and vendors. Don't duplicate so much inspection.
But this will be the only and last time Ubicloud does not burn in a new model, or even tranches of purchases (I also work there...and am a founder).
We recently retired them because we had worn down everything on these servers, from RAID cards to power regulators. Rebooting a perfectly running server due to a configuration change and losing the RAID card forever because electromigration eroded a trace inside the RAID processor is a sobering experience.
Another surprising name is Huawei. Their servers just don't die.
What are the consequences of power limiting? The article says it can cause hardware to degrade more quickly, why?
Hetzner's lack of response here (and UbiCloud's measurements) seems to suggest they are indeed limiting power, since if they weren't doing it, they'd say so, right?
To check, run `cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`. It should be `performance`.
If it’s not, set it with `echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor` (and likewise for every other CPU). If your workload is CPU-hungry this will help. It will revert on reboot, so make it stick with some cron/systemd or whichever.
Of course if you are the one paying for power or it’s your own hardware, make your own judgement for the scaling governor. But if it’s a rented bare metal server, you do want `performance`.
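To apply this across every core rather than setting each one by hand, a small loop helps; a sketch assuming the standard Linux cpufreq sysfs layout (it needs root to actually write, and degrades gracefully where cpufreq is absent):

```shell
#!/bin/sh
# Set the performance governor on every CPU, not just one.
# Needs root to write; prints a note instead on systems without cpufreq.
set_performance() {
    found=0
    for gov in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -w "$gov" ] || continue
        echo performance > "$gov"
        found=1
    done
    if [ "$found" -eq 1 ]; then
        echo "governor set"
    else
        echo "cpufreq not writable or not available"
    fi
}
set_performance
```

To survive reboots, this could be called from an `@reboot` cron entry or a oneshot systemd unit (name it whatever you like); `cpupower frequency-set -g performance` from the kernel tools does the same loop for you.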
https://www.rvo.nl/onderwerpen/energie-besparen-de-industrie...
Can anyone elaborate on this point? This is counter to my intuition (and in fact, what I saw upon a cursory search), which is that power capping should prolong the useful lifetime of various components.
The only search results I found that claimed otherwise were indicating that if you're running into thermal throttling, then higher operating temperatures can cause components (e.g. capacitors) to degrade faster. But that's expressly not the case in the article, which looked at various temperature sensors.
That said, I'm not an electronics engineer, so my understanding might not be entirely accurate. It’s possible that the degradation was caused by power fluctuations rather than the power cap itself, or perhaps another factor was at play.
[1] https://electronics.stackexchange.com/questions/65837/can-el... [2] https://superuser.com/questions/1202062/what-happens-when-ha...
Voltage is whatever the utility company supplies.
Amps are monitored per rack and the usual data centre response to going over an amp limit is that a fuse blows or the data centre asks you for more money!
The only way you can decrease power used by a server is by throttling the CPUs.
The normal way of throttling CPUs is via the OS which requires cooperation.
I speculate this is possible via the lights-out baseboard management controller (which doesn't need the OS to be involved), but I'm pretty sure you'd see that in /sys if it was.
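If the cap were applied through a mechanism the kernel knows about, one place it would show up is the RAPL powercap interface. A hedged check (these paths exist only when the powercap driver is loaded; a cap imposed purely by the BMC would not appear here, which is the point above):

```shell
#!/bin/sh
# List RAPL package power limits, if the powercap driver exposes them.
# Values are reported in microwatts.
show_rapl_limits() {
    found=0
    for dom in /sys/class/powercap/intel-rapl:*; do
        [ -r "$dom/name" ] || continue
        found=1
        limit=$(cat "$dom/constraint_0_power_limit_uw" 2>/dev/null || echo "?")
        printf '%s: limit %s uW\n' "$(cat "$dom/name")" "$limit"
    done
    [ "$found" -eq 1 ] || echo "powercap interface not available"
}
show_rapl_limits
```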
I don't know for sure how the limiting is done, but a simple circuit breaker like the ones we have in our houses would be a simple solution for it. That causes the rack to lose power when the circuit breaks, which is not ideal because you lose the whole rack and affect multiple customers.
Another option would be a current/power limiter[0], which would cause more problems because P = U * I: the voltage (U) would drop and the whole system would be undervolted. Weird glitches happen then, and undervolting is a common way to bypass various security measures in chips. For example, Raspberry Pi ran this challenge [1] to look for this kind of bug and test how well their chips can handle attacks, including voltage attacks.
[0] - https://en.m.wikipedia.org/wiki/Current_limiting [1] - https://www.raspberrypi.com/news/security-through-transparen...
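As a back-of-the-envelope for the breaker scenario, with illustrative numbers (a 230 V feed behind a 16 A breaker), the hard ceiling per circuit is:

```shell
# P = U * I with example values: 230 V * 16 A.
awk 'BEGIN { printf "%.0f W\n", 230 * 16 }'   # prints "3680 W"
```

That is, roughly 3.7 kW shared across every server on that circuit; trip it and the whole rack goes dark at once.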
No idea what the article is talking about with the damage. Computers like to run slow when possible. There's basically no downside except they take longer to do things.
https://electronics.stackexchange.com/a/65827
> A mosfet needs a certain voltage at its gate to turn fully on. 8V is a typical value. A simple driver circuit could get this voltage directly from the power that also feeds the motor. When this voltage is too low to turn the mosfet fully on a dangerous situation (from the point of view of the mosfet) can arise: when it is half-on, both the current through it and the voltage across it can be substantial, resulting in a dissipation that can kill it. Death by undervoltage.
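To put numbers on the quoted failure mode (illustrative values only: an Rds(on) of 5 mOhm, a 10 A load, and about 6 V left across a half-on device):

```shell
# Dissipation in the MOSFET itself:
#   fully on:  P = I^2 * R = 10^2 * 0.005 = 0.5 W  (fine)
#   half on:   P = V * I   = 6 * 10       = 60 W   (device-killing)
awk 'BEGIN {
    printf "fully on: %.1f W\n", 10 * 10 * 0.005
    printf "half on: %.1f W\n", 6 * 10
}'
```

Two orders of magnitude more heat in the same tiny package is why "half-on" kills the part while "fully on" runs cool.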
Motherboard issues around power/signaling are a pain to diagnose: they emerge as all sorts of problems apparently related to other components (RAM failing to initialize and random restarts are very common in my experience), and you end up swapping everything before actually replacing the MB...
Hetzner is not a normal customer though. As part of their extreme cost optimization they probably buy the cheapest components available and they might even negotiate lower prices in exchange for no warranty. In that case they would have to buy replacement motherboards.
This was something I hadn't heard before, and a surprise to me.
I think it would be amusing if it turned out they just raised the power limits for those servers not showing the problem back up to the baseline that was originally advertised.
Sometimes that other company isn't actually very good and you can increase value by insourcing their part of your operation. But you can't assume that is always the case. It wouldn't have solved this particular problem - I think we can safely guess that your chance of getting a batch of faulty motherboards is at least as high as Hetzner's chance.
I think the website said they recently raised 16 million euros (or dollars).
Making investments into data centers and hardware could burn through that really quick in addition to needing more engineers.
By using rented servers (and only renting them when a customer signs up) they avoid this problem.
> In the days that followed, the crash frequency increased.
I don't find the article conclusive whether they would still call them reliable.
Since they don't do any sort of monitoring on their bare metal servers at all, at least insofar as I can tell having been a customer of theirs for ten years, you don't know there's a problem until there's a problem, or unless you've got your own monitoring solution in place.
Back in 2006 my coworker claimed he was the person responsible for them adding an "exchange my dead HDD" menu item on the support site, because he wrote one of those tickets per week.
When I got a physical server, the HDD died in the first 48h, so I've not exactly forgiven them, even if this was a tragic story over the last 18 or so years...
On the other hand, I've been recommending their cloud vps for a couple of years because unlike with their HW, I've never had problems.
https://www.reddit.com/r/aws/comments/131v8md/beware_of_brok...
There are also others, but Hetzner is under discussion here.
You don’t get root access, but you do get a preinstalled LAMP stack and a web UI for management.
YOU do the monitoring.
YOU do the troubleshooting.
YOU etc., etc.
If that doesn't appeal to you, or if you don't have the requisite knowledge, which I admit is fairly broad and encompassing, then it's not for you. For those of you who tick those boxes, they're a pretty amazing deal.
Where else could I get a 4c/8t CPU with 32 GB of RAM and four (4) 6TB disks for $38 a month? I really don't know of many places with that much hardware for that little cost. And yes, it's an Intel i7-3770, but I don't care. It's still a hell of a lot of hardware for not much money.