So it appears to affect anyone who depends on IBM Cloud.
Maybe it helps with doing a sanity check before picking a provider. And, I guess, at a basic level it helps with accountability/transparency.
Do you have similar %'s of monitored cloud services that have gone off the air during other providers' outages?
(I figure either you’re in devops and too busy putting out fires to read this thread, or you’re not and your work is halted because of the incident, so you might have time to read and reply ;)
Two of the biggest advantages were:
Price for hardware. As a base price, their bare-metal gear was significantly cheaper than equivalent-specced AWS gear (if it was even possible to get something like that). We managed to snag quite a few 'interesting' configurations at various times that you just couldn't get at all in AWS: things like PCIe SSDs, very large RAM configs, or high-frequency, low-core-count CPUs.
Free international/regional transfer. We took significant advantage of this to move data around, regularly replicating TBs of data between data centers.
At various times management and dev teams would complain and say that we should move everything to AWS (or whatever cloud provider they'd just met with at a conference).
We consistently showed higher performance and lower cost by significant margins. On cost alone, we were paying a small fraction of what it'd cost on AWS, even after taking into consideration ways to reduce cost on AWS such as scaling, spot instances, and reserved instances.
We had a couple thousand bare metal servers, and barely used any of their API stuff.
As with any facility, there were occasional issues with electrical transfer switches, core router failures, fiber cuts, etc. Stuff happens, but we got pretty good communication, and things got resolved in a reasonable amount of time. Service got noticeably worse after IBM, but we were already planning to move to our acquirer's hosting, because that's what happens when you're acquired. Oh, and their load balancers had garbage uptime.
Bandwidth prices used to be pretty reasonable, but they've adopted AWS-style obscene pricing. At least they still let you use the private network for free (including to other datacenters).
MS and Google do provide those features though.
Over the past few years we have experienced quite a few network-related outages. Usually not to this extent; more often a failure of some piece of network gear that takes out either backend or frontend traffic from a particular data center. We seriously priced out a migration to another provider recently, but in the end what held us back was cross-AZ transfer costs on AWS. We found it would raise our operating costs significantly, so the matter was dropped.
We were much happier with the service and support we received prior to the IBM acquisition.
Currently on them because we have an OpenVPN-based infrastructure that is very challenging to migrate.
Lastly, the majority of our customers are in the Midwest or Texas, and the proximity of their Dallas DC was a huge performance win for us.
In small and mid-size organizations, the CSP gives you better pricing, or helps with your sales, etc.
In large organizations, IBM/Oracle bundle the existing products the company is already paying for anyway, or the account managers have great relationships with the decision makers, or the company has already signed big multi-year deals.
This is not just IBM, it applies to GCP/Azure/AWS as well.
I also really like CouchDB which IBM Cloudant is based on.
Is that enough for me to use IBM Cloud? No, not really.
I'm going to wait a bit to see if we get a status update, otherwise we'll be spinning up instances on AWS to failover (which will be enormously costly for bandwidth)
No status, no nothing, we're in the dark.
Literally this morning I was wondering whatever happened to it, like did it die a quiet death? Oh, it rebranded to IBM Cloud in 2017. Now this news.
I think there's an eponymous law named for this sort of thing.
https://cloud.ibm.com/status?selected=history
- 2020-06-10 02:19 UTC - RESOLVED - The network operations team adjusted routing policies to fix an issue introduced by a 3rd party provider and this resolved the incident
But when it worked, it worked. API was voodoo.
It is not that companies become consciously malicious or are incompetent to start with; it becomes a vicious cycle: as more and more poor management and engineering talent joins, the good people leave, and the cycle continues.
Acquisitions and mergers stave off the slow slide into irrelevance for a while, till the best of the new people leave too. Systemic cultural change is very, very hard to achieve in large organizations.
If they are receptive to feedback and clearly want to do better, I would be kind and explain why I had suggested it not be there in the first place and cite this as an example.
If they were being adamant or denying it was their fault, I'd probably be really quiet and just make subtle remarks about how it would have been better if they had listened.
(Was interested to see what you were up to these days, which is how I stumbled on it).
Seriously, they probably tested it and it worked in theory, just not in practice, and now they'll fix it for reals.
The idea that they could even get to this point probably seemed unfathomable. It does to me.
Or do we just accept, and make it the norm, that even the lowest level of organizational governance is corrupt?
I am serious about this, because how people perceive their own rights, their own roles, their own status, their own influence, and their organization's wrongdoing will, in my opinion, shape the long-run attitude toward each and every organization in society.
I know I was blowing the question out of proportion, but it bugged me enough to ask anyway.
But whether you can get away with that depends on culture.
It doesn't help that their status page is also hosted on IBM Cloud.
A better approach is to have it hosted on a different cloud platform. If you really care, you'll set it up on a different domain and with different nameservers as well, with a long-lived redirect (cached on CDNs) from the usual status.example.com or example.com/status.
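For illustration, a minimal sketch of that kind of long-lived redirect, run from infrastructure outside your main provider. The domains, port, and the one-day max-age are just placeholder assumptions, and it's plain Python stdlib rather than anything production-grade:

    # Tiny redirect service you'd host somewhere *other* than your primary cloud.
    # It sends requests for status.example.com to an independently hosted status
    # page, and marks the redirect as CDN-cacheable so edges can keep serving it
    # even if this origin goes down too.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    STATUS_URL = "https://example-status.some-other-provider.net/"  # hypothetical target

    class RedirectHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(301)
            self.send_header("Location", STATUS_URL)
            # Long max-age so CDNs answer from cache during an outage.
            self.send_header("Cache-Control", "public, max-age=86400")
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), RedirectHandler).serve_forever()

In practice you'd likely just configure this as a redirect rule at the CDN edge, but the point is the same: the redirect target and its cache lifetime live outside the infrastructure whose status you're reporting.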
"Our cloud can never go completely down We are IBM, we have Watson..."
At least give me something I can point my customers at to show them this is not due to my incompetence.
The purpose of the signalling here is twofold.
1) If convincing enough (with details), you can keep current customers from moving to a competitor.
2) It also lets new customers see how you actually handle a crisis. If you manage the crisis well enough, you can point to this incident to prove you have the technical know-how to handle their needs.
If they don't say anything, or aren't transparent, then they can expect a mass exodus of customers.
Pingdom[1] showed a spike in website outages from 11k => 27k.
Sorry to be glib, I'm sure it's a tough time for people who were sold on their cloud platform and work on it!
Hope they get a root cause and a quick fix. I’m not a fan of their cloud service but I know people working on the outage and fix are stressed.
>> A 3rd party network provider was advertising routes which resulted in our WW traffic becoming severely impeded.
No IBM computer has ever made a mistake or distorted information. They are all, by any practical definition of the words, foolproof and incapable of error.
Fastly error: unknown domain: www.ebay.com. Please check that this domain has been added to a service. The cable TV channel is still independent.
IBM Cloud - unsafe
At least AWS signs their routes, I think.
If you can't even sign your own routes, it's hard to have a ton of pity.
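For context, "signing your routes" here means publishing RPKI ROAs so other networks can check that an announcement's origin AS is actually authorized for that prefix. A rough sketch of checking a prefix/origin pair via RIPEstat's public rpki-validation endpoint; the example ASN/prefix and the exact response field names are my assumptions from memory, so verify against their docs before relying on it:

    # Query RIPEstat's rpki-validation endpoint for a prefix/origin pair.
    # The status is typically "valid", "invalid", or "unknown" (no covering ROA).
    import json
    import urllib.request

    def rpki_status(origin_asn: str, prefix: str) -> str:
        url = (
            "https://stat.ripe.net/data/rpki-validation/data.json"
            f"?resource={origin_asn}&prefix={prefix}"
        )
        with urllib.request.urlopen(url, timeout=10) as resp:
            payload = json.load(resp)
        return payload["data"]["status"]  # field names assumed from memory

    if __name__ == "__main__":
        # RIPE NCC's own AS/prefix, a common example that should come back "valid".
        print(rpki_status("AS3333", "193.0.0.0/21"))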