There's nothing to brag about here, I just wanted to let y'all know we're listening (even when things aren't on the HN front page).
Hang in there. You all will learn from this and be better for it. Your architecture will improve. Customers will give you a second chance. This too shall pass.
Sending positive vibes.
Shame cuz we were excited about our nomad+consul+vault setup and invested a lot of money into building it. But just didn’t have the time or enough depth of expertise to babysit it.
Still love using Fly, please add static assets hosting/CDN.
It's either very smart (if they pull it off), because they'll have a ginormous cost advantage, or they fail.
I'm personally of the opinion that the ux on top of aws/gcp/... is worse than a doo-doo in a shoe. However, they are as stable as can be (all complex systems go down once in a while). There are very few mature projects that do not rely on aws/gcp/... managed services anyway. Might as well put in the little bit of effort to set yourself up for the future instead of painful migrations. This obviously doesn't hold for hobby projects.
In any case, I have a lot of respect for the engineering that fly does. Kudos.
AWS isn’t perfect but these lessons were learned by fire because these sorts of global outages can seriously harm reputations.
They even specifically call out Consul as a source of trouble.
> We propagate app instance and health information across all our regions. That’s how our proxies know where to route requests, and how our DNS servers know what names to give out.
> We started out using HashiCorp Consul for this. But we were shoehorning Consul, which has a centralized server model design for individual data center deployments, into a global service discovery role it wasn’t suited for. The result: continuously stale data, a proxy that would route to old expired interfaces, and private DNS that would routinely have stale entries.
As an aside, it's also taking down some decently-load-bearing web infra like unpkg => https://www.unpkg.com/
At least they're transparent about their issues, gotta give them that. I still kinda root for them, maybe they'll make a comeback.
“We are working to build a new Consul cluster with 10x the RAM. We aren't yet sure, but believe a routine DNS change might have created a thundering herd problem causing Consul servers to immediately increase RAM usage by 500%. This is not ideal.”
_This is not ideal._
Great read on how the issue was approached, handled, and ultimately remediated.
[1] https://blog.roblox.com/2022/01/roblox-return-to-service-10-...
Tried to restart our app from the command line, only to be told they had disabled the API. And there is no restart feature on their dashboard. So all I could do was watch the flyio logs telling me that our apps were down.
Sigh.
We moved from Heroku to Fly.io only this January, and are already considering moving away from it. The reliability is miserable at best. And so many basic features are missing. Yes, it's much cheaper than Heroku, but we ended up spending far more time/resources/money dealing with its glitches. That defeats the purpose of using a PaaS in the first place.
[1] We're using so little infra at present that we're within their free usage tier. However, I want to clarify that this isn't because we aren't willing to pay, we specifically want to pay for reliable managed offerings. That's actually the entire point! If Fly.io can deliver on their vision, we'd gladly be billed at 100x the current usage rates.
You don’t need to orchestrate a complex cluster to serve thousands or even millions of users. You can scale to hundreds of gigs of memory on a single machine nowadays.
Though I think a lot of this is incidental to just not really knowing the deal, and ops from scratch means you have to make a lot of tiny decisions like "OK how do I get this package over here, how do I set it up, do I wipe the VM on OS-level updates, do I need scripts for resetting the machine..." Having pre-made decisions for a bunch of questions means you aren't spending a bunch of time on tedious stuff when starting up a project.
As a person with no background in distributed systems, I am wondering why people choose Consul over alternatives. Are there features that etcd doesn't offer?
I don't believe etcd would have been any better for us, though. Centralized service discovery that runs through raft consensus doesn't make a lot of sense for the things we need to do. And when I've had etcd blow up on me in the past, it's been similarly painful to recover from.
Most people don't even know that the Kubernetes control plane by default has a hard limit on etcd size. It used to be 2GB, not sure what it is now.
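For reference, the limit in question is etcd's backend size quota. A minimal sketch of how it's raised and what happens when it's hit, based on the documented etcd v3 flags (the 8GiB figure is the documented upper recommendation, not a hard cap):

```shell
# etcd's backend size quota defaults to 2GiB; it can be raised at startup:
etcd --quota-backend-bytes=8589934592   # 8 GiB

# When the quota is exceeded, etcd raises a cluster-wide NOSPACE alarm and
# rejects writes. Recovery means reclaiming space, then disarming the alarm:
etcdctl alarm list
etcdctl defrag
etcdctl alarm disarm
```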
I think I understand how you're using it, and I'm curious whether you've looked at how the AWS STS API solves its cross-region syncing.
AFAIK doesn't Consul also use Raft?
If you want apps to discover each other and be able to communicate effortlessly, even across datacenters, Consul, in theory, enables this.
I say in theory because I couldn't get federated Consul actually working.
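For anyone else attempting it: WAN federation joins only the server nodes of each datacenter; clients stay LAN-local. A minimal sketch using Consul's documented CLI (the hostname is hypothetical):

```shell
# On a Consul *server* in dc1, join it to a server in dc2 over the WAN pool.
# Client agents are not federated; only servers participate in the WAN gossip.
consul join -wan consul-server.dc2.example.com   # hypothetical hostname

# Verify federation from either side:
consul members -wan            # should list servers from both datacenters
consul catalog datacenters     # should print dc1 and dc2
```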
I used Consul for a clustered service once; it was worth it for bringup. But when I had problems I just wrote one myself in a couple of days, since I'd done so several times before. And it didn't fail in all the years that product was running.
Most others require pretty decent Docker knowledge.
Note that we grew the whole company from 25 to 60 over the last six months.
However, their transparency into outages and service rough edges is a double-edged sword: they’re building a reputation for unreliable software. It’s a shame to see this major outage happen right after last week’s post; it almost confirms the stereotype.
However, even with these flaws, I still think they’re building the best hosting out there. They’re taking bold risks and doing what others aren’t. I wish them the best.
This is a terrific way to word what might be happening unconsciously.
Fly posts about how hard things are during and after service outages -- while I also love the transparency, most people don't want to 'be a passenger on a plane that's being built while it's flying' especially when it comes to their business, myself included.
Oh boy. I wouldn’t wanna be the people doing this. Working with infrastructure is hard. Doing it under tight SLAs? Ugh. I really hope the people working on this are being well supported.
2. The SLA fly.io commits to is 99.9% uptime, meaning they can "afford" ~1.5m of downtime daily, or ~43m monthly. AWS "offers" 99.99% (~4m monthly) if I recall correctly, but their scale is also wildly different obviously.
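Those downtime budgets follow directly from the availability targets; a quick back-of-the-envelope check, assuming a 30-day month:

```shell
# Downtime allowed per 30-day month (43200 minutes) at each availability target.
awk 'BEGIN { printf "99.9%%:  %.1f min/month\n", (1 - 0.999)  * 30*24*60 }'   # 43.2
awk 'BEGIN { printf "99.99%%: %.1f min/month\n", (1 - 0.9999) * 30*24*60 }'   # 4.3
```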
On my side I took the opposite direction: each workload is shared-nothing.
My gut with Consul is don’t use it for high-load distributed services.
[1] https://blog.roblox.com/2022/01/roblox-return-to-service-10-...
I don't have a relationship with Hashicorp, and have tried using Consul. Everything about it is amazing in theory, but you might need a few years of experience with kube, consul, go, and maybe even the hashicorp stack to even begin debugging when things don't work as advertised.
I still think my company is going to take another stab at Consul in the future, because we do need service discovery. But they're advertising a solution to an incredibly hard problem with a shit ton of variations in network topology and infra that it should (theoretically) work on. I imagine if you stay on the happy path everything works out just fine with Consul (even then, maybe only most of the time). The problem is that they don't spell out what the happy path is, and that all the other knobs they expose off to the side actually lead down paths beleaguered by dragons.
It's Atlassian from Arkansas, just faster
Would be really interested to understand why it affects recently deployed apps but not apps that are already established - something to do with how the Fly Router works?
This outage prevented us from writing services to Consul, so we couldn't read them back out. Nomad will only really write service information to Consul, so we're kind of stuck with Consul in the loop until we're fully off Nomad.