There's nothing to brag about here, I just wanted to let y'all know we're listening (even when things aren't on the HN front page).
Hang in there. You all will learn from this and be better for it. Your architecture will improve. Customers will give you a second chance. This too shall pass.
Sending positive vibes.
Shame cuz we were excited about our nomad+consul+vault setup and invested a lot of money into building it. But just didn’t have the time or enough depth of expertise to babysit it.
Still love using Fly, please add static assets hosting/CDN.
It's either very smart (if they pull it off), because they'll have a ginormous cost advantage, or they fail.
I'm personally of the opinion that the ux on top of aws/gcp/... is worse than a doo-doo in a shoe. However, they are as stable as can be (all complex systems go down once in a while). There are very few mature projects that do not rely on aws/gcp/... managed services anyway. Might as well put in the little bit of effort to set yourself up for the future instead of painful migrations. This obviously doesn't hold for hobby projects.
In any case, I have a lot of respect for the engineering that fly does. Kudos.
AWS isn’t perfect but these lessons were learned by fire because these sorts of global outages can seriously harm reputations.
They even specifically call out Consul as a source of trouble.
> We propagate app instance and health information across all our regions. That’s how our proxies know where to route requests, and how our DNS servers know what names to give out.
> We started out using HashiCorp Consul for this. But we were shoehorning Consul, which has a centralized server model design for individual data center deployments, into a global service discovery role it wasn’t suited for. The result: continuously stale data, a proxy that would route to old expired interfaces, and private DNS that would routinely have stale entries.
As an aside, it's also taking down some decently-load-bearing web infra like unpkg => https://www.unpkg.com/
At least they're transparent about their issues, gotta give them that. I still kinda root for them, maybe they'll make a comeback.
“We are working to build a new Consul cluster with 10x the RAM. We aren't yet sure, but believe a routine DNS change might have created a thundering herd problem causing Consul servers to immediately increase RAM usage by 500%. This is not ideal.”
_This is not ideal._
Great read on how the issue was approached, handled, and ultimately remediated.
[1] https://blog.roblox.com/2022/01/roblox-return-to-service-10-...
Tried to restart our app from the command line, only to be told they had disabled the API. And there is no restart feature on their dashboard. So all I could do was watch the flyio logs telling me that our apps were down.
Sigh.
We moved from Heroku to Fly.io only this January, and are already considering moving away from it. The reliability is miserable at best. And so many basic features are missing. Yes, it's much cheaper than Heroku, but we ended up spending far more time/resources/money dealing with its glitches. That defeats the purpose of using a PaaS in the first place.
[1] We're using so little infra at present that we're within their free usage tier. However, I want to clarify that this isn't because we aren't willing to pay, we specifically want to pay for reliable managed offerings. That's actually the entire point! If Fly.io can deliver on their vision, we'd gladly be billed at 100x the current usage rates.
You don’t need to orchestrate a complex cluster to serve thousands or even millions of users. You can scale to hundreds of gigs of memory on a single machine nowadays.
Though I think a lot of this is incidental to just not really knowing the deal, and ops from scratch means you have to make a lot of tiny decisions like "OK how do I get this package over here, how do I set it up, do I wipe the VM on OS-level updates, do I need scripts for resetting the machine..." Having pre-made decisions for a bunch of questions means you aren't spending a bunch of time on tedious stuff when starting up a project.
As a person with no background in distributed systems, I am wondering why people choose Consul over alternatives. Are there features that etcd doesn't offer?
I don't believe etcd would have been any better for us, though. Centralized service discovery that runs through raft consensus doesn't make a lot of sense for the things we need to do. And when I've had etcd blow up on me in the past, it's been similarly painful to recover from.
Most people don't even know that the Kubernetes control plane by default has a hard limit on etcd size. It used to be 2GB, not sure what it is now.
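For reference, the limit in question is etcd's backend size quota. A minimal sketch of how it's raised and what happens when it's hit, based on the documented etcd v3 flags (the 8GiB figure is the documented upper recommendation, not a hard cap):

```shell
# etcd's backend size quota defaults to 2GiB; it can be raised at startup:
etcd --quota-backend-bytes=8589934592   # 8 GiB

# When the quota is exceeded, etcd raises a cluster-wide NOSPACE alarm and
# rejects writes. Recovery means reclaiming space, then disarming the alarm:
etcdctl alarm list
etcdctl defrag
etcdctl alarm disarm
```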
I think I understand how you're using it, and I'm curious whether you've looked at how the AWS STS API solves its cross-region syncing.
AFAIK doesn't Consul also use Raft?
If you want apps to discover each other and be able to communicate effortlessly, even across datacenters, Consul, in theory, enables this.
I say in theory because I couldn't get federated Consul actually working.
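For anyone else attempting it: WAN federation joins only the server nodes of each datacenter; clients stay LAN-local. A minimal sketch using Consul's documented CLI (the hostname is hypothetical):

```shell
# On a Consul *server* in dc1, join it to a server in dc2 over the WAN pool.
# Client agents are not federated; only servers participate in the WAN gossip.
consul join -wan consul-server.dc2.example.com   # hypothetical hostname

# Verify federation from either side:
consul members -wan            # should list servers from both datacenters
consul catalog datacenters     # should print dc1 and dc2
```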
I used Consul for a clustered service once; it was worth it for bringup. But when I had problems I just wrote one myself in a couple of days, since I'd done so several times before. And it didn't fail in all the years that product was running.
Most others require pretty decent Docker knowledge.
Note that we grew the whole company from 25 to 60 over the last six months.
However, their transparency into outages and service rough edges is a double-edged sword: they’re building a reputation for unreliable software. It’s a shame to see this major outage happen right after last week’s post; it almost confirms the stereotype.
However, even with these flaws, I still think they’re building the best hosting out there. They’re taking bold risks and doing what others aren’t. I wish them the best.
This is a terrific way to word what might be happening unconsciously.
Fly posts about how hard things are during and after service outages -- while I also love the transparency, most people don't want to 'be a passenger on a plane that's being built while it's flying' especially when it comes to their business, myself included.
Oh boy. I wouldn’t wanna be the people doing this. Working with infrastructure is hard. Doing it under tight SLAs? Ugh. I really hope the people working on this are being well supported.
2. The SLA fly.io commits to is 99.9% uptime, meaning they can "afford" ~1.5m of downtime daily, or ~43m monthly. AWS "offers" 99.99% (~4m monthly) if I recall correctly, but their scale is also wildly different obviously.
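Those downtime budgets follow directly from the availability targets; a quick back-of-the-envelope check, assuming a 30-day month:

```shell
# Downtime allowed per 30-day month (43200 minutes) at each availability target.
awk 'BEGIN { printf "99.9%%:  %.1f min/month\n", (1 - 0.999)  * 30*24*60 }'   # 43.2
awk 'BEGIN { printf "99.99%%: %.1f min/month\n", (1 - 0.9999) * 30*24*60 }'   # 4.3
```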
On my side I took the opposite direction: each workload is shared-nothing.
My gut with Consul is don’t use it for high-load distributed services.
[1] https://blog.roblox.com/2022/01/roblox-return-to-service-10-...
I don't have a relationship with Hashicorp, and have tried using Consul. Everything about it is amazing in theory, but you might need a few years of experience with kube, consul, go, and maybe even the hashicorp stack to even begin debugging when things don't work as advertised.
I still think my company is going to take another stab at Consul in the future, because we do need service discovery. But they're advertising a solution to an incredibly hard problem with a shit ton of variations in network topology and infra that it should (theoretically) work on. I imagine if you stay on the happy path everything works out just fine with Consul (even then, maybe only most of the time). The problem is that they don't spell out what the happy path is, and that all the other knobs they expose off to the side actually lead down paths beleaguered by dragons.
It's Atlassian from Arkansas, just faster
Would be really interested to understand why it affects recently deployed apps but not apps that are already established - something to do with how the Fly Router works?
This outage prevented us from writing services to Consul, so we couldn't read them back out. Nomad will only really write service information to Consul, so we're kind of stuck with Consul in the loop until we're fully off Nomad.