Source https://puck.nether.net/pipermail/outages/2020-August/013229...
It's a super useful tool if you want to blast out an ACL across your network in seconds (using BGP), but it has a number of sharp edges. Several networks, including Cloudflare, have learned what it can do. I've seen a few networks basically blackhole traffic or even lock themselves out of routers due to poorly made Flowspec rules or a bug in the implementation.
Edit: if you are a Level3 customer shut your sessions down to them.
There was a huge AT&T outage in 1990 that cut off most US long distance telephony (which was, at the time, mostly "everything not within the same area code").
It was a bug. It wasn't a reconvergence event, but it was a distant cousin: Something would cause a crash; exchanges would offload that something to other exchanges, causing them to crash -- but with enough time for the original exchange to come back up, receive the crashy event back, and crash again.
The whole network was full of nodes crashing, causing their peers to crash, ad infinitum. To bring the network back up, they would have needed to take everything down at the same time (and make sure all the queues were emptied), but even that wouldn't have made it stable, because a similar "patient 0" event would have brought the whole network down again.
Once the problem was understood, they reverted to an earlier version which didn't have the bug, and the network re-stabilized.
The lore I grew up on is that this specific event was very significant in pushing and funding research into robust distributed systems, of which the best known result is Erlang and its ecosystem - originally built, and still mostly used, to make sure that phone exchanges don't break.
[0] https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collap...
This was covered in a book (perhaps Safeware, but maybe another one I don't recall) along with the Therac-25, the Ariane 5, and several others. Unfortunately these lessons need to be relearned by each generation. See the 737 MAX...
I can't remember where I read about this, but I recall the problem was called "The Creeping Crud from California". Sadly, this phrase apparently does not appear anywhere on the internet. Did I imagine this?
Not an expert, but Erlang is listed as 1986, so it would seem not directly related: https://en.wikipedia.org/wiki/Erlang_(programming_language)
Example: https://mobile.twitter.com/TeliaCarrier/status/1300074378378...
If this is indeed a reconvergence event, that would imply there’s been a cascade of route table updates that have been making their way through CTL/L3’s network - meaning many routers are missing the “correct” paths to prefixes and traffic is not going where it is supposed to, either getting stuck in a routing loop or just going to /dev/null because the next hop isn’t available.
This wouldn’t be such a huge issue if downstream systems could shut down their BGP sessions with CTL and have traffic come in via other routes, but doing so is not resulting in the announcements being pulled from the Level 3 AS - something usually reflective of the CPU on the routers being overloaded processing route table updates or an issue with the BGP communication between them.
Convergence time is a known bugbear of BGP.
For each IP range described in the rumor table, each network is free to choose whichever rumor it likes best among all it has heard, and send traffic for that range along the described path. Typically this is the shortest path, but it doesn't have to be.
ISPs will pass on their favorite rumor for each range, adding themselves to the path of networks. (They must also withdraw the rumors if they become disconnected from their upstream source, or their upstream withdraws them.) Businesses like hosting providers won't pass on any rumors other than those they started, as no one involved wants them to be a path between the ISPs. (Most ISPs will generally restrict the kinds of rumors their non-ISP peers can spread, usually in terms of what IP ranges the peer owns.)
Convergence in BGP is easy in the "good news" direction, and a clusterfuck in the "bad news" direction. When a new range is advertised, or the path is getting shorter, it is smooth sailing, as each network more or less just takes the new route as is and passes it on without hesitation. In the bad news direction, where either something is getting retracted entirely, or the path is going to get much longer, we get something called "path hunting."
As an example of path hunting: let's say the old paths for a rumor were A-B-C and A-B-D, but C is also connected to D. (C and D spread rumors to each other, but the extended paths A-B-C-D and A-B-D-C are longer, thus not used yet.) A-B gets cut. B tells both C and D that it is withdrawing the rumor. Simultaneously D looks at the rumor A-B-C-D and C looks at the rumor A-B-D-C, and each says "well, I've got this slightly worse path lying around, might as well use it." Then they spread that rumor to their downstreams, not realizing that it is vulnerable to the same event that cost them the more direct route. (They have no idea why B withdrew the rumor from them.) The paths, especially when removing an IP range entirely, can get really crazy. (A lot of core internet infrastructure uses delays to prevent the same IP range from updating too often, which tamps down on the crazy path exploration and can actually speed things up in these cases.)
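To make that concrete, here is a toy simulation of the A-B-C-D example (plain Python, nothing like a real BGP implementation; the node names and message loop are purely illustrative). After B's withdrawal, C and D each briefly switch to the doomed longer path through the other before everything converges to "no route":

    # Toy simulation of BGP "path hunting" for the A-B-C-D example above
    # (not real BGP: no timers, no MRAI, no policy -- just shortest-AS-path
    # selection, loop prevention, and withdrawal propagation).
    from collections import deque

    # Topology after the A-B link is cut; B, C and D remain meshed.
    neighbors = {"B": ["C", "D"], "C": ["B", "D"], "D": ["B", "C"]}

    # Paths to A each node learned before the cut, keyed by the announcing peer.
    learned = {
        "B": {},                                      # B's only path was via A
        "C": {"B": ["B", "A"], "D": ["D", "B", "A"]},
        "D": {"B": ["B", "A"], "C": ["C", "B", "A"]},
    }

    def best(node):
        """Shortest learned AS path, or None if nothing is left."""
        paths = learned[node].values()
        return min(paths, key=len) if paths else None

    chosen = {n: best(n) for n in neighbors}
    events = deque(("B", peer, None) for peer in neighbors["B"])  # B withdraws

    while events:
        sender, receiver, path = events.popleft()
        learned[receiver].pop(sender, None)           # a new message replaces the old
        if path is not None and receiver not in path:
            learned[receiver][sender] = path          # accept non-looped announcement
        new = best(receiver)
        if new != chosen[receiver]:
            chosen[receiver] = new
            print(f"{receiver} switches to {new}")
            for peer in neighbors[receiver]:          # tell all peers about the change
                events.append((receiver, peer, None if new is None else [receiver] + new))

    print("Final state:", chosen)                     # everyone ends up with no path to A

The transient routes C and D print before everything goes to None are exactly the hunting that the update delays mentioned above help damp.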
IP network routing is distributed systems within distributed systems. For whatever reason, the distributed system that is the CenturyLink network isn't "converging" (or, we could say, becoming consistent, or settling) in a timely manner.
I had this problem two years ago while I was taking Go lessons online from a South Korean professional Go Master. For my last job we were renting a home well outside city limits in Illinois and our Internet failed often. I lost one game in an internal teaching tournament because of a failed connection, and jumped through hoops to avoid that problem.
Wasn't able to access HN from India earlier, but other cloudflare enabled services were accessible. I assume several Network Engineers were woken up from their Sunday morning sleep to fix the issue; if any of them is reading this, I appreciate your effort.
The games are timed and this pause gives a lot of thinking time. If they're allowed to talk with others during the pause, then also consulting time.
> why don't they start over
That would be unfair to the player who was ahead.
That said, both players might still be fine with a clean rematch, because being the undisputed winner feels better. I wonder if they were asked (anonymously to prevent public hate) whether they would be fine with a rematch.
> To use the old Internet as a “superhighway” analogy, that’s like only having a single offramp to a town. If the offramp is blocked, then there’s no way to reach the town. This was exacerbated in some cases because CenturyLink/Level(3)’s network was not honoring route withdrawals and continued to advertise routes to networks like Cloudflare’s even after they’d been withdrawn. In the case of customers whose only connectivity to the Internet is via CenturyLink/Level(3), or if CenturyLink/Level(3) continued to announce bad routes after they'd been withdrawn, there was no way for us to reach their applications and they continued to see 522 errors until CenturyLink/Level(3) resolved their issue around 14:30 UTC. The same was a problem on the other (“eyeball”) side of the network. Individuals need to have an onramp onto the Internet’s superhighway. An onramp to the Internet is essentially what your ISP provides. CenturyLink is one of the largest ISPs in the United States. Because this outage appeared to take all of the CenturyLink/Level(3) network offline, individuals who are CenturyLink customers would not have been able to reach Cloudflare or any other Internet provider until the issue was resolved. Globally, we saw a 3.5% drop in global traffic during the outage, nearly all of which was due to a nearly complete outage of CenturyLink’s ISP service across the United States.
It's cool to see something large enough that the auto-healing mechanisms weren't able to handle it on their own. Shoutout to whoever was on the weekend support/SRE shift, though; that stuff was never fun to deal with when you were one of a few reduced staff on the weekend shifts.
The problem is I don't know where to find out what was going on (I tried looking up live DDoS-tracking websites, "is it down or is it just me" websites, etc.). I couldn't find a single place talking about this.
Is there a source where you can get instant information on Level3 / global DNS / major outages?
The outages and NANOG mailing lists are your best bet, short of being on the right IRC channels.
I'm definitely an amateur when it comes to networking stuff. At the time, the _only_ issue I had was with all of my DigitalOcean droplets. It was confusing because I was able to get to them through my LTE connection but not through my home ISP. I opened a ticket with DO, worried that my ISP had suddenly started blocking IP addresses. It turned out to be this outage, but it was very specific. Traceroute gave some clues, but again, I'm an amateur and I couldn't tell what was happening after a certain point.
So yeah, I too would love a really easy-to-use page that could show outages like this. It would be great to be able to specify the vendors you use, to really piece the puzzle together.
So I guess my takeaway from this is that if the Internet seems to be down, usually the CDN providers notice. I don't know if either of the sites actually still use Fastly (I kind of forgot they existed), but I did end up reading about the Internet being broken at some scale larger than "your friend's cable modem is broken", so that was helpful.
It would be nice if we had a map of popular sites and which CDN they use, so we can collect a sampling of what's up and what's down and figure out which CDN is broken. Though in this case, it wasn't really the CDN's fault. Just collateral damage.
To learn the technical aspect of it, you can follow any network engineering certification materials or resources that delve into dynamic routing protocols, notably BGP. Inter-ISP networking is nothing but setting up BGP sessions and filters at the technical level. Why you set these up, and under what conditions is a whole different can of worms, though.
The business and political aspect is a bit more difficult to learn without practice, but a good simulacrum can be taking part in a project like dn42, or even just getting an ASN and some IPv6 PA space and trying to announce it somewhere. However, this is no substitute for actual experience running an ISP, negotiating percentile billing rates with salespeople, getting into IXes, answering peering requests, getting rejected from peering requests, etc. :)
Disclaimer: I helped start a non-profit ISP in part to learn about these things in practice.
That’s fairly expensive to do just for a hobby interest, but at least the price has come down since I last looked.
The Network Startup Resource Center out of UOregon has some good tutorials on BGP and connecting networks owned by different folks:
* https://www.youtube.com/watch?v=8SRjTqH5Z8M
NANOG also has a lot of good videos on their channel from their conferences, including one on optical fibre if you want to get into the low-level ISO Layer 1 stuff:
* https://www.youtube.com/watch?v=nKeZaNwPKPo
In a similar vein, NANOG "Panel: Demystifying Submarine Cables"
Once you understand BGP and Autonomous Systems (AS), you can then understand peering, as well as some of the politics that surround it.[2]
Then you can learn more about how specific networks are connected via public route servers and looking glass servers.[3][4][5]
Probably one of the best resources, though, is still to work for an ISP or other network provider for a stint.
[1] https://www.oreilly.com/library/view/bgp/9780596002541/
[2] http://drpeering.net/white-papers/Internet-Service-Providers...
[3] http://www.traceroute.org/#Looking%20Glass
I tried to make it accessible to those who have only a basic understanding of home networking. Assuming you know what a router is and what an ISP is, you should be able to ingest it without needing to know crazy jargon.
Many of the comments here presume knowledge about this stuff, and I can’t follow.
Geoff Huston's paper "Interconnection, Peering, and Settlements" is older, but still interesting and relevant in several ways.
I suggest "Where Wizards Stay Up Late: The Origins Of The Internet" - generic and talks about Internet history, but mentions several common misconseptions.
3 sth-cr2.link.netatonce.net (85.195.62.158)
4 te0-2-1-8.rcr51.b038034-0.sto03.atlas.cogentco.com
5 be3530.ccr21.sto03.atlas.cogentco.com (130.117.2.93)
6 be2282.ccr42.ham01.atlas.cogentco.com (154.54.72.105)
7 be2815.ccr41.ams03.atlas.cogentco.com (154.54.38.205)
8 be12194.ccr41.lon13.atlas.cogentco.com (154.54.56.93)
9 be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130)
10 be2315.ccr31.bio02.atlas.cogentco.com (154.54.61.113)
11 be2113.ccr42.atl01.atlas.cogentco.com (154.54.24.222)
12 be2112.ccr41.atl01.atlas.cogentco.com (154.54.7.158)
13 be2027.ccr22.mia03.atlas.cogentco.com (154.54.86.206)
14 be2025.ccr22.mia03.atlas.cogentco.com (154.54.47.230)
15 * level3.mia03.atlas.cogentco.com (154.54.10.58)
16 * * *
17 * * *

So as other providers shut down their links to CenturyLink to save themselves, the outgoing packets towards CenturyLink travel to some part of the world where the links are not shut down yet.
It would be really cool and useful to have a "public Internet health monitoring center"... this could be a foundation, financed in part by industry, that maintains a global internet health monitoring infrastructure and a central site at which all the major players announce outages. It would be pretty cheap and have a high return on investment for everybody involved.
https://puck.nether.net/mailman/listinfo/outages
Public archives:
https://puck.nether.net/pipermail/outages/
Latest issue reported:
https://puck.nether.net/pipermail/outages/2020-August/013187... "Level3 (globally?) impacted (IPv4 only)"
Just have a site fetch resources from every single hosting provider everywhere. A 1x1 image would be enough, but 1K/100K/1M sized files might also be useful (they could also be crafted images)
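A rough sketch of the probing side of that idea (the provider URLs below are made-up placeholders; you'd host the 1x1 image or test files on each provider yourself, and ideally probe from many vantage points rather than one server):

    # Fetch a tiny object from each hosting provider and report latency/failures.
    # URLs are placeholders for objects you'd host on each provider yourself.
    import time
    import urllib.request

    PROBES = {
        "provider-a": "https://provider-a.example.com/health/1x1.png",
        "provider-b": "https://provider-b.example.net/health/1x1.png",
        "provider-c": "https://provider-c.example.org/health/1x1.png",
    }

    def probe(name, url, timeout=5):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                resp.read()
                ms = (time.monotonic() - start) * 1000
                return f"{name}: OK (HTTP {resp.status}, {ms:.0f} ms)"
        except Exception as exc:
            return f"{name}: DOWN ({exc})"

    for name, url in PROBES.items():
        print(probe(name, url))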
The first step would be making the HTML page itself redundant. Strict round robin DNS might work well for that.
But yeah, moderately expensive - and... thinking about it... it'll honestly come in handy once every ten years? :/
Also the Internet has lots of asymmetric traffic, just because a forward path towards a destination may look the same from different networks, it doesn't mean the reverse path will be similar.
I first thought I had broken my DNS filter again through regular maintenance updates, then I suspected my ISP/modem because it regularly goes out. I have never seen the behavior I saw this morning: some sites failing to resolve.
When something doesn't work I always assume it's a problem with my device/configuration/connection.
Who would have thought it's a global event such as the repeated Facebook SDK issues.
It kind of makes it hard to route around an upstream, if they keep announcing your routes even when there isn't a path to you!
In this case however, it seems to be an L3/CL-specific bug.
Edit: Looks like I would have guessed wrong :P. Still want that inside scoop!
In some ways I'm a little bit disappointed it's only a glitch in the internet.
[1] https://web.stanford.edu/class/msande91si/www-spr04/readings...
Usually the Internet is a bit more resilient to these kinds of things, but there are complicating factors with this outage making it worse.
Expect it to mostly be resolved today. These things have been happening a bit more frequently lately, but historically they average up to a couple of times a year.
The one site I can't see is Twitter. (Not a heart-wrenching loss, mind you...)
Routing to a Level3 ISP where I have an office in the States peers with London15.Level3.net.
No problem to my Cogent ISP in the States; we don't peer directly with Cogent, so that bounces via Telia.
Going east from London: a 10-second outage at 12:28:42 GMT on a route that runs from me, through Level3, to Tata in India, but no rerouting.
An ssh tunnel through OVH/gravelines is working so far. edit: Proximus. edit2: also, Orange Mobile
7 166-49-209-132.gia.bt.net (166.49.209.132) 9.877 ms 8.929 ms
166-49-209-131.gia.bt.net (166.49.209.131) 8.975 ms
8 166-49-209-131.gia.bt.net (166.49.209.131) 8.645 ms 10.323 ms 10.434 ms
9 be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130) 95.018 ms
be3487.ccr41.lon13.atlas.cogentco.com (154.54.60.5) 7.627 ms
be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130) 102.570 ms
10 be3627.ccr41.jfk02.atlas.cogentco.com (66.28.4.197) 89.867 ms
be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130) 101.469 ms 101.655 ms
11 be2806.ccr41.dca01.atlas.cogentco.com (154.54.40.106) 103.990 ms 93.885 ms
be3627.ccr41.jfk02.atlas.cogentco.com (66.28.4.197) 97.525 ms
12 be2112.ccr41.atl01.atlas.cogentco.com (154.54.7.158) 106.027 ms
be2806.ccr41.dca01.atlas.cogentco.com (154.54.40.106) 98.149 ms 97.866 ms
13 be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70) 120.558 ms 122.330 ms 120.071 ms
14 be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70) 123.662 ms
be2927.ccr21.elp01.atlas.cogentco.com (154.54.29.222) 128.351 ms
be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70) 120.746 ms
15 be2929.ccr31.phx01.atlas.cogentco.com (154.54.42.65) 145.939 ms 137.652 ms
be2927.ccr21.elp01.atlas.cogentco.com (154.54.29.222) 128.043 ms
16 be2930.ccr32.phx01.atlas.cogentco.com (154.54.42.77) 150.015 ms
be2940.rcr51.san01.atlas.cogentco.com (154.54.6.121) 152.793 ms 152.720 ms
17 be2941.rcr52.san01.atlas.cogentco.com (154.54.41.33) 152.881 ms
te0-0-2-0.rcr11.san03.atlas.cogentco.com (154.54.82.66) 153.452 ms
be2941.rcr52.san01.atlas.cogentco.com (154.54.41.33) 152.054 ms
18 te0-0-2-0.rcr12.san03.atlas.cogentco.com (154.54.82.70) 162.835 ms
te0-0-2-0.nr11.b006590-1.san03.atlas.cogentco.com (154.24.18.190) 146.643 ms
te0-0-2-0.rcr12.san03.atlas.cogentco.com (154.54.82.70) 153.714 ms
19 te0-0-2-0.nr11.b006590-1.san03.atlas.cogentco.com (154.24.18.190) 151.212 ms 145.735 ms
38.96.10.250 (38.96.10.250) 147.092 ms
20 38.96.10.250 (38.96.10.250) 149.413 ms * *

You can use `-q 1` to send a single traceroute probe/query instead of the default 3; it might make your traceroute output look a little cleaner.
Got me questioning whether I really disconnected it before I left.
I'm wondering if we're at the point where internet outages should have some kind of (emergency) notification/sms sent to _everyone_.
AS3356 is Level 3, AS209 is CenturyLink.
https://mailman.nanog.org/pipermail/nanog/2020-August/209359...
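If you want to check an AS-number-to-name mapping like the one above yourself, one option is RIPEstat's public data API. A small sketch, assuming the "as-overview" endpoint and its "holder" field are as I remember them (worth double-checking against the RIPEstat docs):

    # Look up the registered holder of an AS number via RIPEstat's public
    # "as-overview" endpoint (endpoint name and response fields assumed
    # from memory -- verify against the RIPEstat documentation).
    import json
    import urllib.request

    def as_holder(asn: int) -> str:
        url = f"https://stat.ripe.net/data/as-overview/data.json?resource=AS{asn}"
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)["data"]["holder"]

    for asn in (3356, 209):
        print(f"AS{asn}: {as_holder(asn)}")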
Fastly, HN, Reddit too.
Only Google domains are loading here.
That doesn't really explain the "stuck" routes in their RRs... maybe it'll make sense once we've gotten some more details...
Their console isn't responding at all and all my servers are unreachable. Their status console reports all normal though.
Hacker News has been down for several hours for me.
Whatever it was it must have been nasty.
Connecting from Switzerland.
Edit: Somewhere between 14:00 and 14:46Z it also went down from O2; XS4ALL still works, and O2 can reach XS4ALL.
Doesn't work via Bouygues 4G.
My SFR fiber doesn't seem affected all that much. I've been following this for a while on the other HN post [0] and all services people have noted seem to work here.
Both SFR and Sipartech seem to have direct peerings with Cogentco.
[0] https://news.ycombinator.com/item?id=24322513
edit: Spotify seems partially down: app doesn't say it's offline, but songs won't play.
Modern society is all about consolidating systems into a few efficient solutions, typically dictated by market forces which, I argue, don't concern themselves much with these sorts of problems. As a result, when we run into problems, we're left with fewer options to fall back on and instead have to identify problems and develop new solutions on the fly. Consolidation leads to complacency and stagnation.
Sometimes this is reasonable (and even desirable). For certain non-critical systems it just doesn't make financial sense to pour resources into system diversity for things we could do without; find the one that works best/most efficiently and use it. If it breaks, it's not critical and the workaround can wait.
On the other hand, if a system is critical, then I think it behooves us to continue looking at improvements of existing systems and allotting resources to investigating new approaches.
Yes, because maybe so.
No, because the issue you're commenting on doesn't suggest that. It looks like the nature of this particular outage is such that a previous iteration of the Internet wouldn't have been any better equipped to resolve it faster.
However, they report that they've identified the issue and are fixing it.
Why do a few companies control the backbone of the internet? Shouldn’t there be a fallback or disaster recovery plan if one or more of these companies become unavailable?
I've lost too much precious time when GitHub/npm/Cloudflare were going down, before figuring out it was them.
So I'm currently working on a project[1] to monitor the whole third-party stack your services depend on. Hit me up if you want access; I'll give free access for a year+ to some folks to get feedback.
Edit: it’s up again!
Just want to let you know about the spelling error ”Save titme” :)
Congratulations on your startup!
There is at least one big tool that does exactly what you describe. It is called StatusGator: https://statusgator.com There are at least three much smaller ones.
Have you tried any of them? If yes, what's your point of difference?
And how do you plan to market it? As I see it, the plans are cheap, which means your LTV is low.
> Know when services you depend on goes down
"Services go down", not "goes".
We are applying corrective action in our data centers as the situation changes in order to improve reachability. (Aug 30, 14:26 UTC)
I'm doing something with the HN API as I type this, so for a moment I was trying to decide if I'd been IP blocked, even though the API is hosted by Firebase.
I haven't noticed any obvious issues elsewhere yet.
(Just got a delay while trying to submit this comment.)
(I hope this doesn't mean a violent crackdown is imminent)
Oy https://mobile.twitter.com/HannaLiubakova/status/13000645356...
Incidentally, uBlock Origin seems to be completely broken. Doesn't it have any local blacklists that keep working when their servers (?) are unavailable?
https://puck.nether.net/pipermail/outages/2020-August/thread...
Not a network engineer, but based on the comments there it looks like it's a BGP blackhole incident.
Edit: removed details about the similarity to a 1997 incident based in input from commenters.
As you aren’t a network engineer, I can understand making that leap based on the context, but no, this is nothing like the AS7007 event.
The “black hole” in this case is due to networks pulling their routes via AS3356 to try and avoid their outage, but when they do, CenturyLink is still announcing those routes and as such those networks blackhole.
I can't even access the private WoW server I play on.
#OutageBenefit ;)
If every website were a house, and every house had a house number (an IP address, either IPv4 or IPv6), and groups of houses formed cities and towns identified by a number (an AS, or Autonomous System, number), then the highways between cities would be like BGP routes. And if half of the world's internet traffic goes through the city of CenturyLink (AS3356),
and the city of CenturyLink (AS3356) shuts down traffic, either on purpose or by accident,
...then it doesn't matter whether your house number / IP address is a 32-bit number or a 128-bit number, because traffic needs to take a different route.
That's why everyone is worried about BGP routes, not IP addresses.
Set up two hosts, host A and host B, in two different data centers. Make them send HTTP requests to each other over IPv4 and over IPv6. You'll see that latency spikes and packet loss are more frequent over IPv6.
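A minimal sketch of what that comparison could look like from one side (the target hostname is a placeholder for your own second host; it times a bare TCP connect per address family, repeated so you compare distributions and failures rather than single samples):

    # Compare TCP connect latency and failures to a dual-stacked host over
    # IPv4 vs IPv6. "host-b.example.com" is a placeholder for your own host.
    import socket
    import statistics
    import time

    TARGET, PORT, SAMPLES = "host-b.example.com", 443, 20

    def connect_times(family):
        times = []
        for _ in range(SAMPLES):
            try:
                addr = socket.getaddrinfo(TARGET, PORT, family, socket.SOCK_STREAM)[0][4]
                start = time.monotonic()
                with socket.socket(family, socket.SOCK_STREAM) as s:
                    s.settimeout(3)
                    s.connect(addr)
                times.append((time.monotonic() - start) * 1000)
            except OSError:
                pass                     # count timeouts/errors as loss
            time.sleep(0.5)
        return times

    for name, fam in (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6)):
        t = connect_times(fam)
        lost = SAMPLES - len(t)
        if t:
            print(f"{name}: median {statistics.median(t):.1f} ms, "
                  f"max {max(t):.1f} ms, lost {lost}/{SAMPLES}")
        else:
            print(f"{name}: all {SAMPLES} probes failed")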
Again (https://news.ycombinator.com/item?id=24322988) not a network engineer, but it seemed like their routers actively stopped other networks from working around the problem since L3 would still keep pushing other networks' old routes, even after those networks tried to stop that.
Also: BGP probably needs to be redesigned from the ground up by software engineers with experience designing systems that keep working in the presence of hostile actors.
This has been attempted a number of times, but this is a political problem, not a technical problem: there's no single agreed source of truth for routing policy.
A lot of US Internet providers won't even sign up for ARIN's IRR, or move their legacy space to an RIR - so there isn't even any technical way of figuring out address space ownership and cryptographic trust (i.e., via RPKI). Hell, some non-RIR IRRs (like irr.net) are pretty much the fanfiction.net equivalent of IRRs, with anyone being able to write any record about ownership, without any practical verification (you just have to pay a fee for write access). And for some address space, these IRRs are the only information about ownership and policy that exists.
Without even knowing for sure who a given block belongs to, or who's allowed to announce it, or where, how do you want to fix any issues with a new dynamic routing protocol?
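To be fair, for the prefixes that do have ROAs you can already do origin validation today. A sketch only, assuming you run a local RPKI validator (Routinator or similar) with its HTTP interface enabled; the endpoint path and response shape below are placeholders from memory, so check your validator's docs:

    # Ask a locally running RPKI validator whether an (origin AS, prefix) pair
    # is covered by a valid ROA. Endpoint path and response fields are assumed
    # placeholders -- consult your validator's documentation for the real API.
    import json
    import urllib.request

    VALIDATOR = "http://localhost:8323"        # assumed local validator address

    def rpki_state(asn: str, prefix: str) -> str:
        url = f"{VALIDATOR}/api/v1/validity/{asn}/{prefix}"
        with urllib.request.urlopen(url, timeout=5) as resp:
            data = json.load(resp)
        # Typical states: "valid", "invalid", "not-found" (naming varies).
        return data["validated_route"]["validity"]["state"]

    print(rpki_state("AS13335", "1.1.1.0/24"))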
SCION from ETH Zurich:
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
That response time is atrocious. It wasn't that they needed to fix broken hardware; rather, they needed to stop the hardware they were running from actively sabotaging global routing via the inherently insecure BGP protocol. That took 3-4 hours to happen.
As an example: being in Sweden with an ISP that uses Telia Carrier for connectivity, things started working around the time of https://twitter.com/TeliaCarrier/status/1300074378378518528
https://twitter.com/TeliaCarrier/status/1300074378378518528?...