Source https://puck.nether.net/pipermail/outages/2020-August/013229...
It's a super useful tool if you want to blast out an ACL across your network in seconds (using BGP), but it has a number of sharp edges. Several networks, including Cloudflare, have learned what it can do. I've seen a few networks basically blackhole traffic or even lock themselves out of routers due to poorly made Flowspec rules or a bug in the implementation.
Edit: if you are a Level3 customer shut your sessions down to them.
There was a huge AT&T outage in 1990 that cut off most US long distance telephony (which was, at the time, mostly "everything not within the same area code").
It was a bug. It wasn't a reconvergence event, but it was a distant cousin: Something would cause a crash; exchanges would offload that something to other exchanges, causing them to crash -- but with enough time for the original exchange to come back up, receive the crashy event back, and crash again.
The whole network was full of nodes crashing, causing their peers to crash, ad infinitum. To bring the network back up, they would have needed to take everything down at the same time (and make sure all the queues were emptied), but even that wouldn't have made it stable, because a similar "patient 0" event would have brought the whole network down again.
Once the problem was understood, they reverted to an earlier version which didn't have the bug, and the network re-stabilized.
The lore I grew up on is that this specific event was very significant in pushing and funding research into robust distributed systems, of which the best known result is Erlang and its ecosystem - originally built, and still mostly used, to make sure that phone exchanges don't break.
[0] https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collap...
This was covered in a book (perhaps Safeware, but maybe another one I don't recall) along with the Therac-25, the Ariane 5, and several others. Unfortunately these lessons need to be relearned by each generation. See the 737 MAX...
I can't remember where I read about this, but I recall the problem was called "The Creeping Crud from California". Sadly, this phrase apparently does not appear anywhere on the internet. Did I imagine this?
Not an expert, but Erlang is listed as 1986, so it would seem not directly related: https://en.wikipedia.org/wiki/Erlang_(programming_language)
Example: https://mobile.twitter.com/TeliaCarrier/status/1300074378378...
If this is indeed a reconvergence event, that would imply there’s been a cascade of route table updates that have been making their way through CTL/L3’s network - meaning many routers are missing the “correct” paths to prefixes and traffic is not going where it is supposed to, either getting stuck in a routing loop or just going to /dev/null because the next hop isn’t available.
This wouldn’t be such a huge issue if downstream systems could shut down their BGP sessions with CTL and have traffic come in via other routes, but doing so is not resulting in the announcements being pulled from the Level 3 AS - something usually reflective of the CPU on the routers being overloaded processing route table updates or an issue with the BGP communication between them.
Convergence time is a known bugbear of BGP.
For each IP range described in the rumor table, each network is free to choose whichever rumor it likes best among all it has heard, and send traffic for that range along the described path. Typically this is the shortest path, but it doesn't have to be.
ISPs will pass on their favorite rumor for each range, adding themselves to the path of networks. (They must also withdraw the rumors if they become disconnected from their upstream source, or their upstream withdraws them.) Businesses like hosting providers won't pass on any rumors other than those they started, as no one involved wants them to be a path between the ISPs. (Most ISPs will generally restrict the kinds of rumors their non-ISP peers can spread, usually in terms of what IP ranges the peer owns.)
Convergence in BGP is easy in the "good news" direction, and a clusterfuck in the "bad news" direction. When a new range is advertised, or the path is getting shorter, it is smooth sailing, as each network more or less just takes the new route as is and passes it on without hesitation. In the bad news direction, where either something is getting retracted entirely, or the path is going to get much longer, we get something called "path hunting."
As an example of path hunting: let's say the old paths for a rumor were A-B-C and A-B-D, but C is also connected to D. (C and D spread rumors to each other, but the extended paths A-B-C-D and A-B-D-C are longer, thus not used yet.) A-B gets cut. B tells both C and D that it is withdrawing the rumor. Simultaneously D looks at the rumor A-B-C-D and C looks at the rumor A-B-D-C, and each says "well, I've got this slightly worse path lying around, might as well use it." Then they spread that rumor to their downstreams, not realizing that it is vulnerable to the same event that cost them the more direct route. (They have no idea why B withdrew the rumor from them.) The paths, especially when removing an IP range entirely, can get really crazy. (A lot of core internet infrastructure uses delays to prevent the same IP range from updating too often, which tamps down on the crazy path exploration and can actually speed things up in these cases.)
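To make that concrete, here is a toy simulation of the A-B-C-D example (plain Python, nothing like a real BGP implementation; the node names and message loop are purely illustrative). After B's withdrawal, C and D each briefly switch to the doomed longer path through the other before everything converges to "no route":

    # Toy simulation of BGP "path hunting" for the A-B-C-D example above
    # (not real BGP: no timers, no MRAI, no policy -- just shortest-AS-path
    # selection, loop prevention, and withdrawal propagation).
    from collections import deque

    # Topology after the A-B link is cut; B, C and D remain meshed.
    neighbors = {"B": ["C", "D"], "C": ["B", "D"], "D": ["B", "C"]}

    # Paths to A each node learned before the cut, keyed by the announcing peer.
    learned = {
        "B": {},                                      # B's only path was via A
        "C": {"B": ["B", "A"], "D": ["D", "B", "A"]},
        "D": {"B": ["B", "A"], "C": ["C", "B", "A"]},
    }

    def best(node):
        """Shortest learned AS path, or None if nothing is left."""
        paths = learned[node].values()
        return min(paths, key=len) if paths else None

    chosen = {n: best(n) for n in neighbors}
    events = deque(("B", peer, None) for peer in neighbors["B"])  # B withdraws

    while events:
        sender, receiver, path = events.popleft()
        learned[receiver].pop(sender, None)           # a new message replaces the old
        if path is not None and receiver not in path:
            learned[receiver][sender] = path          # accept non-looped announcement
        new = best(receiver)
        if new != chosen[receiver]:
            chosen[receiver] = new
            print(f"{receiver} switches to {new}")
            for peer in neighbors[receiver]:          # tell all peers about the change
                events.append((receiver, peer, None if new is None else [receiver] + new))

    print("Final state:", chosen)                     # everyone ends up with no path to A

The transient routes C and D print before everything goes to None are exactly the hunting that the update delays mentioned above help damp.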
IP network routing is distributed systems within distributed systems. For whatever reason, the distributed system that is the CenturyLink network isn't "converging" (or, we could say, becoming consistent, or settling) in a timely manner.
I had this problem two years ago while I was taking Go lessons online from a South Korean professional Go Master. For my last job we were renting a home well outside city limits in Illinois and our Internet failed often. I lost one game in an internal teaching tournament because of a failed connection, and jumped through hoops to avoid that problem.
Wasn't able to access HN from India earlier, but other cloudflare enabled services were accessible. I assume several Network Engineers were woken up from their Sunday morning sleep to fix the issue; if any of them is reading this, I appreciate your effort.
The games are timed and this pause gives a lot of thinking time. If they're allowed to talk with others during the pause, then also consulting time.
> why don't they start over
That would be unfair to the player who was ahead.
That said, both players might still be fine with a clean rematch, because being the undisputed winner feels better. I wonder if they were asked (anonymously to prevent public hate) whether they would be fine with a rematch.
> To use the old Internet as a “superhighway” analogy, that’s like only having a single offramp to a town. If the offramp is blocked, then there’s no way to reach the town. This was exacerbated in some cases because CenturyLink/Level(3)’s network was not honoring route withdrawals and continued to advertise routes to networks like Cloudflare’s even after they’d been withdrawn. In the case of customers whose only connectivity to the Internet is via CenturyLink/Level(3), or if CenturyLink/Level(3) continued to announce bad routes after they'd been withdrawn, there was no way for us to reach their applications and they continued to see 522 errors until CenturyLink/Level(3) resolved their issue around 14:30 UTC. The same was a problem on the other (“eyeball”) side of the network. Individuals need to have an onramp onto the Internet’s superhighway. An onramp to the Internet is essentially what your ISP provides. CenturyLink is one of the largest ISPs in the United States. Because this outage appeared to take all of the CenturyLink/Level(3) network offline, individuals who are CenturyLink customers would not have been able to reach Cloudflare or any other Internet provider until the issue was resolved. Globally, we saw a 3.5% drop in global traffic during the outage, nearly all of which was due to a nearly complete outage of CenturyLink’s ISP service across the United States.
It's cool to see something large enough that the auto-healing mechanisms weren't able to handle it on their own. Shoutout to whoever was on the weekend support/SRE shift, though; that stuff was never fun to deal with when you were one of a few reduced staff on the weekend shifts.
The problem is I don't know where to find out what was going on (I tried looking up live DDoS-tracking websites, "is it down or is it just me" websites, etc.). I couldn't find a single place talking about this.
Is there a source where you can get instant information on Level3 / global DNS / major outages?
The outages and NANOG mailing lists are your best bet, short of being on the right IRC channels.
I'm definitely an amateur when it comes to networking stuff. At the time, the _only_ issue I had was with all of my DigitalOcean droplets. It was confusing because I was able to get to them through my LTE connection but not through my home ISP. I opened a ticket with DO, worried that my ISP had suddenly started blocking IP addresses. It turned out to be this outage, but it was very specific. Traceroute gave some clues, but again, I'm an amateur and I couldn't tell what was happening after a certain point.
So yeah, I too would love a really easy-to-use page that could show outages like this. It would be great to be able to specify the vendors you use, to really piece the puzzle together.
So I guess my takeaway from this is that if the Internet seems to be down, usually the CDN providers notice. I don't know if either of the sites actually still use Fastly (I kind of forgot they existed), but I did end up reading about the Internet being broken at some scale larger than "your friend's cable modem is broken", so that was helpful.
It would be nice if we had a map of popular sites and which CDN they use, so we can collect a sampling of what's up and what's down and figure out which CDN is broken. Though in this case, it wasn't really the CDN's fault. Just collateral damage.
To learn the technical aspect of it, you can follow any network engineering certification materials or resources that delve into dynamic routing protocols, notably BGP. Inter-ISP networking is nothing but setting up BGP sessions and filters at the technical level. Why you set these up, and under what conditions is a whole different can of worms, though.
The business and political aspect is a bit more difficult to learn without practice, but a good simulacrum can be taking part in a project like dn42, or even just getting an ASN and some IPv6 PA space and trying to announce it somewhere. However, this is no substitute for actual experience running an ISP, negotiating percentile billing rates with salespeople, getting into IXes, answering peering requests, getting rejected from peering requests, etc. :)
Disclaimer: I helped start a non-profit ISP in part to learn about these things in practice.
That’s fairly expensive to do just for a hobby interest, but at least the price has come down since I last looked.
The Network Startup Resource Center out of UOregon has some good tutorials on BGP and connecting networks owned by different folks:
* https://www.youtube.com/watch?v=8SRjTqH5Z8M
NANOG also has a lot of good videos on their channel from their conferences, including one on optical fibre if you want to get into the low-level ISO Layer 1 stuff:
* https://www.youtube.com/watch?v=nKeZaNwPKPo
In a similar vein, NANOG "Panel: Demystifying Submarine Cables"
Once you understand BGP and Autonomous Systems (AS), you can then understand peering, as well as some of the politics that surround it.[2]
Then you can learn more about how specific networks are connected via public route servers and looking glass servers.[3][4][5]
Probably one of the best resources, though, is still to work for an ISP or other network provider for a stint.
[1] https://www.oreilly.com/library/view/bgp/9780596002541/
[2] http://drpeering.net/white-papers/Internet-Service-Providers...
[3] http://www.traceroute.org/#Looking%20Glass
I tried to make it accessible to those who have only a basic understanding of home networking. Assuming you know what a router is and what an ISP is, you should be able to ingest it without needing to know crazy jargon.
Many of the comments here presume knowledge about this stuff, and I can’t follow.
Geoff Huston's paper "Interconnection, Peering, and Settlements" is older, but still interesting and relevant in several ways.
I suggest "Where Wizards Stay Up Late: The Origins Of The Internet" - generic and talks about Internet history, but mentions several common misconseptions.
3 sth-cr2.link.netatonce.net (85.195.62.158)
4 te0-2-1-8.rcr51.b038034-0.sto03.atlas.cogentco.com
5 be3530.ccr21.sto03.atlas.cogentco.com (130.117.2.93)
6 be2282.ccr42.ham01.atlas.cogentco.com (154.54.72.105)
7 be2815.ccr41.ams03.atlas.cogentco.com (154.54.38.205)
8 be12194.ccr41.lon13.atlas.cogentco.com (154.54.56.93)
9 be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130)
10 be2315.ccr31.bio02.atlas.cogentco.com (154.54.61.113)
11 be2113.ccr42.atl01.atlas.cogentco.com (154.54.24.222)
12 be2112.ccr41.atl01.atlas.cogentco.com (154.54.7.158)
13 be2027.ccr22.mia03.atlas.cogentco.com (154.54.86.206)
14 be2025.ccr22.mia03.atlas.cogentco.com (154.54.47.230)
15 * level3.mia03.atlas.cogentco.com (154.54.10.58)
16 * * *
17 * * *

So as other providers shut down their links to CenturyLink to save themselves, the outgoing packets towards CenturyLink travel to some part of the world where the links are not shut down yet.
It would be really cool and useful to have a "public Internet health monitoring center"... this could be a foundation, financed in part by industry, that maintains a global internet health monitoring infrastructure and a central site at which all the major players announce outages. It would be pretty cheap and have a high return on investment for everybody involved.
https://puck.nether.net/mailman/listinfo/outages
Public archives:
https://puck.nether.net/pipermail/outages/
Latest issue reported:
https://puck.nether.net/pipermail/outages/2020-August/013187... "Level3 (globally?) impacted (IPv4 only)"
Just have a site fetch resources from every single hosting provider everywhere. A 1x1 image would be enough, but 1K/100K/1M sized files might also be useful (they could also be crafted images)
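A rough sketch of the probing side of that idea (the provider URLs below are made-up placeholders; you'd host the 1x1 image or test files on each provider yourself, and ideally probe from many vantage points rather than one server):

    # Fetch a tiny object from each hosting provider and report latency/failures.
    # URLs are placeholders for objects you'd host on each provider yourself.
    import time
    import urllib.request

    PROBES = {
        "provider-a": "https://provider-a.example.com/health/1x1.png",
        "provider-b": "https://provider-b.example.net/health/1x1.png",
        "provider-c": "https://provider-c.example.org/health/1x1.png",
    }

    def probe(name, url, timeout=5):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                resp.read()
                ms = (time.monotonic() - start) * 1000
                return f"{name}: OK (HTTP {resp.status}, {ms:.0f} ms)"
        except Exception as exc:
            return f"{name}: DOWN ({exc})"

    for name, url in PROBES.items():
        print(probe(name, url))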
The first step would be making the HTML page itself redundant. Strict round robin DNS might work well for that.
But yeah, moderately expensive - and... thinking about it... it'll honestly come in handy once every ten years? :/
Also the Internet has lots of asymmetric traffic, just because a forward path towards a destination may look the same from different networks, it doesn't mean the reverse path will be similar.
I first thought I had broken my DNS filter again through regular maintenance updates, then I suspected my ISP/modem because it regularly goes out. I have never seen the behavior I saw this morning: some sites failing to resolve.
When something doesn't work I always assume it's a problem with my device/configuration/connection.
Who would have thought it's a global event such as the repeated Facebook SDK issues.
It kind of makes it hard to route around an upstream, if they keep announcing your routes even when there isn't a path to you!
In this case however, it seems to be an L3/CL-specific bug.
Edit: Looks like I would have guessed wrong :P. Still want that inside scoop!
In some ways I'm a little bit disappointed it's only a glitch in the internet.
[1] https://web.stanford.edu/class/msande91si/www-spr04/readings...
Usually the Internet is a bit more resilient to these kinds of things, but there are complicating factors with this outage making it worse.
Expect it to mostly be resolved today. These things have been happening a bit more frequently lately, but historically they average up to a couple of times a year.
The one site I can't see is Twitter. (Not a heart-wrenching loss, mind you...)
Routing to a Level3 ISP where I have an office in the States peers with London15.Level3.net.
No problem to my Cogent ISP in the States; we don't peer directly with Cogent, so that bounces via Telia.
Going east from London: a 10-second outage at 12:28:42 GMT on a route that runs from me, through Level3, to Tata in India, but no rerouting.
An ssh tunnel through OVH/gravelines is working so far. edit: Proximus. edit2: also, Orange Mobile
7 166-49-209-132.gia.bt.net (166.49.209.132) 9.877 ms 8.929 ms
166-49-209-131.gia.bt.net (166.49.209.131) 8.975 ms
8 166-49-209-131.gia.bt.net (166.49.209.131) 8.645 ms 10.323 ms 10.434 ms
9 be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130) 95.018 ms
be3487.ccr41.lon13.atlas.cogentco.com (154.54.60.5) 7.627 ms
be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130) 102.570 ms
10 be3627.ccr41.jfk02.atlas.cogentco.com (66.28.4.197) 89.867 ms
be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130) 101.469 ms 101.655 ms
11 be2806.ccr41.dca01.atlas.cogentco.com (154.54.40.106) 103.990 ms 93.885 ms
be3627.ccr41.jfk02.atlas.cogentco.com (66.28.4.197) 97.525 ms
12 be2112.ccr41.atl01.atlas.cogentco.com (154.54.7.158) 106.027 ms
be2806.ccr41.dca01.atlas.cogentco.com (154.54.40.106) 98.149 ms 97.866 ms
13 be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70) 120.558 ms 122.330 ms 120.071 ms
14 be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70) 123.662 ms
be2927.ccr21.elp01.atlas.cogentco.com (154.54.29.222) 128.351 ms
be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70) 120.746 ms
15 be2929.ccr31.phx01.atlas.cogentco.com (154.54.42.65) 145.939 ms 137.652 ms
be2927.ccr21.elp01.atlas.cogentco.com (154.54.29.222) 128.043 ms
16 be2930.ccr32.phx01.atlas.cogentco.com (154.54.42.77) 150.015 ms
be2940.rcr51.san01.atlas.cogentco.com (154.54.6.121) 152.793 ms 152.720 ms
17 be2941.rcr52.san01.atlas.cogentco.com (154.54.41.33) 152.881 ms
te0-0-2-0.rcr11.san03.atlas.cogentco.com (154.54.82.66) 153.452 ms
be2941.rcr52.san01.atlas.cogentco.com (154.54.41.33) 152.054 ms
18 te0-0-2-0.rcr12.san03.atlas.cogentco.com (154.54.82.70) 162.835 ms
te0-0-2-0.nr11.b006590-1.san03.atlas.cogentco.com (154.24.18.190) 146.643 ms
te0-0-2-0.rcr12.san03.atlas.cogentco.com (154.54.82.70) 153.714 ms
19 te0-0-2-0.nr11.b006590-1.san03.atlas.cogentco.com (154.24.18.190) 151.212 ms 145.735 ms
38.96.10.250 (38.96.10.250) 147.092 ms
20 38.96.10.250 (38.96.10.250) 149.413 ms * *

You can use `-q 1` to send a single traceroute probe/query instead of the default 3; it might make your traceroute output look a little cleaner.
Got me questioning whether I really disconnected it before I left.
I'm wondering if we're at the point where internet outages should have some kind of (emergency) notification/sms sent to _everyone_.
AS3356 is Level 3, AS209 is CenturyLink.
https://mailman.nanog.org/pipermail/nanog/2020-August/209359...
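If you want to check an AS-number-to-name mapping like the one above yourself, one option is RIPEstat's public data API. A small sketch, assuming the "as-overview" endpoint and its "holder" field are as I remember them (worth double-checking against the RIPEstat docs):

    # Look up the registered holder of an AS number via RIPEstat's public
    # "as-overview" endpoint (endpoint name and response fields assumed
    # from memory -- verify against the RIPEstat documentation).
    import json
    import urllib.request

    def as_holder(asn: int) -> str:
        url = f"https://stat.ripe.net/data/as-overview/data.json?resource=AS{asn}"
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)["data"]["holder"]

    for asn in (3356, 209):
        print(f"AS{asn}: {as_holder(asn)}")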
Fastly, HN, Reddit too.
Only Google domains are loading here.
That doesn't really explain the "stuck" routes in their RRs... maybe it'll make sense once we've gotten some more details...
Their console isn't responding at all and all my servers are unreachable. Their status console reports all normal though.
Hacker News has been down for several hours for me.
Whatever it was it must have been nasty.
Connecting from Switzerland.
Edit: Somewhere between 14:00 and 14:46Z it also went down from O2; XS4ALL still works, and O2 can reach XS4ALL.
Doesn't work via Bouygues 4G.
My SFR fiber doesn't seem affected all that much. I've been following this for a while on the other HN post [0] and all services people have noted seem to work here.
Both SFR and Sipartech seem to have direct peerings with Cogentco.
[0] https://news.ycombinator.com/item?id=24322513
edit: Spotify seems partially down: app doesn't say it's offline, but songs won't play.
Modern society is all about consolidating systems into a few efficient solutions, typically dictated by market forces which, I argue, don't concern themselves much with these sorts of problems. As a result, when we run into problems, we're left with fewer options to fall back on and instead have to identify problems and develop new solutions on the fly. Consolidation leads to complacency and stagnation.
Sometimes this is reasonable (and even desirable). For certain non-critical systems it just doesn't make financial sense to pour resources into system diversity for things we could do without; find the one that works best/most efficiently and use it. If it breaks, it's not critical and the workaround can wait.
On the other hand, if a system is critical, then I think it behooves us to continue looking at improvements of existing systems and allotting resources to investigating new approaches.
Yes, because maybe so.
No, because the issue you're commenting on doesn't suggest that. It looks like the nature of this particular outage is such that a previous iteration of the Internet wouldn't have been any better equipped to resolve it faster.
However, they report that they've identified the issue and are fixing it.
Why do a few companies control the backbone of the internet? Shouldn’t there be a fallback or disaster recovery plan if one or more of these companies become unavailable?
I've lost too much precious time when GitHub/npm/Cloudflare were going down, before figuring out it was them.
So I'm currently working on a project[1] to monitor the whole third-party stack your services depend on. Hit me up if you want access; I'll give free access for a year+ to some folks to get feedback.
Edit: it’s up again!
Just want to let you know about the spelling error ”Save titme” :)
Congratulations on your startup!
There is at least one big tool that does exactly what you describe. It is called StatusGator: https://statusgator.com There are at least three much smaller ones.
Have you tried any of them? If yes, what's your point of difference?
And how do you plan to market it? As I see it, the plans are cheap, which means your LTV is low.
> Know when services you depend on goes down
"Services go down", not "goes".
We are applying corrective action in our data centers as the situation changes in order to improve reachability. (Aug 30, 14:26 UTC)
I'm doing something with the HN API as I type this, so for a moment I was trying to decide if I'd been IP blocked, even though the API is hosted by Firebase.
I haven't noticed any obvious issues elsewhere yet.
(Just got a delay while trying to submit this comment.)
(I hope this doesn't mean a violent crackdown is imminent)
Oy https://mobile.twitter.com/HannaLiubakova/status/13000645356...
Incidentally, uBlock Origin seems to be completely broken. Doesn't it have any local blacklists that keep working when their servers (?) are unavailable?
https://puck.nether.net/pipermail/outages/2020-August/thread...
Not a network engineer, but based on the comments there it looks like it's a BGP blackhole incident.
Edit: removed details about the similarity to a 1997 incident based in input from commenters.
As you aren’t a network engineer, I can understand making that leap based on the context, but no, this is nothing like the AS7007 event.
The “black hole” in this case is due to networks pulling their routes via AS3356 to try and avoid their outage, but when they do, CenturyLink is still announcing those routes and as such those networks blackhole.
I can't even access the private WoW server I play on.
#OutageBenefit ;)
If every website were a house, and every house had a house number (an IP address, either IPv4 or IPv6), and groups of houses formed cities and towns identified by a number (an AS, or Autonomous System, number), then the highways between cities would be like BGP routes. And if half of the world's internet traffic goes through the city of CenturyLink (AS3356),
and the city of CenturyLink (AS3356) shuts down traffic, either on purpose or by accident,
...then it doesn't matter whether your house number / IP address is a 32-bit number or a 128-bit number, because traffic needs to take a different route.
That's why everyone is worried about BGP routes, not IP addresses.
Set up two hosts, host A and host B, in two different data centers. Make them send HTTP requests to each other over IPv4 and over IPv6. You'll see that latency spikes and packet loss are more frequent over IPv6.
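A minimal sketch of what that comparison could look like from one side (the target hostname is a placeholder for your own second host; it times a bare TCP connect per address family, repeated so you compare distributions and failures rather than single samples):

    # Compare TCP connect latency and failures to a dual-stacked host over
    # IPv4 vs IPv6. "host-b.example.com" is a placeholder for your own host.
    import socket
    import statistics
    import time

    TARGET, PORT, SAMPLES = "host-b.example.com", 443, 20

    def connect_times(family):
        times = []
        for _ in range(SAMPLES):
            try:
                addr = socket.getaddrinfo(TARGET, PORT, family, socket.SOCK_STREAM)[0][4]
                start = time.monotonic()
                with socket.socket(family, socket.SOCK_STREAM) as s:
                    s.settimeout(3)
                    s.connect(addr)
                times.append((time.monotonic() - start) * 1000)
            except OSError:
                pass                     # count timeouts/errors as loss
            time.sleep(0.5)
        return times

    for name, fam in (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6)):
        t = connect_times(fam)
        lost = SAMPLES - len(t)
        if t:
            print(f"{name}: median {statistics.median(t):.1f} ms, "
                  f"max {max(t):.1f} ms, lost {lost}/{SAMPLES}")
        else:
            print(f"{name}: all {SAMPLES} probes failed")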
Again (https://news.ycombinator.com/item?id=24322988) not a network engineer, but it seemed like their routers actively stopped other networks from working around the problem since L3 would still keep pushing other networks' old routes, even after those networks tried to stop that.
Also: BGP probably needs to be redesigned from the ground up by software engineers with experience designing systems that keep working in the presence of hostile actors.
This has been attempted a number of times, but this is a political problem, not a technical problem: there's no single agreed source of truth for routing policy.
A lot of US Internet providers won't even sign up for ARIN's IRR, or move their legacy space to an RIR - so there isn't even any technical way of figuring out address space ownership and cryptographic trust (i.e., via RPKI). Hell, some non-RIR IRRs (like irr.net) are pretty much the fanfiction.net equivalent of IRRs, with anyone being able to write any record about ownership, without any practical verification (you just have to pay a fee for write access). And for some address space, these IRRs are the only information about ownership and policy that exists.
Without even knowing for sure who a given block belongs to, or who's allowed to announce it, or where, how do you want to fix any issues with a new dynamic routing protocol?
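To be fair, for the prefixes that do have ROAs you can already do origin validation today. A sketch only, assuming you run a local RPKI validator (Routinator or similar) with its HTTP interface enabled; the endpoint path and response shape below are placeholders from memory, so check your validator's docs:

    # Ask a locally running RPKI validator whether an (origin AS, prefix) pair
    # is covered by a valid ROA. Endpoint path and response fields are assumed
    # placeholders -- consult your validator's documentation for the real API.
    import json
    import urllib.request

    VALIDATOR = "http://localhost:8323"        # assumed local validator address

    def rpki_state(asn: str, prefix: str) -> str:
        url = f"{VALIDATOR}/api/v1/validity/{asn}/{prefix}"
        with urllib.request.urlopen(url, timeout=5) as resp:
            data = json.load(resp)
        # Typical states: "valid", "invalid", "not-found" (naming varies).
        return data["validated_route"]["validity"]["state"]

    print(rpki_state("AS13335", "1.1.1.0/24"))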
SCION from ETH Zurich:
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
That response time is atrocious. It wasn't that they needed to fix broken hardware; rather, they needed to stop the hardware they were running from actively sabotaging global routing via the inherently insecure BGP protocol. That took 3-4 hours to happen.
As an example: being in Sweden with an ISP that uses Telia Carrier for connectivity, things started working around the time of https://twitter.com/TeliaCarrier/status/1300074378378518528
https://twitter.com/TeliaCarrier/status/1300074378378518528?...