Mediocre Engineer's Guide to HTTPS

(devonperoutky.super.site)

314 points · MediumD · 1y ago · 35 comments

jessriedel 1y ago

Tangential question from a layman: when I lose access to a particular website, or the internet as a whole, why is it so hard to tell where in the chain the failure is occurring? Like it’s often unclear whether

* I’ve got a network misconfiguration on my local machine;

* My wifi connection to the router is down;

* The cable between my router and ISP is cut;

* My ISP is having large scale issues; or

* The website I’m trying to reach is down.

I’ve been given the vague impression that it has something to do with a non-deterministic path by which requests are routed, but this seems unconvincing. If some link on the path breaks, why doesn’t the last good link send a message backward that says “Your message made it to me, but I tried to send it the next step and it failed there.”

treflop 1y ago

It’s possible to figure out exactly what failed if you know how it all works.

But writing a tool that provides a useful description to the user is nearly impossible, because no two setups are the same, it’s not possible to know whether something is intentional, and it can be dangerous to make an assumption based on the common causes and suggest a completely wrong answer to the user.

For example, let’s say you can’t connect to a website because the DNS server isn’t responding and the host isn’t responding. You could tell the user that something is probably misconfigured at your router or your ISP is having some issues.

However, it turns out that the actual reason was that your VPN client updated your local routing tables and DNS server but failed to remove the changes when you quit the client. How is a troubleshooter supposed to know that the settings were temporarily changed versus it being the permanent ones?

Once you start trying to write a troubleshooter that can identify the actual cause, you realize that it’s very difficult due to the complexity and variation. At best you can write something that usually spits out a correct answer but also sometimes suggests something totally wrong and leads people down a completely wrong path.

jessriedel 1y ago

If Google dedicated 10 engineers full time to this problem for 3 years, could they solve it?

avoid3d 1y ago

I work for an acquired startup that tried to solve this problem.

It’s been around 8 years and we’re up to 50 or so people. I’d say we are okay at it.

We haven’t gotten fundamentally better recently; it’s more like there is some asymptote to how much you can really tell with a certain amount of insight into the systems between source and destination.

The only real progress we’ve made has been integrating with more and more sources of information about the state of the network.

awesomeMilou 1y ago

Yes, and they partially have. Browsers are great at telling you where the chain has failed or been cut, though some error messages seem to be intentionally uninformative, as the information provided would be meaningless to your average user.

That said, from an enthusiast perspective, running traceroute to the nearest Google service (1e100.net for example) will already give you a huge tip on where things went wrong.

otabdeveloper4 1y ago

As long as you only ever visit Google web properties, yes.

evilDagmar 1y ago

Short answer: No.

nurple 1y ago

If ICMP is allowed into your network, your machine will most likely receive a Destination Unreachable response from the host that can't forward the packet further.

Your application won't see the ICMP message unless you configure the socket to report them (these are considered "transient" errors). On Linux this is done via the socket option IP_RECVERR.

ETA: there's not a ton of value collecting errors at this layer when you're working at L7. The errors that _do_ get surfaced for DU at your layer will be appropriate for the failure handling logic you'll inevitably have already. In this case I think it'd be a timeout, as other layers implement retries in the face of unreachable destinations.

I found these RFCs helpful re: how the TCP layer handles ICMP errors: https://www.rfc-editor.org/rfc/rfc1122#page-103

Section 4.2.3.9:

> Since these Unreachable messages indicate soft error conditions, TCP MUST NOT abort the connection, and it SHOULD make the information available to the application.

> DISCUSSION: TCP could report the soft error condition to the application layer with an upcall to the ERROR_REPORT routine, or it could merely note the message and report it to the application only when and if the TCP connection times out.

This one gets into the nitty gritty of how the stacks interact in order to study ICMP as a vector for TCP attacks.

https://www.rfc-editor.org/rfc/rfc5927

cancerhacker 1y ago

The browser reports the error closest to what it was doing at the time. Host not found? Well, the network was reliable enough to reach a DNS server that returned the lack of an address for a name. But if the DNS server itself can’t be reached, it’s some sort of network error between you and that server. The typical way to diagnose that kind of problem is to perform all the steps yourself: can I ping the DNS server address? Can I resolve this host with that DNS server? What about a different DNS server? Maybe that particular name is being excluded because of corporate policy. The command line tools ping, traceroute and dig are useful if you want to get into it.
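
The "perform all the steps yourself" approach can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function name `diagnose` and its injectable `resolve` and `connect` parameters are made up here (they exist so the stages can be swapped out), and it only tells you which stage failed, not why.

```python
import socket
import ssl


def diagnose(host, port=443, timeout=5.0,
             resolve=socket.getaddrinfo,
             connect=socket.create_connection):
    """Try DNS, then TCP, then TLS, and report the first stage that fails."""
    # Stage 1: name resolution. A failure here says nothing about the website.
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return f"dns: could not resolve {host!r} ({exc})"

    # Stage 2: TCP connection to the first address the resolver returned.
    address = infos[0][4][:2]
    try:
        sock = connect(address, timeout=timeout)
    except OSError as exc:
        return f"tcp: could not connect to {address} ({exc})"

    # Stage 3: TLS handshake, including certificate verification.
    try:
        context = ssl.create_default_context()
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return f"ok: negotiated {tls.version()}"
    except (ssl.SSLError, OSError) as exc:
        sock.close()
        return f"tls: handshake failed ({exc})"
```

Calling `diagnose("example.com")` returns a string starting with `dns:`, `tcp:`, `tls:` or `ok:`. It still can't tell a cut cable from a misconfigured VPN, which is rather the point of this thread.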

itscrush 1y ago

Much of this problem space I've solved by running MTR to the destination when troubleshooting, to see each hop's detail.

It's like ping + traceroute in a live running session with each hop broken down.

It's quite consistent: I'm often the first to notice a node down on the Xfinity network, and in the same mtr I can see that my network, at least up to my modem, is good. Or there's a hop beyond my ISP with 100s of ms of added latency, which I haven't seen other tools show as well as MTR can.

Won't solve everything, but it might be worth checking in your case, as it breaks the path down per hop and provides latency for each.

AlienRobot 1y ago

How are you trying to tell that?

If a web browser can't access a URL, it won't tell you exactly why, because there's a chance it diagnoses the reason wrongly, and most users would be confused by that. I assume most diagnostic tools work the same way. You need to make assumptions about how the OS, hardware, and network are configured to be able to say "the problem is here."

For example, when you access a website, the first thing that needs to be done is check a domain name server (DNS) to get the IP address of the web server. But where does the web browser get the DNS IPs from? You can configure it in the browser. Or in the OS. Or in your router. Or in your modem. And if you don't, it gets them from the DHCP server the router connects to, which could be your ISP's DHCP server (then you get your ISP's default DNS) or it could also be some other router in an organization's network.
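
As a small illustration of just the OS layer of that chain: on Unix-like systems the resolver configuration conventionally lives in `/etc/resolv.conf`, and a sketch like the one below lists the nameservers it names. The function name `parse_nameservers` is made up for this example, and the file location is an assumption that does not hold on every system.

```python
def parse_nameservers(resolv_conf_text):
    """Return the addresses on 'nameserver' lines of resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop comments, which start with '#' or ';'.
        line = line.split("#", 1)[0].split(";", 1)[0].strip()
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


if __name__ == "__main__":
    with open("/etc/resolv.conf") as handle:
        print(parse_nameservers(handle.read()))
```

Even this only shows what the OS was told. If it prints something like 127.0.0.53 you are looking at a local stub resolver, and the real upstream servers are configured somewhere else again.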

If the DNS seems wrong it's easy to tell the IP is wrong but it gets hard to say where that IP came from.

Even SSL could be a problem with the server having the wrong certificates or it could be your computer having the wrong certificates.

arccy 1y ago

http(s) is built on top of multiple layers (HTTP, TLS, TCP, Ethernet...). A broken link in the lower layers can't really be presented as a higher level message (because it has no access to it).

harry_ord 1y ago

Not a network person, and I only played with traceroute a long time ago, but I'm pretty sure that only really happens if you explicitly ask for information about all the middlemen.

Most of the time, a lot of software kinda doesn't care about what's happening, just whether it can do what it's told.

For Websites you often get more informative errors like 404, 500 or something else.

recursive 1y ago

If you're getting a status code like 404 or 500, it means there's no problem between you and the web server. The status codes come from the server. The exception is when you get a gateway/reverse proxy error. Usually 503 I think. That means the web server is down, but there's another server in front of it reporting that it's down.
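
A rough sketch of that reasoning in Python, for what it's worth. The function name `where_did_it_fail` and the wording of the results are invented for this example. 502 and 504 are the status codes defined specifically for gateways and proxies, while 503 can come from either the origin or something in front of it, so it is lumped in with the other 5xx codes here.

```python
# Status codes that are defined as coming from a gateway or proxy.
GATEWAY_CODES = {502: "Bad Gateway", 504: "Gateway Timeout"}


def where_did_it_fail(status):
    """Guess which hop produced an HTTP status code."""
    if status in GATEWAY_CODES:
        return "intermediary reached, upstream server failed"
    if 500 <= status <= 599:
        return "server (or a proxy in front of it) reported an error"
    if 400 <= status <= 499:
        return "server reachable, request rejected"
    return "server reachable"
```

Any answer at all from this function means the path between you and at least the first HTTP server is fine.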

harry_ord 1y ago

True, I thought of those as they're just more informative about why you're not getting what you're looking for.

YZF 1y ago

502 Bad Gateway.

YZF 1y ago

For most people most issues would be in their home network. So that's a good first guess for any connectivity problems. Rarely it would be somewhere between your home and the ISP. If it's a small rural ISP then it might be ISP->Internet though I'd think that's rare. Most large scale ISPs have enough redundancy and capacity.

As someone else mentioned, ICMP addresses certain classes of failures if enabled, but I think the historical reason is more that the Internet was meant to run over lossy connections. For example, when a certain link is saturated, routers will just start dropping packets. Reporting each dropped packet back to the sender is just not a good idea; it adds load to a system already potentially operating at capacity. TCP assumes packets can get lost and retransmits them. When a link goes down, routing protocols will potentially send those retransmitted packets over a different link/path. I.e. there's no real concept of "connection down" other than the application layer or TCP eventually giving up (which can take a very long time). The kind of ICMP message that will immediately terminate a connection is the one sent when the server machine doesn't have anything listening on the destination port.
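
To put a number on "a very long time", here is a back-of-the-envelope sketch of exponential backoff. The defaults are assumptions modelled on common Linux settings (15 retransmissions, a 200 ms starting timeout, a 120 s cap); the real timeout depends on the measured round-trip time, and the function name `total_retransmit_time` is made up for this example.

```python
def total_retransmit_time(retries=15, initial_rto=0.2, max_rto=120.0):
    """Seconds spent waiting before giving up, with a doubling, capped timeout."""
    total = 0.0
    rto = initial_rto
    # One wait after the original send, plus one after each retransmission.
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, max_rto)
    return total


if __name__ == "__main__":
    print(f"{total_retransmit_time() / 60:.1f} minutes")
```

With those assumed defaults the total comes to roughly 15 minutes of silence before an established connection is declared dead.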

boffinAudio 1y ago

Cyclomatic Complexity is why your Operating System can't do this for you.

https://en.wikipedia.org/wiki/Cyclomatic_complexity

There are so many different paths for an error case to follow.

You can of course debug this by reducing the complexity - for example, by watching one of the links in the chain (say, DNS) and seeing if it is failing - but this is the realm of network engineers who get paid mightily to get through this cyclomatic complexity and work at the relevant layers, all the way down to the atoms in the pipe ..

>If some link on the path breaks, why doesn’t the last good link send a message backward that says “Your message made it to me, but I tried to send it the next step and it failed there.”

In fact, the links all do this, but there is simply no provision in your OS - no fancy GUI, perhaps - that allows you to fully understand this without getting overwhelmed by the cyclomatic complexity. Tools exist, and once you learn to use them to tame the complexity - congrats, you're now worth $300k/yr and can go work in San Francisco .. /s ;)

StrLght 1y ago

Might be relevant: there's also detailed and somewhat interactive byte-by-byte example of TLS for TLSv1.2[0] and TLSv1.3[1]. I absolutely love it and highly recommend checking it out if you want to learn more about TLS.

[0]: https://tls12.xargs.org/

[1]: https://tls13.xargs.org/

jonwest 1y ago

Does anyone have more examples of articles written from this perspective? Regardless of my experience level I love diving through “ELI(a mediocre engineer)” type explanations, as I either learn another piece that wasn’t completely clear or get another set of examples to help explain it to other people. Either way they’re generally very helpful.

m1keil 1y ago

This article largely feels like a summary of 3 or 4 Cloudflare blog articles. If you want more of this stuff, check CF's learning centre: https://www.cloudflare.com/learning/

Examples: https://www.cloudflare.com/learning/dns/what-is-dns/ https://www.cloudflare.com/learning/ssl/transport-layer-secu... https://www.cloudflare.com/learning/performance/what-is-http...

Snawoot 1y ago

> The client generates a premaster secret, encrypts it with the server’s public key, and sends it to the server.

That hasn't been true for, like, ages.

Operyl 1y ago

Down below it says this:

> Everything you’ve learned here is a lie.

> The process we just describe is for the original version of TLS, which is outdated compared to the more modern version of TLS 1.3.

debo_ 1y ago

> aka. Writing HTTP requests from San Francisco for $300K/year

Best part of the article!

raxxorraxor 1y ago

> Current version of TLS (>1.3) do not support RSA (and various other cipher suites) for security reasons.

That is true for the key exchange part, because RSA key exchange does not offer forward secrecy. For signatures RSA is still used, and it is probably still the most widespread type of X.509 cert.

I know Safari upped the requirement to 2048-bit keys for RSA not too long ago (for signatures).

wonnage 1y ago

This reads like an AI summary of an actual HTTPS explainer. Terms get introduced with no context - no explanation of what a certificate is or how the chain of trust works, assumes the reader knows about public key cryptography, describes six out of the seven OSI layers (RIP presentation layer) without mentioning that term at all, etc.

TBF it is titled as mediocre!

MediumD OP 1y ago

To be fair, I also didn’t include the session layer!

My writing isn’t a strength of mine, so I appreciate the criticism. My writing going from “bad” -> “is it AI?” is progress.

I struggled with where to “cut off” the explanation, and public key cryptography seemed like a good boundary and better explained elsewhere, as did the various OSI layers.

I probably should have gone over the cert and potentially the full chain of trust, I’ll give you that.

pietrod 1y ago

I'm unable to find any code that shows how to verify the signature of SHA256(client_hello_random + server_hello_random + curve_info + public_key). I know the theory, but somehow I run into issues implementing it. Can anybody link an actual toy program showing practically how to do this?
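
Not a link, but here is a toy sketch in pure Python of what that check can look like for a TLS 1.2 ECDHE handshake signed with RSA PKCS#1 v1.5 and SHA-256. Everything about the key is an assumption for illustration: it is built from two Mersenne primes and is completely insecure, and the function names are made up. A real server signs with its certificate's key and may use ECDSA or RSA-PSS instead, depending on the signature algorithm it announces, so treat this as a way to see the byte layout rather than as working TLS code.

```python
import hashlib

# Toy RSA key built from two Mersenne primes. Insecure, for illustration only.
P = 2**521 - 1
Q = 2**607 - 1
N = P * Q
E = 65537
D = pow(E, -1, (P - 1) * (Q - 1))
KEY_BYTES = (N.bit_length() + 7) // 8

# DER prefix of the DigestInfo structure for SHA-256 (RFC 8017, section 9.2).
SHA256_PREFIX = bytes.fromhex("3031300d060960864801650304020105000420")


def signed_params(client_random, server_random, named_curve, public_key):
    """Bytes covered by the ServerKeyExchange signature (ECDHE, named curve)."""
    curve_info = b"\x03" + named_curve.to_bytes(2, "big")  # 3 = named_curve
    return (client_random + server_random + curve_info
            + bytes([len(public_key)]) + public_key)


def emsa_pkcs1_v15(message):
    """EMSA-PKCS1-v1_5 encoding of SHA-256(message), padded to the key size."""
    digest_info = SHA256_PREFIX + hashlib.sha256(message).digest()
    padding = b"\xff" * (KEY_BYTES - len(digest_info) - 3)
    return b"\x00\x01" + padding + b"\x00" + digest_info


def sign(message):
    """What the server does with its private key (toy key here)."""
    encoded = int.from_bytes(emsa_pkcs1_v15(message), "big")
    return pow(encoded, D, N).to_bytes(KEY_BYTES, "big")


def verify(message, signature):
    """What the client does with the public key (N, E) from the certificate."""
    recovered = pow(int.from_bytes(signature, "big"), E, N)
    return recovered.to_bytes(KEY_BYTES, "big") == emsa_pkcs1_v15(message)
```

The part that usually trips people up is that the hash is not signed bare: it is wrapped in the DigestInfo prefix and padded out to the key length first. To check a real capture you would swap the toy (N, E) for the modulus and exponent from the server certificate and feed in the exact bytes from the handshake, e.g. `0x001d` as `named_curve` for x25519.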

deathanatos 1y ago

> By agreeing on all these algorithms, exchanging random seeds, and the server’s SSL certificate containing the private key;

I sure hope not. But I suppose it is titled "Mediocre Engineer".

> $300K/year

… I'll undercut you by $50k/y; where do I apply?

(There are just more and more errors. TLS <1.3 doesn't even work the way it describes, even though it tries to throw newer stuff into 1.3. The DNS section describes a recursive resolver, but the client isn't going to do that. It is probably talking to a stub resolver, too. "Internet Layer". The implication of "brotli" being a widely used algorithm in a ciphersuite/in TLS's compression, "Current version of TLS (>1.3) do not support RSA" …

… these sorts of blogspam are why I sometimes wish there was a downvote. The advert isn't obnoxious enough to make me want to flag it. I guess I should write the less mediocre article and make the HN frontpage. If only I made $300K/y, I'd have more time.)

_ache_ 1y ago

Everything in that article is a little outdated; 30% of web requests are in HTTP3 nowadays, with CORS. There is no date of publication.

recursive 1y ago

30% of requests are CORS? Surely this depends on what type of development you're doing. I'm doing SaaS development for systems generally deployed inside corporate networks. Very close to 0% of requests are CORS. Same for HTTP3.

_ache_ 1y ago

I said 30% of the requests on the web use HTTP3. And nowadays there are CORS and other mechanisms that are not mentioned in the article.
