Here's my bpftrace[0] SYN backlog tool from BPF Performance Tools (2019 book; tools are online[1]):
    # tcpsynbl.bt
    Attaching 4 probes...
    Tracing SYN backlog size. Ctrl-C to end.
    ^C
    @backlog[backlog limit]: histogram of backlog size

    @backlog[128]:
    [0]                    2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

    @backlog[500]:
    [0]                 2783 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [1]                    9 |                                                    |
    [2, 4)                 4 |                                                    |
    [4, 8)                 1 |                                                    |
The source:

    #!/usr/local/bin/bpftrace

    #include <net/sock.h>

    BEGIN
    {
        printf("Tracing SYN backlog size. Ctrl-C to end.\n");
    }

    kprobe:tcp_v4_syn_recv_sock,
    kprobe:tcp_v6_syn_recv_sock
    {
        $sock = (struct sock *)arg0;
        @backlog[$sock->sk_max_ack_backlog & 0xffffffff] =
            hist($sock->sk_ack_backlog);

        if ($sock->sk_ack_backlog > $sock->sk_max_ack_backlog) {
            time("%H:%M:%S dropping a SYN.\n");
        }
    }

    END
    {
        printf("\n@backlog[backlog limit]: histogram of backlog size\n");
    }
This bpftrace tool is only 24 lines. The BCC tools in this post are >200 lines (and more complex: they need to worry about bpf_probe_read() etc.). The bpftrace version can also be easily modified to include extra details. I'm summarizing backlog length as a histogram since our prod hosts can accept thousands of connections per second.

[0] https://github.com/iovisor/bpftrace

[1] https://github.com/brendangregg/bpf-perf-tools-book
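As an aside (my addition, not one of the book's tools): if you can't run BPF on a host, the kernel's aggregate SNMP counters give a much coarser view of the same problem. This sketch assumes the standard Linux /proc/net/netstat layout, where a "TcpExt:" header row of counter names is followed by a "TcpExt:" row of values:

```shell
# Print the kernel's listen-queue overflow/drop totals from
# /proc/net/netstat. Unlike the bpftrace tool, these are counters
# since boot, with no per-backlog-limit breakdown and no histogram.
awk '
    $1 == "TcpExt:" && !seen { for (i = 1; i <= NF; i++) col[$i] = i; seen = 1; next }
    $1 == "TcpExt:" {
        print "ListenOverflows:", $col["ListenOverflows"]
        print "ListenDrops:",     $col["ListenDrops"]
    }
' /proc/net/netstat
```

ListenOverflows counts SYNs arriving when the accept queue was already full; ListenDrops includes those plus other drops on listen sockets.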
That said, there might well be a case for automatic backlog scaling. Or, for that matter, for increasing the default.
1: https://github.com/dspinellis/unix-history-repo/blob/0f4556f...
Are the default settings here reasonable for most cases, or are they more like something you should tune even if you're not really pushing any limits?
/etc/sysctl.conf:
net.core.wmem_max = 12582912
net.core.rmem_max = 12582912
net.ipv4.tcp_rmem = 10240 87380 12582912
net.ipv4.tcp_wmem = 10240 87380 12582912
fs.file-max = 1000000
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_syn_backlog = 262144
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_fin_timeout = 3
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_max_orphans = 262144
net.core.somaxconn = 1000000
nginx.conf (just the relevant directives):

    worker_rlimit_nofile 102400;

    events {
        worker_connections 102400;
        multi_accept on;
    }

    http {
        server {
            listen 80 default_server reuseport backlog=102400;
            ...
        }
    }
As you can see, the socket and backlog-related values have been cranked way up, and I've never had any problems with this configuration. Because these servers are behind an ALB, I don't know how relevant the numbers are: the SYN/SYN-ACK round trip is between the server and the load balancer, not the remote clients. But I could be wrong; maybe there's something I'm missing. I've never had any performance problems related to TCP connections in the kernel or NGINX. Of course, YMMV: high-latency networks would reduce those numbers.
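One interaction worth calling out (my note, not from the config above): the kernel silently caps the backlog passed to listen() at net.core.somaxconn, so with these values the effective accept-queue limit is min(1000000, 102400) = 102400 — the nginx backlog, not somaxconn. A minimal sketch of that clamp:

```shell
# Sketch of the kernel's behavior: listen(fd, backlog) is capped at
# net.core.somaxconn, so the effective accept-queue limit is the
# smaller of the two values.
effective_backlog() {
    somaxconn=$1
    requested=$2
    if [ "$requested" -lt "$somaxconn" ]; then
        echo "$requested"
    else
        echo "$somaxconn"
    fi
}

# With the config above: somaxconn=1000000, nginx backlog=102400
effective_backlog 1000000 102400    # prints 102400
```

So here the somaxconn of 1000000 is effectively headroom: raising it further changes nothing unless the listen backlog is raised too.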
Anyway, I don't see why the numbers aren't 100 times larger by default, but there's probably a reason.