There is no kernel bypass in wireguard-go, just a fast user-space implementation that makes smart use of syscalls to minimize the overhead of being split between user space and kernel space.
With io_uring, DPDK-style kernel bypass might stop making sense altogether.
It's similar for applications: if you can, say, decode a whole DNS packet in one go, you don't really want the kernel to spend time decoding the UDP packet and then have you decode the rest of it; doing it in one step is much faster.
In fact, I have a standing bet with some of my Rustacean friends that they can't show me a typical HTTP service in Rust with performance numbers (RPS, latency, throughput) that I can't meet or beat in Go.
Of course there are lots of caveats there: what does normal-ish mean? Probably most of the work is going to be I/O-bound, it should run on normal server-class hardware, et cetera, et cetera.
But nothing yet.
I'd bet that very good Go and Rust programmers could probably converge to almost identical performance.
What I wouldn't bet on is that Go could equal Rust in the area of small memory footprint or on small devices.
One area where Rust is still better is memory-constrained environments, e.g. mobile and microcontrollers. That said, there's TinyGo, and the Go runtime keeps getting slimmer, so you can now have binaries and memory footprints smaller than 5 MB on most mobile platforms, which is absolutely acceptable even for budget phones. I think Tailscale, for example, runs their modified version of wireguard-go on all mobile clients without issues.
https://www.techempower.com/benchmarks/#section=data-r21&tes...
Do you have those numbers as well?
We increased the sizes of the UDP buffers in the prior round of optimizations. The kernel defaults for UDP buffers are too small to approach the throughput discussed here - and the default sizings were the primary source of lots of dropped packets. I raised those to 7mb, which seems like an odd number, but it's the largest you can set on macOS before the kernel rejects it - likely we'll eventually head for a per-platform split. At these speeds a 7mb buffer represents up to 5ms of flow data, though this does not imply that it creates 5ms of bufferbloat - it just means that this increased buffer could itself account for 5ms in the worst non-lossy case. On the userspace side Tailscale also has some more buffer space now (we're reading and writing lists of packets at a time, not single packets), but the sizing there is more complex.
This topic in general is much more complex - in the first throughput post I originally started to dig into it, and we cut that in editing because it was making the post too dense and there wasn't space to give the topic the attention it deserves. One day we'll talk about this too. Typically right now we add very little latency, low millis or lower - we actually add more jitter than latency, as any userspace program would. It's still orders of magnitude lower than the levels which even concern a typical realtime application such as gaming or communications - for example someone was recently talking about using Tailscale on their Steamdeck while on vacation to play Hogwarts streaming from their PC.
In the meantime, a real-world example for you. I have a border router that I built using a relatively cheap piece of hardware (Intel(R) Celeron(R) J4105 CPU @ 1.50GHz). It has NICs that support GRO/GSO, but the CPU is the bottleneck for throughput. The box does 563 Mbits/sec inbound to the LAN over Tailscale (949 Mbits/sec raw). I run this as an exit node for my workstation all the time, even though it's in the same building, for the sake of diagnosing bugs and experiencing the product full time. In my initial test today, under peak load the exit node added 35ms of latency each way. I was surprised by this, so I checked going direct rather than via the exit node: there I see 15ms down and 30ms up of latency increase under peak load. It seems Comcast dropped some capacity since I last tuned my uplink!
I then re-tuned CAKE on the router uplink to be more aggressive resulting in a raw bloat of 0ms/0ms, and then retested with the Tailscale exit node. With these more aggressive CAKE tunings, Tailscale also stayed at 0ms/0ms. This CAKE tuning ate a chunk of throughput capacity, as expected. The specific tuning here being for a Comcast 1000/40 link, and the system CPU bound at 500mbps for forwarding:
+ tc qdisc add dev internet root handle 1: cake docsis ack-filter-aggressive nat bandwidth 40mbit lan
+ ip link add name ifbinternet type ifb
+ tc qdisc add dev internet handle ffff: ingress
+ tc qdisc add dev ifbinternet root cake bandwidth 500mbit lan
+ ip link set ifbinternet up
+ tc filter add dev internet parent ffff: matchall action mirred egress redirect dev ifbinternet
On the LAN side, between the same machines (fq_codel only, default settings), running iperf3 alongside ping. Under max load ([ 5] 0.00-57.73 sec 3.72 GBytes 554 Mbits/sec receiver):
10 packets transmitted, 10 received, 0% packet loss, time 9013ms
rtt min/avg/max/mdev = 2.625/3.620/4.536/0.646 ms
Zero load: 10 packets transmitted, 10 received, 0% packet loss, time 9014ms
rtt min/avg/max/mdev = 0.648/0.954/1.713/0.306 ms
What do these numbers mean? In practice they mean you'll notice WiFi more than you'll notice Tailscale, but we can and will still do better over time. Here's WiFi from a MacBook to the border router on the same LAN segment (no WireGuard/Tailscale):
10 packets transmitted, 10 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 3.845/11.363/34.152/8.940 ms
This is already long for an HN response, and there's so much more to say, but I hope it helps!

There are two sides: the userspace UDP socket that receives wg packets, and the tun file descriptor that receives unencrypted packets from the host OS.
To speed up the userspace UDP socket it's desirable to use the UDP_GRO flag on RX and the UDP_SEGMENT flag on TX; `tx-udp-segmentation` is hardware help for the latter. No need for any checksums and such. This is just a speedup for the userspace "classic" UDP socket.
However, buffering with UDP_GRO is interesting, since you need to pass a potentially large 64 KiB buffer to the kernel, because you don't know how large the next GRO super-packet is. (This is a digression.)
On the tun side, the article implies they enabled TUN_F_TSO4, which is a magical offload flag on the tun interface. With it, it is possible to get large packets from the host OS. This is where it gets interesting. If you get a very large block from the host, say 14 KiB or larger... how do you push it to the WireGuard socket? I guess it's necessary to packetize it back into small-MSS packets before encrypting. That means recreating TCP headers (with sequence numbers) and filling in the checksum. This sounds like "fun".
The same question applies on the TX side towards the host: if you get a number of TCP segments from the wg tunnel and decrypt them, do you push them to tun as one large TUN_F_TSO segment, or do you push them one by one and rely on the kernel to GRO them? I didn't quite get that from the article. Or maybe it's possible to send large packets over wg without segmentation?
The same discussion applies to UDP. With UDP you can use TUN_F_USO; however, this is only available in kernel 6.2. This might be why there aren't too many UDP numbers in the article.
They have MagicDNS, but that only works for individual Tailscale nodes. I want multiple DNS records pointing to a single Tailscale node. It would be even better if I could use my own domain (a subdomain, even) instead of their long `foo-bar.ts.net` domain.
Currently I need to do this manually, but that seems overly redundant since Tailscale already does 90% of it with MagicDNS, and MagicDNS is fast because it runs in their client rather than on a remote server.
Step 2: set up a Technitium container in host networking mode
Step 3: configure Technitium with a stub zone pointing your ts.net name at 100.100.100.100
Step 4: set up a zone for whatever.tld
Step 5: set up a DNAME record for ts.whatever.tld pointing at your ts.net domain
Result: querying this new DNS server for machine.ts.whatever.tld resolves to machine.blah-foo.ts.net, which in turn resolves to that machine's 100.64.0.0/10 address.
My point was that MagicDNS is implemented in the Tailscale client on each machine (fault tolerant, 0ms latency) and has almost all the things necessary (DNS resolver, push mechanism for record updates) except for a custom defined zone.
Running `drill @100.100.100.100 <node_name>.<magic_dns_domain>.ts.net` is 0ms because it's local, and doesn't depend on a single DNS server running somewhere on my Tailscale network.
Unfortunately there aren’t any options for it on the Tailscale control panel, but if you use Headscale you can configure it and take advantage of it now.
I did find headscale docs about "Setting custom DNS records"[0]. It seems only `A` and `AAAA` records are supported. This might be the start of me setting up headscale this weekend.
[0] https://github.com/juanfont/headscale/blob/main/docs/dns-rec...
It's the first I hear of this. I wonder if there's any big advantage of this for someone who is already using syncthing for the same purpose? Biggest thing I could hope for is that it's faster. But I generally don't keep Tailscale running on mobile because I don't need it to and don't like the persistent notification.
Most backup/sync products are designed to work in the background and often require upload before download. I don’t know if syncthing does streaming syncs though.
Another difference is transfers can easily be untrusted, as in sender and receiver don’t need access to each others file systems. Take magic wormhole (or email attachments for that matter) as an example.
Taildrop is somewhere in between – I think you have to be on the same tailnet, but there's no need for awareness of the other device's file system.
With Taildrop you just need to share something with a couple of clicks, and it'll appear on the device(s) you share it to.
My escape hatch from the monopoly is headscale[0] which I can self host.
More to the point, I hope their technology becomes commonplace & gratis a la LetsEncrypt for SSL Certificates.
If this means I continue to forget how to run OpenVPN I consider that well worth it.
Overall the summary of time spent is still a similar story at the coarse scale - our recent optimizations mean that we're getting ever closer to the point where we need to start working on the next layer, such as optimizing the queues (visible here in the chanrecv and scheduler times - Go runtime stuff), and once we get that out of the way things like crypto and copying will become targets. The work goes on, we have lots of plans and ideas!
Have these optimizations (TCP GRO/GSO) been applied to non-root tailscale? I imagine the changes needed are wildly different, as the TUN device itself is gvisor/netstack. I believe the UDP GRO/GSO part (discussed in today's blog post) may work as-is.
IIRC we use a contiguous 64 KiB buffer in the first scatter-gather slot and 128 messages per syscall in the current tuning.