There is no kernel bypass in wireguard-go, just a fast user-space implementation that makes smart use of syscalls to minimize the overhead of being split between user space and kernel space.
With io_uring, DPDK-style kernel bypass might stop making sense altogether.
It's similar for applications: if you can, say, decode a whole DNS packet in one go, you don't really want the kernel to spend time decoding the UDP packet and then have you decode the rest of it; doing it in one step is much faster.
In fact, I have a standing bet with some of my Rustacean friends that they can't show me a typical HTTP service in Rust with performance numbers (RPS, latency, throughput) that I can't meet or beat in Go.
Of course there are lots of caveats there: what does normal-ish mean? Probably most of the work is going to be I/O-bound, it should run on normal server-class hardware, et cetera, et cetera.
But nothing yet.
I'd bet that very good Go and Rust programmers could probably converge to almost identical performance.
What I wouldn't bet on is that Go could equal Rust in the area of small memory footprint or on small devices.
One area where Rust is still better is memory-constrained environments, e.g. mobile and microcontrollers. That said, there's TinyGo, and the Go runtime keeps getting slimmer, so you can now have binaries and memory footprints smaller than 5 MB on most mobile platforms, which is absolutely acceptable even for budget phones. I think Tailscale, for example, runs their modified version of wireguard-go on all mobile clients without issues.
https://www.techempower.com/benchmarks/#section=data-r21&tes...
Do you have those numbers as well?
We increased the sizes of the UDP buffers in the prior round of optimizations. The kernel defaults for UDP buffers are too small to approach the throughput discussed here - and the default sizings were the primary source of lots of dropped packets. I raised those to 7mb, which seems like an odd number, but it's the largest you can set on macOS before the kernel rejects it - likely we'll eventually head for a per-platform split. At these speeds a 7mb buffer represents up to 5ms of flow data, though this does not imply that it creates 5ms of bufferbloat - it just means that this increased buffer could itself account for 5ms in the worst non-lossy case. On the userspace side Tailscale also has some more buffer space now (we're reading and writing lists of packets at a time, not single packets), but the sizing there is more complex.
This topic in general is much more complex - in the first throughput post I originally started to dig into it, and we cut that in editing because it was making the post too dense and there wasn't space to give the topic the attention it deserves. One day we'll talk about this too. Typically right now we add very little latency, low millis or lower - we actually add more jitter than latency, as any userspace program would. It's still orders of magnitude lower than the levels which even concern a typical realtime application such as gaming or communications - for example someone was recently talking about using Tailscale on their Steamdeck while on vacation to play Hogwarts streaming from their PC.
In the meantime, a real-world example for you. I have a border router that I built using a relatively cheap piece of hardware (Intel(R) Celeron(R) J4105 CPU @ 1.50GHz). It has NICs that support GRO/GSO, but the CPU is the bottleneck for throughput. The box does 563 Mbits/sec inbound to the LAN over Tailscale (949 Mbits/sec raw). I run this as an exit node for my workstation all the time, even though it's in the same building, for the sake of diagnosing bugs and experiencing the product full time. In my initial test today, under peak load the exit node added 35ms of latency each way. I was surprised by this, so I checked going direct rather than via the exit node: there I see 15ms down and 30ms up of latency increase under peak load. It seems Comcast dropped some capacity since I last tuned my uplink!
I then re-tuned CAKE on the router uplink to be more aggressive resulting in a raw bloat of 0ms/0ms, and then retested with the Tailscale exit node. With these more aggressive CAKE tunings, Tailscale also stayed at 0ms/0ms. This CAKE tuning ate a chunk of throughput capacity, as expected. The specific tuning here being for a Comcast 1000/40 link, and the system CPU bound at 500mbps for forwarding:
+ tc qdisc add dev internet root handle 1: cake docsis ack-filter-aggressive nat bandwidth 40mbit lan
+ ip link add name ifbinternet type ifb
+ tc qdisc add dev internet handle ffff: ingress
+ tc qdisc add dev ifbinternet root cake bandwidth 500mbit lan
+ ip link set ifbinternet up
+ tc filter add dev internet parent ffff: matchall action mirred egress redirect dev ifbinternet
On the LAN side, between the same machines (fq_codel only, default settings), running iperf3 alongside ping. Under max load ([ 5] 0.00-57.73 sec 3.72 GBytes 554 Mbits/sec receiver):
10 packets transmitted, 10 received, 0% packet loss, time 9013ms
rtt min/avg/max/mdev = 2.625/3.620/4.536/0.646 ms
Zero load: 10 packets transmitted, 10 received, 0% packet loss, time 9014ms
rtt min/avg/max/mdev = 0.648/0.954/1.713/0.306 ms
What do these numbers mean? In practice they mean you'll notice WiFi more than you'll notice Tailscale, but we can and will still do better over time. Here's WiFi from a MacBook to the border router on the same LAN segment (no WireGuard/Tailscale):
10 packets transmitted, 10 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 3.845/11.363/34.152/8.940 ms
This is already long for an HN response, and there's so much more to say, but I hope it helps!

There are two sides: the userspace UDP socket that receives wg packets, and the tun file descriptor that receives unencrypted packets from the host OS.
To speed up the userspace UDP socket it's desirable to use the UDP_GRO flag on RX and the UDP_SEGMENT flag on TX; `tx-udp-segmentation` is hardware help for the latter. No need for any checksums and such. This is just a speedup for the userspace "classic" UDP socket.
However, buffering with UDP_GRO is interesting, since you need to pass a potentially large 64 KiB buffer to the kernel, because you don't know how large the next GRO super-packet is. (This is a digression.)
On the tun side, the article implies they enabled TUN_F_TSO4, which is a magical offload flag on the tun interface. With it, it is possible to get large packets from the host OS. This is where it gets interesting. If you get a very large block from the host, say 14 KiB or larger... how do you push it to the WireGuard socket? I guess it's necessary to packetize it back into small-MSS packets before encrypting. That means recreating TCP headers (with sequence numbers) and filling in the checksum. This sounds like "fun".
The same question applies on the TX side towards the host: if you get a number of TCP segments from the wg tunnel and decrypt them, do you push them to tun as one large TUN_F_TSO segment, or do you push them one by one and rely on the kernel to GRO them? I didn't quite get that from the article. Or maybe it's possible to send large packets over wg without segmentation?
The same discussion applies to UDP. With UDP you can use TUN_F_USO; however, this is only available in kernel 6.2. This might be why there aren't too many UDP numbers in the article.
They have MagicDNS, but that only works for individual Tailscale nodes. I want multiple DNS records pointing to a single Tailscale node. It would be even better if I could use my own domain (a subdomain, even) instead of their long `foo-bar.ts.net` domain.
Currently I need to do this manually, but that seems overly redundant since Tailscale already does 90% of it with MagicDNS, and MagicDNS is fast because it runs in their client rather than on a remote server.
Step 2: set up a Technitium container in host networking mode
Step 3: configure Technitium with a stub zone pointing your ts.net name at 100.100.100.100
Step 4: set up a zone for whatever.tld
Step 5: set up a DNAME record for ts.whatever.tld pointing at your ts.net domain
Result: querying this new DNS server for machine.ts.whatever.tld resolves to machine.blah-foo.ts.net, which in turn resolves to that machine's 100.64.0.0/10 address.
My point was that MagicDNS is implemented in the Tailscale client on each machine (fault tolerant, 0ms latency) and has almost all the things necessary (DNS resolver, push mechanism for record updates) except for a custom defined zone.
Running `drill @100.100.100.100 <node_name>.<magic_dns_domain>.ts.net` is 0ms because it's local, and doesn't depend on a single DNS server running somewhere on my Tailscale network.
Unfortunately there aren’t any options for it on the Tailscale control panel, but if you use Headscale you can configure it and take advantage of it now.
I did find headscale docs about "Setting custom DNS records"[0]. It seems only `A` and `AAAA` records are supported. This might be the start of me setting up headscale this weekend.
[0] https://github.com/juanfont/headscale/blob/main/docs/dns-rec...
It's the first I hear of this. I wonder if there's any big advantage of this for someone who is already using syncthing for the same purpose? Biggest thing I could hope for is that it's faster. But I generally don't keep Tailscale running on mobile because I don't need it to and don't like the persistent notification.
Most backup/sync products are designed to work in the background and often require upload before download. I don’t know if syncthing does streaming syncs though.
Another difference is transfers can easily be untrusted, as in sender and receiver don’t need access to each others file systems. Take magic wormhole (or email attachments for that matter) as an example.
Taildrop is somewhere in between – I think you have to be on the same tailnet, but there's no need for awareness of the other device's file system.
With Taildrop you just need to share something with a couple of clicks, and it'll appear on the device(s) you share it to.
My escape hatch from the monopoly is headscale[0] which I can self host.
More to the point, I hope their technology becomes commonplace & gratis a la LetsEncrypt for SSL Certificates.
If this means I continue to forget how to run OpenVPN I consider that well worth it.
Overall the summary of time spent is still a similar story at the coarse scale - our recent optimizations mean that we're getting ever closer to the point where we need to start working on the next layer, such as optimizing the queues (visible here in the chanrecv and scheduler times - Go runtime stuff), and once we get that out of the way things like crypto and copying will become targets. The work goes on, we have lots of plans and ideas!
Have these optimizations (TCP GRO/GSO) been applied to non-root tailscale? I imagine the changes needed are wildly different, as the TUN device itself is gvisor/netstack. I believe the UDP GRO/GSO part (discussed in today's blog post) may work as-is.
IIRC we use a contiguous 64 KiB buffer in the first scatter-gather slot and 128 messages per syscall in the current tuning.