I wonder what would need to happen to convince people that:
1. Even if you do something extremely low level, you can draw a distinction between your hardware and the interface that 99% of your software runs at.
2. You can develop complex behaviors iteratively with automated testing just like you can develop complex programs iteratively (tests are just programs).
https://technology.riotgames.com/news/automated-testing-leag...
3. It's worth it.
I work at (relatively) low levels, and I would absolutely love to have extensive tests (plus more, e.g. TLA+ models to prove critical properties of the systems I work on).
The pushback comes from stakeholders. They don't want to invest time and money into automated testing.
And when no automated testing has been done yet, you can guess that the system hasn't been architected to be easily testable. Figuring out how to add useful tests without massive (time-consuming and expensive, potentially error-prone) re-architecting is also something that requires quite a bit of investment.
Of course, part of it is just lack of experience. If someone who knows how it's done could lead by example and show the ropes, that'd probably help. Getting the framework off the ground could be the key to sneaking in some tests in the future, even when nobody asks for them.
So we have spent 2 months writing black box tests against the RabbitMQ version, swapped it out with Kafka and fixed all issues within a couple of weeks.
Since then, I believe that integration tests are much more valuable than unit tests.
I would normally agree with you, but a TCP Stack is one of those things where I vehemently disagree.
Communication stacks, in general, are giant piles of implicit state unless you go out of your way to manage the state explicitly. As such, they have obscure bugs which are difficult to find when some of the cases are hit rarely.
A communication stack really needs to be written such that the inputs (including TIME), the outputs, the current state, and the next state are all quite explicit. This enables you to test the stack because it is now deterministic with respect to inputs and outputs.
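A minimal sketch of that idea, with illustrative names and a toy retransmission timer (not a real protocol): every transition is a pure function of (state, input), time is just another input, and so tests need no sleeps, sockets, or mocks.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class StackState:
    unacked: bytes = b""   # data sent but not yet acknowledged
    sent_at_ms: int = 0    # when it was last (re)transmitted
    rto_ms: int = 200      # retransmission timeout

def step(state, event):
    """Pure transition: (state, event) -> (next_state, outputs).
    An event is (kind, at_ms, payload); outputs are frames for the wire."""
    kind, at_ms, payload = event
    if kind == "send":                 # application wants data out
        return replace(state, unacked=payload, sent_at_ms=at_ms), [payload]
    if kind == "ack":                  # peer acknowledged everything
        return replace(state, unacked=b""), []
    if kind == "tick" and state.unacked and at_ms - state.sent_at_ms >= state.rto_ms:
        return replace(state, sent_at_ms=at_ms), [state.unacked]  # retransmit
    return state, []

# Because time is an explicit input, the retransmit path is fully deterministic:
s = StackState()
s, out = step(s, ("send", 0, b"hello"))
assert out == [b"hello"]
s, out = step(s, ("tick", 100, b""))
assert out == []                  # RTO has not expired yet
s, out = step(s, ("tick", 250, b""))
assert out == [b"hello"]          # deterministic retransmit at t=250ms
```

The frozen dataclass makes every state transition produce a new value, so a test can hold onto any intermediate state and branch from it.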
Yes, it's not easy. And it requires that you really mean it and architect it that way. You may not be able to evolve your current stack and may have to throw it away--that's never going to be popular.
However, every single time I have done this for a communication stack (USB, CANopen, BLE, etc.), the result was that the new stack quickly overtook the old stack on basically every metric worth monitoring (throughput, latency, reliability, bug rate, etc.).
Now, to be fair, I was obviously replacing a communication stack that was some level of "pile of crap" or I wouldn't have done it. However, I'm just one person, and those stacks generally came from a company who had a vested interest in it not sucking. I'm not some amazing programmer, and I certainly didn't spend more time on it than the original stack, so it really comes down to the fact that the "underlying architecture" was simply a better idea.
There's been a lot of research, and internal studies, done at many companies that show pretty impressive benefits.
When really questioned, most engineers just say "I know my code works" or "I test my code, I don't need automated tests". That's the mentality I just don't understand.
> Testing takes effort and makes it harder to change things.
If it "makes things hard to change" just delete the test? You'll still get the benefit of knowing XYZ are broken/altered. You can also automate end-to-end and black box tests which should absolutely not require any modification if you're just refactoring.
> If I am writing code that controls a spaceship then it makes sense to spend a huge amount of effort on testing. On the other hand, if I am adding a feature to a web application then in my personal experience, most of the time adding automated testing is a waste of effort.
If you are working on something that is allowed to fail, then sure, you don't really need to care about what practices you use. But "it's ok for my things to break" is a catch-all argument. It applies just the same to all of these:
"Why do I need a version control system? It's fine if I manually merge my code incorrectly"
"Why do I need a build system? It's fine if I forget to recompile one of my files"
etc.
In addition: the "argument" for automated testing isn't that it will just prevent you from breaking something. It's that it lets you know when things change and makes it easy to update your code without manually checking if things are broken. Recently, when adding features to our frontend, I just run our tests and update a png file in our repo. I then play around until my styling is how I like it. It's completely automated and saves me a lot of time. It also lets others know immediately when their CSS change will affect, or will not affect, my components.
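The png workflow described above is essentially golden-file (snapshot) testing. A hedged sketch of the core helper, with made-up names; real projects usually get this from their test framework:

```python
from pathlib import Path

def assert_matches_snapshot(rendered: bytes, snapshot: Path, update: bool = False):
    """Golden-file check: fail when rendered output drifts from the committed
    baseline; rerun with update=True to bless an intentional change."""
    if update or not snapshot.exists():
        snapshot.write_bytes(rendered)   # record or refresh the baseline
        return
    assert rendered == snapshot.read_bytes(), f"{snapshot.name} drifted from baseline"

# Demo against a temporary directory instead of a real repo:
import tempfile
baseline = Path(tempfile.mkdtemp()) / "button.png"
assert_matches_snapshot(b"pixels-v1", baseline)   # first run records the baseline
assert_matches_snapshot(b"pixels-v1", baseline)   # unchanged render passes
```

In CI you would leave `update=False`, so any visual change shows up as a failing test plus a diffable file in the pull request.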
Also, take a look at gvisor's network stack. It's definitely unit tested.
https://github.com/google/gvisor/tree/master/pkg/tcpip/link/... (an example)
Also, some networking tests use separate frameworks (which look more like the setup the original post is describing, since those are needed also), e.g.: https://github.com/google/gvisor/tree/master/test/packetimpa...
The idea of putting the TCP stack in user space is interesting. If one could actually map the device's memory into user space, one could get away with fewer system calls and therefore better performance.
Also, what I find somewhat irritating about using a Linux system is how often one needs to run commands as root (sudo) for common administrative tasks like mounting a disk. Having a user-space TCP stack could also decrease the need for that as far as setting up the network is concerned. If the Linux machine is single-user, as most of them are nowadays, it makes more sense that way, I think.
I would think that if you don't do this, an attacker who can execute code but is not yet root could easily elevate privileges by shadowing legitimate paths and tricking root into executing untrusted code.
I'm not a security engineer and just find it interesting, so if my thinking is off, please correct me.
https://fd.io/docs/vpp/master/whatisvpp/hoststack.html
And there is a sister project using this tech to get noticeable speed-ups:
Disclaimer: I am involved with the VPP project.
I think it’s important to distinguish between the protocol (TCP) and the hardware device. You would still absolutely need to talk to the device, it’s just that moving a lot of the logic to user space means much less context switching for system calls for the application.
I can imagine on Linux you can talk directly to /dev/eth0 if you would want to (in the same way that you can talk to /dev/sda), and then you would be back at square one regarding root privileges.
It's an AF_PACKET, SOCK_RAW socket rather than a device file, but yes.
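For the curious, a sketch of opening such a socket from Python on Linux; "eth0" is a placeholder interface name, and the call needs CAP_NET_RAW (typically root), which is exactly the privilege question being discussed:

```python
import socket

ETH_P_ALL = 0x0003  # from <linux/if_ether.h>: match every EtherType

def open_link_socket(ifname: str) -> socket.socket:
    """Open an AF_PACKET/SOCK_RAW socket bound to one interface.
    Delivers whole Ethernet frames, headers included."""
    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
    s.bind((ifname, 0))  # (interface, protocol); 0 keeps the socket's protocol
    return s

try:
    with open_link_socket("eth0") as s:   # placeholder interface name
        frame = s.recv(65535)             # one raw frame off the wire
except (PermissionError, OSError) as exc:
    print(f"raw sockets are privileged: {exc}")
```

Without CAP_NET_RAW the `socket()` call fails with EPERM, which is why tools like tcpdump either run as root or are granted that capability.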
Indeed! Julia Evans wrote a really nice post explaining the usecases and benefits - https://jvns.ca/blog/2016/06/30/why-do-we-use-the-linux-kern...
Most machines, at least outside embedded devices, are not like this. They are multi-user systems even when there's only ever one breathing thing at the desk, because it offers a degree of separation between the privileges of your daemons, your pid 1, your web browser, etc.
You basically need the fuzzer to have a model of TCP state so that it can effectively explore the state space, which is quite complicated and not something you can do with off-the-shelf tools.
But once you have a bunch of unit tests designed to put the TCP stack into a specific state + a way of saving and restoring that state, it's really easy to just have snapshot of interesting situations where you can run a fuzzer on the next packet to be transmitted and see what happens.
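A toy sketch of that snapshot-then-fuzz loop, with invented names and a deliberately planted bug standing in for a real stack; the point is only the shape of the harness (restore state, mutate one packet, step, catch violations):

```python
import copy
import random

def fuzz_from_snapshot(snapshot, step, mutate, iterations=5000, seed=0):
    """Restore a saved protocol state and feed it one mutated packet at a
    time; return the first (packet, exception) that breaks the stack."""
    rng = random.Random(seed)            # seeded for reproducible failures
    for _ in range(iterations):
        state = copy.deepcopy(snapshot)  # restore the interesting state
        packet = mutate(rng)             # the fuzzed "next packet"
        try:
            step(state, ("rx", 0, packet))
        except Exception as exc:
            return packet, exc           # repro = snapshot + this one packet
    return None

def toy_step(state, event):
    """Stand-in transition function with a planted bug."""
    _, _, packet = event
    if packet and packet[0] == 0xFF:
        raise ValueError("unhandled header byte")  # the bug to be found
    return state, []

def random_packet(rng):
    return bytes(rng.randrange(256) for _ in range(4))

snapshot = {"state": "ESTABLISHED", "cwnd": 10}   # illustrative saved state
found = fuzz_from_snapshot(snapshot, toy_step, random_packet)
assert found is not None and found[0][0] == 0xFF  # harness located the bug
```

Because each iteration restores the same snapshot, any crash reproduces from just two artifacts: the saved state and the single offending packet.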
- Input received raw data
- Output received application data
- Input application data to send
- Output raw data to send
Obviously, since TCP connection state is time sensitive, the “raw data” wouldn’t just be the IP packet and headers, but also a time stamp telling the state machine when that packet was received/sent. If you want the state machine to keep track of time even when no packets are being received or sent, there could be an additional operation just to input a timestamp without additional packets. In effect, time is just another input that the user is responsible for feeding to the state machine at sufficiently fine intervals.
In practice, you could emulate this pattern with a callback-oriented protocol stack by populating an in-memory send/receive queue in your callback function, but that design can be somewhat inflexible because it forces potentially undesirable constraints, e.g. an extra memory copy that could otherwise be elided.
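The four operations above could be sketched as a queue-oriented facade like this (purely illustrative names, no real TCP parsing or segmentation), with time as an explicit input rather than something read from a clock:

```python
from collections import deque

class StackIO:
    """Queue-based interface mirroring the four operations: raw in, app out,
    app in, raw out, plus an explicit time input."""
    def __init__(self):
        self._app_rx = deque()   # received application data, ready to read
        self._raw_tx = deque()   # raw frames waiting to go on the wire
        self._now_ms = 0

    def input_raw(self, at_ms: int, frame: bytes):
        self._advance(at_ms)
        self._app_rx.append(frame)   # real stack: parse headers, reassemble

    def output_app(self):
        return self._app_rx.popleft() if self._app_rx else None

    def input_app(self, at_ms: int, data: bytes):
        self._advance(at_ms)
        self._raw_tx.append(data)    # real stack: segment, prepend headers

    def output_raw(self):
        return self._raw_tx.popleft() if self._raw_tx else None

    def input_time(self, at_ms: int):
        self._advance(at_ms)         # real stack: fire RTO/keepalive timers here

    def _advance(self, at_ms: int):
        self._now_ms = max(self._now_ms, at_ms)

io = StackIO()
io.input_app(5, b"hello")
assert io.output_raw() == b"hello"   # app data emerges as raw output
io.input_raw(10, b"world")
assert io.output_app() == b"world"   # raw input emerges as app data
```

Because the caller owns both queues and the clock, a test can drive the stack packet by packet and millisecond by millisecond, which is what makes the state machine deterministic to test.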