Tests that sometimes fail

(samsaffron.com)

217 points, sams99, 6y ago, 127 comments

We've had a couple of cases of flaky tests failing builds over the last two years at my company. Most often it's browser / end-to-end type tests (e.g. Selenium-style tests) that are the most flaky. Many of them only fail in 1-3% of cases, but if you have enough of them the chance of a failing build is significant.

If you have entire builds that are flaky, you end up training developers to just click "rebuild" the first one or two times a build fails, which can drastically increase the time before realizing the build is actually broken.

An important realization is that unit testing is not a good tool for testing flakiness of your main code - it is simply not a reliable indicator of failing code. Most of the time it's the test itself that is flaky, and it's not worth your time making every single test 100% reliable.

Some things we've implemented that help a lot:

1. Have a system to reproduce the random failures. It took about a day to build tooling that can run say 100 instances of any test suite in parallel in CircleCI, and record the failure rate of individual tests.

2. If a test has a failure rate of > 10%, it indicates an issue in that test that should be fixed. By fixing these tests, we've found a couple of techniques to increase overall robustness of our tests.

3. If a test has a failure rate of < 3%, it is likely not worth your time fixing it. For these, we retry each failing test up to three times. Not all test frameworks support retrying out of the box, but you can usually find a workaround. The retries can be restricted to specific tests or classes of tests if needed (e.g. only retry browser-based tests).
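The two mechanisms in this list, measuring a per-test failure rate and retrying a failing test a bounded number of times, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it runs sequentially in one process rather than as parallel CircleCI jobs, and `make_flaky_test` is a made-up stand-in for a real test, not anything from the commenter's tooling.

```python
import random
from typing import Callable


def failure_rate(test: Callable[[], None], runs: int = 100) -> float:
    """Run `test` repeatedly and return the fraction of runs that raised."""
    failures = 0
    for _ in range(runs):
        try:
            test()
        except Exception:
            failures += 1
    return failures / runs


def run_with_retries(test: Callable[[], None], attempts: int = 3) -> bool:
    """Return True if `test` passes within `attempts` tries, else False."""
    for _ in range(attempts):
        try:
            test()
            return True
        except Exception:
            continue
    return False


def make_flaky_test(fail_probability: float, seed: int) -> Callable[[], None]:
    """Hypothetical stand-in for a real test: fails at a fixed random rate."""
    rng = random.Random(seed)

    def test() -> None:
        if rng.random() < fail_probability:
            raise AssertionError("simulated flaky failure")

    return test


if __name__ == "__main__":
    flaky = make_flaky_test(fail_probability=0.03, seed=1234)
    print(f"observed failure rate: {failure_rate(flaky, runs=1000):.1%}")
    print(f"passed with retries: {run_with_retries(flaky)}")
```

The measured rate is what decides between the "fix it" and "retry it" buckets described above. The retry wraps a single test, not the whole suite.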

justinpombrio6y ago

> If a test has a failure rate of < 3%, it is likely not worth your time fixing it.

How do you know? What you say is plausible, but it's also plausible that these rarely-failing tests also rarely-fail in production, and occasionally break things badly and cause outages or make customers think of your software as flaky.

Since you say this, I presume you've spent the time to actually track down the root causes of several tests that fail < 3% of the time? If so, what did you find? Some sort of issues with the test framework, or issues with your own code that you're confident would only ever be exposed by testing, or something else? I'm very curious.

paulddraper6y ago

It's possible, but after fixing lots of these, my experience says we're usually talking about stuff like clicking a button before a modal animates out of the way.

It's sort of a "bug" in that yes, clicking here and then here 1ms later doesn't do the best thing, but it's basically irrelevant.

Testing is inherently a probabilistic endeavor.

"What can I do that is most likely to prevent the largest amount of bugginess?"

Fixing tests that rarely fail is -- in my experience -- a poor answer to such a question.

henrikschroder6y ago

> Testing is inherently a probabilistic endeavor.

That's a pretty powerful insight!

I think that a lot of developers who are firmly in the test-driven camp don't realize this, but instead think that if you have 100% test coverage, your code will work 100% of the time. Fixing bugs, to them, is "just" an inevitable result of increasing your test coverage, so that's what they focus on.

matharmin6y ago

My point here is that even if it may be because of flaky code, general unit and integration tests are the wrong tools to test for flaky code. The only exception I have encountered here is if you have code that is written to specifically handle concurrent situations, and your test is focussing specifically on testing the concurrency part.

The most common places these flaky tests occur are with integration/browser-based tests, where there are multiple layers of tools that each fail a small percentage of the time.

Unit tests also sometimes fail because of not cleaning up state properly, which only breaks things when tests run in a very specific order. Or sometimes subtle assumptions in the tests about database ordering that is only valid 99% of the time.

humanrebar6y ago

> Most of the time it's the test itself that is flaky

I have always understood that unit tests must inherently be deterministic for the reason you explain.

A small test that is not deterministic is testing something other than "the unit" since there is another independent variable unaccounted for, often the state of the database or the configuration of a test environment.

Not that unit tests are perfect. Unit testing a concurrent data structure without threads (which are inherently nondeterministic) is not especially useful.

mikepurvis6y ago

I see there as being a tension between determinism and mocking. Classic TDD dogma says to mock super close to the unit under test, so that the only logic in play is the logic within that unit. Which is all well and good, but there's lots of code out there where the stuff that breaks is the stuff on the interfaces; once you mock that out, you've removed a significant chunk of what might legitimately break, and therefore diminished the value of the test.

So it's a balance. Sometimes it really is worth it to just attack that one function with its weird snarl of if statements and initial conditions— totally. But there are other cases where part of what you want is to inspect what happens in the adjacent object, on a different thread, as a result of stimulating something under test conditions. This isn't wrong, and these kinds of tests can be really hard to get completely deterministic, especially if the CI environment is some heavily-loaded VM host with totally different thread switching characteristics from your laptop.

bluGill6y ago

I have come to conclude that excessive mocks are a symptom of poor architecture.

Classic TDD as you describe (see the other reply, classic TDD is different) works great for algorithms: take some data, manipulate it, and get different data out. There is no need for mocks. This is where your business logic should be, and it is easy to test.

However this fails in the real world because algorithms are but a minority of code: most code in my experience is just moving data around from subsystem to subsystem, and external collaborators. Here you do have collaborators and the interactions are the point. Mocks now start to make sense because the point is my subsystem delivers data to that something else, and I shouldn't know or care what that something else is.

I've seen the above fail in several ways. I've seen people mock their algorithm from the communication, but in practice the communication and the algorithm are tightly coupled anyway so changes in one will change the other.

Worse, I see many people test not the subsystem boundaries, but boundaries within the subsystem. That is they start writing the subsystem, and then realize (correctly) that they need to break the subsystem up, then they test the subsystem as it is broken down. This seems good, but it leads to brittle systems that cannot be changed because the sub-subsystem is now not allowed to change because it would break tests.

To understand this, remember, a test is an assertion that something will not change. Thus if you mock a collaborator you are asserting that the collaborator is a different subsystem and you are not allowed to refactor across this boundary. If the boundary is not an architecture boundary you shouldn't mock it because you might want to change it.

mikekchar6y ago

> Classic TDD dogma says to mock super close to the unit under test, so that the only logic in play is the logic within that unit.

I suppose it depends on your definition of "classic TDD dogma". Mocking really wasn't a thing until TDD had been around for about 5-10 years, so super classic TDD dogma has always been "don't mock" ;-)

"London School", GOOS, Outside-in approach has always been to mock heavily. I call it "wish based programming". You write a test, wishing that you had some facility and since you don't have it, you mock it. Then once the test is in place, you can write your code and eventually write production code that represents the mock (and personally, I remove the mock at that point).

It was really after that, as far as I can tell, that people started to get the idea that you should mock all your collaborators in order to isolate your units. This kind of isolation was never a thing originally (see Kent Beck's original book on the subject). If you watch DHH's conversations with Kent Beck (and I think Martin Fowler???) on the topic, they state pretty clearly that "Chicago School" is to avoid mocking except as a last resort (my own personal preference as well). Also take a look at Michael Feathers' discussion in his Legacy Code book for a good description of what the original ideas of fakes, stubs and mocks were. These days those definitions are practically lost.

I'm not sure why there has been this idea that mocking was always a part of TDD, but it definitely is a popular notion.

humanrebar6y ago

I wasn't arguing against less deterministic tests. I was just saying "unit test" isn't the name for them. Call them "small tests" or "smoke tests" or make up a new term.

dnautics6y ago

Not all tests are unit tests. I had a property test I was running that I eventually just turned off because it was working just fine on everyone's machine but would fail 60% of the time on Travis due to timeout issues. It got worse from 30% after Travis was sold; I suspect they are skimping on AWS. I probably should have written a more effect-dependent timeout, but it was hard to justify recoding something when your test is long and your retrigger is via Travis.

munk-a6y ago

I find

> 3. If a test has a failure rate of < 3%, it is likely not worth your time fixing it. For these, we retry each failing test up to three times. Not all test frameworks support retying out of the box, but you can usually find a workaround. The retries can be restricted to specific tests or classes of tests if needed (e.g. only retry browser-based tests).

to be pretty terrifying. I know that folks are under different amounts of pressure but we'd reject that code from merging here (or revert it out when we discovered the flakiness) as it's basically just a half-finished test that requires constant baby sitting.

viraptor6y ago

I'm not sure how you'd reject that flakey test. We're talking <3% so first let's assume that you don't even see the failure until 10 other PRs get merged. Not only do you not know what caused the failure, it could be that the failure is in a test which was in the code for ages but the new code breaks its assumptions / initial environment.

Sometimes you can't just point at one thing and say reject this or revert that without a long investigation.

munk-a6y ago

If the test failure is detected (so, you get super lucky) you should immediately reject the code, including the tests failing before some other fixups that didn't affect that test... Oftentimes it will take a long time to surface these, but I'm of the opinion that a broken build is show stopping until it's resolved - that doesn't mean 5-alarm all devs rush to the scene, but it does mean a free person picking it up as their next task or, barring that, bumping someone off of feature or other work to address the issue.

It may take a while... but while that flakey test exists in your codebase it will levy a constant cost on all of your developers.

corndoge6y ago

I'm jealous of your workplace's attitude and latitude towards testing

munk-a6y ago

We're the inheritors of a legacy code base, part of this involved taking a strong stance to go from zero to hero in terms of testing, no minor bugs are fixed in areas of code not covered with automated testing - this has made our feature work slow right down but we are lucky to have management's support in paying this cost now rather than paying interest on it as time progresses.

dahart6y ago

> Most of the time it's the test itself that is flaky

I recently went through a heavy de-flaking on a suite of Selenium tests. I found this comment to be true in my case; it was reasonable-seeming assumptions in the tests that caused flakiness more often than anything else. The second most common cause was timing or networking issues with the Selenium farm.

Spending the time actually de-flaking the tests was quite enlightening and led to some new best practices for both writing tests, and for spinning up Selenium instances.

Because of that experience, I'm not sure I would agree with giving up on tests that fail less than 3% of the time, because fixing one of those cases can sometimes fix all of them. Learning the root causes of test failures led me to implement some fixes that increased the stability of the entire test suite. Sometimes there's only one problem but it causes all the tests in a group to fail one at a time, infrequently and seemingly randomly.

scaryclam6y ago

I'm having similar issues with Selenium tests at the moment. Can you give any insights into deflaking tests failing on timing or networking issues (I'm guessing it's a case by case basis, but generic tips might work for some problems)? This is by far the biggest pair of reasons for the suite failing for us, so any help would be really...er...helpful! :D

dahart6y ago

For me, the main class of flaky Selenium tests were expecting CSS-related changes to appear on-screen in the same frame or one frame later, when sometimes they take more than one frame. The main result was to rarely/never execute multiple actions in a row without waiting; the tests changed considerably into pairs of [ action -> wait for specific CSS change on screen ] throughout all the tests. Also I did have some sleep calls for a few hard to test things, even though I knew it was a bad practice, and I spent time eliminating all of them and figuring out the fundamental reason it was hard to measure and/or adding something to the front end code to signal via CSS that a long running operation was complete.

On the setup side, the main networking issue for me was ssh. This might be different for you if you use either a Selenium cloud service, or a proper VPN setup. I was spinning up a virtual network on AWS for a Selenium farm in a fairly ad-hoc and manual way using ssh. The tests were sometimes (only very occasionally) starting before the ssh connection was actually ready, so I had to put a delay in the setup script to send & wait until a packet had actually been received before launching the test. I used netcat for that.
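The [ action -> wait for a specific change ] pairing described above boils down to polling for a condition with a deadline instead of sleeping for a fixed time. Below is a minimal, framework-free sketch in Python. `wait_until` and `FakePage` are hypothetical names invented for this illustration; in a real Selenium suite the library's own explicit waits would play the role of `wait_until`.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout: float = 5.0,
    interval: float = 0.05,
) -> None:
    """Poll `condition` until it returns True, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a page whose CSS class appears late."""

    def __init__(self, ready_after: float) -> None:
        self._ready_after = ready_after
        self._clicked_at = None

    def click_save(self) -> None:
        self._clicked_at = time.monotonic()

    def has_class(self, name: str) -> bool:
        if name != "saved" or self._clicked_at is None:
            return False
        return time.monotonic() - self._clicked_at >= self._ready_after


if __name__ == "__main__":
    page = FakePage(ready_after=0.2)
    page.click_save()                             # action
    wait_until(lambda: page.has_class("saved"))   # wait for the specific change
    print("saved marker observed")
```

The same helper covers the setup-side problem: the condition can just as well be "the tunnel accepts a connection" as "the element has this class".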

jdlshore6y ago

I fix flaky tests, and a 3% failure rate would drive me crazy. But I don't automatically rerun tests. I was skeptical of the idea that rerunning tests would help, so I did a bit of math:

A test suite with 1 test that fails 3% of the time will succeed 97% of the time. (1-.03)

A test suite with 10 tests that fail 3% of the time will succeed 74% of the time. (.97^10)

100 flaky tests? Now half your test runs fail. (.97^100)

You're retrying three times? Now your test suite is slow, but you can have up to 2,000 flaky tests before it starts becoming a real problem. Or 60,000 if you retry four times. (1-.03^3)^2000 = 94.7%; (1-.03^4)^60000 = 95.3%.

My conclusion: rerunning flaky tests is a legit way of solving the problem, as long as your tests aren't too slow. Still makes my skin itch, though. Fixing flaky tests forces me to face design flaws in my code.

(The math, in case I did it wrong: .03^3 = f = chance of a 3% failure test failing three runs in a row. 1-f = s = chance of test succeeding. s^1000 = chance of test run with 1000 flaky tests succeeding.)

matharmin6y ago

Your math is mostly correct, except that for 100 flaky tests your test run will only pass in 4.8% of cases.

That's why it's very important to retry individual tests, and not the entire test run.

jdlshore6y ago

Oops, you're right. I was moving too quick and misread .04755 (4.8%) as .4755 (48%).
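The arithmetic in this sub-thread is easy to re-check with a few lines of Python. The sketch assumes, as the comments do, that every flaky test fails independently with the same 3% probability. The function name is made up for the illustration.

```python
def suite_pass_probability(
    flaky_tests: int, fail_rate: float, attempts: int = 1
) -> float:
    """Chance that a whole run passes when each flaky test gets `attempts` tries."""
    # A test only fails the run if it fails every one of its attempts.
    per_test_failure = fail_rate ** attempts
    return (1 - per_test_failure) ** flaky_tests


if __name__ == "__main__":
    cases = [(1, 1), (10, 1), (100, 1), (2000, 3), (60000, 4)]
    for flaky_tests, attempts in cases:
        p = suite_pass_probability(flaky_tests, 0.03, attempts)
        print(f"{flaky_tests:>6} flaky tests, {attempts} attempt(s): {p:.1%}")
```

Run as a script, this gives roughly 97%, 74%, 4.8%, 94.7% and 95.3%, matching the corrected figures above.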

mceachen6y ago

Every company I've founded or worked for has struggled with flaky tests.

Twitter had a comprehensive browser and system test suite that took about an hour to run (and they had a large CI worker cluster). Flaky tests could and did scuttle deploys. It was a never-ending struggle to keep CI green, but most engineers saw de-flaking (not just deleting the test) as a critical task.

PhotoStructure has an 8-job GitLab CI pipeline that runs on macOS, Windows, and Linux. Keeping the ~3,000 (and growing) tests passing reliably has proven to be a non-trivial task, and researching why a given task is flaky on one OS versus another has almost invariably led to discovery and hardening of edge and corner conditions.

It seems that TFA only touched on set ordering, incomplete db resets and time issues. There are many other spectres to fight as soon as you deal with multi-process systems on multiple OSes, including file system case sensitivity, incomplete file system resets, fork behavior and child process management, and network and stream management.

There are several aspects I added to stabilize CI, including robust shutdown and child process management systems. I can't say I would have prioritized those things if I didn't have tests, but now that I have it, I'm glad they're there.

ajeet_dhaliwal6y ago

In my experience complex end-to-end tests cast a wide net that often results in finding a lot of issues and they provide enormous value. Their main negative is maintenance around robustness as the article discusses and hardening tests can take a lot of investment. That said, the alternative is worse (not having them) so I find your approach, and the author’s is what I’ve often done. I think there needs to be understanding (across the team and management) that automated tests are software and it will require a similar dev effort to maintaining any other software, especially so because there usually aren’t tests to test the tests!

I’m founder at Tesults (https://www.tesults.com) where we have a flaky test indicator that makes identifying these tests easier. It’s free to try and if you can’t get budget for a proper plan send me an email and I’ll do what I can.

In general the only way to never have flaky tests is to have simpler tests but I find those often don’t provide as much value - that’s just my personal belief after having spent years focused on automated tests, e2e tests do have robustness issues but the bugs they find make them totally worth it. Out of the issues mentioned in the article that affected my tests the most, it’s timing. They can be overcome though, I’ve run test suites with a couple of thousand e2e tests (browser) that have been highly robust and reliable after time was devoted to hardening them. You do have to focus on that and refuse to add new test cases until the existing ones are sorted out in some cases.

hexfran6y ago

Sorry for the OT, what is "TFA"?

mceachen6y ago

Sorry. The Fine Article. I didn't mean it in the disparaging connotation.

It's a reference to RTFM, Read The Fine Manual.

TIL: RTFM was a phrase from the 40s : "Read the field manual."

panopticon6y ago

I've never seen the F in RTFM mean “fine” before. I’ve always seen it used as the more vulgar “read the f*ing manual”.

OJFord6y ago

It's the same as OP, except it only means the Post, not the Poster. (The F* Article.)

Usually it's a kind of negative retort - 'well if you'd actually bothered to read TFA then ...' - but increasingly it seems to be used without such emotion (particularly, to me anyway, on HN) to mean simply 'the submission'.

ludwigvan6y ago

https://www.urbandictionary.com/define.php?term=TFA

wtetzner6y ago

I always read it as "the featured article".

joosters6y ago

In an old job, we had a frustrating test that passed well over 99 times in 100. It was shrugged off for a very long time until a developer eventually tracked it down to code that was generating a random SSL key pair. If the first byte of the key was 0, faulty code elsewhere would mishandle the key and the test failed.

Keeping the randomness in the test was the key factor in tracking down this obscure bug. If the test had been made completely deterministic, the test harness would never have discovered the problem. So although repeatable tests are in most cases a good thing, non-determinism can unearth problems. The trick is how to do this without sucking up huge amounts of bug-tracking time...

(Much effort was spent in making the test repeatable during debugging, but of course the crypto code elsewhere was deliberately trying to get as much randomness as it could source...)

kenha6y ago

It doesn't seem to be a strong argument to have non-deterministic tests.

There was the logic that generates the SSL key pair, and there is the faulty logic that consumes it. Based on the description, it seems it's an indication of missing test coverage around the faulty code. If, when the faulty code was written, more time were spent on understanding the assumptions the code has made, then maybe the test wouldn't appear in the first place.

This anecdote, however, does bring up a good point: Don't shrug off intermittently failed tests - Dig in and understand the root cause of it.

mcv6y ago

I'm not at all surprised that nobody considered the possibility that code might fail if a key starts with zero. It's often hard to identify all the edge cases.

Now that this edge case has been found, of course it should be replaced by deterministic tests that test the consuming code with different kinds of keys, including one with a leading zero.
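A deterministic regression test of that kind might look like the sketch below. The thread never says what the faulty code actually did, so the bug shown here is an assumption chosen purely for illustration: round-tripping key bytes through an integer and dropping the leading zero byte on the way back. All function names are hypothetical.

```python
def key_to_int(key: bytes) -> int:
    """Interpret key bytes as a big-endian unsigned integer."""
    return int.from_bytes(key, "big")


def int_to_key_buggy(value: int) -> bytes:
    """Hypothetical bug: uses the minimal length, so leading zero bytes vanish."""
    length = max(1, (value.bit_length() + 7) // 8)
    return value.to_bytes(length, "big")


def int_to_key_fixed(value: int, length: int) -> bytes:
    """Fixed version: the caller states the key length explicitly."""
    return value.to_bytes(length, "big")


# Fixed inputs instead of random ones, including the leading-zero edge case.
TEST_KEYS = [
    bytes([0x00, 0x01, 0x02, 0x03]),  # leading zero byte
    bytes([0x7F, 0x01, 0x02, 0x03]),
    bytes([0xFF, 0xFF, 0xFF, 0xFF]),
]


def test_round_trip_preserves_every_key() -> None:
    for key in TEST_KEYS:
        restored = int_to_key_fixed(key_to_int(key), len(key))
        assert restored == key, f"round trip changed {key.hex()}"


if __name__ == "__main__":
    test_round_trip_preserves_every_key()
    leading_zero = TEST_KEYS[0]
    # The buggy variant silently shortens the key with the leading zero.
    print(int_to_key_buggy(key_to_int(leading_zero)).hex())
```

With a random key, the leading-zero case comes up about once in 256 runs; listing it explicitly makes it come up every run.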

rakoo6y ago

So, in short, the test was failing because the code was incorrect, aka the test was working as designed. We've all had flaky tests but the attitude that says "run it again" has always been the wrong one, because there is a problem, yet we're all ok with deferring the issue.

AstralStorm6y ago

The only other solution is exhaustive property testing. And even that is not workable when concurrency is in play.

dllthomas6y ago

Good luck exhaustively testing something with a cryptographic key as input. Non-exhaustive property testing is also pretty cool, though.

grogers6y ago

There do exist frameworks that allow exhaustive testing of concurrent code. They never really became mainstream though.

https://www.microsoft.com/en-us/research/publication/chess-a...

http://www.1024cores.net/home/relacy-race-detector

toast06y ago

> There was the logic that generates the SSL key pair, and there is the faulty logic that consumes it. Based on the description, it seems it's an indication of missing test coverage around the faulty code. If, when the faulty code was written, more time were spent on understanding the assumptions the code has made, then maybe the test wouldn't appear in the first place.

Based on the description of the bug, the bug was in Microsoft's SChannel TLS stack (I know this, because I found it too, and got a workaround into OpenSSL). I don't know about you, but I haven't written a whole lot of comprehensive tests for any TLS stack that I've used, unless I wrote the stack. I'm assuming joosters didn't work for Microsoft, he just worked for a company that released software for some flavor of Windows and used their TLS stack, because it was there.

This definitely fits into the category of nobody is going to test this, because it's not going to occur to anyone to test how the third party TLS library they're using handles public keys encoded without leading zeros, until they've run into it before. Having a random key generated in a test suite means you've got a chance to see it; if you're lucky (and if you don't just retry everything without knowing why).

joosters6y ago

Nope, this was on a variety of unix systems, but no Windows code at that point. We used some RSA libraries for crypto primitives and hardware acceleration, with a lot of custom in-house code (BigInts and other math routines). There was no OpenSSL usage. I think the bug turned out to be in our own code, because it was a case of lots of frustration tracking it down, an ‘ah-ha!’ moment, and a quick code check-in fixing the problem there and then. (Yes, we also wrote a new test case!)

But it’s interesting to hear about an identical sounding bug in similar code/routines. I’d say ‘what are the chances?’ but crypto code is always painfully hard.

skybrian6y ago

Non-deterministic testing is related to fuzzing, which is a well-known way to find security bugs. The problem is that it's often too expensive to do all the time.

aidenn06y ago

Have a single source of randomness and record the seed as part of the test run. Then you have repeatable failures combined with finding obscure bugs.
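A minimal Python sketch of that idea follows. The environment variable name `TEST_SEED` and the helper name are assumptions made for the illustration. It only helps when every random choice in the run flows through this one generator.

```python
import os
import random
import time
from typing import Tuple


def make_test_rng(env_var: str = "TEST_SEED") -> Tuple[random.Random, int]:
    """Create the single RNG for a test run and return it with its seed."""
    raw = os.environ.get(env_var)
    # Reuse a seed supplied by the developer, otherwise pick a fresh one.
    seed = int(raw) if raw is not None else time.time_ns() % (2**32)
    # Record the seed in the test output so a failing run can be replayed.
    print(f"random seed for this run: {seed} (set {env_var}={seed} to replay)")
    return random.Random(seed), seed


if __name__ == "__main__":
    rng, seed = make_test_rng()
    sample_key = bytes(rng.randrange(256) for _ in range(16))
    print(sample_key.hex())
```

Rerunning with the printed seed exported as `TEST_SEED` reproduces the same sequence of values, and therefore the same test inputs.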

joosters6y ago

Well, yes. It's much easier to diagnose after the problem has been found :)

The 'gotcha' part in this case was that the SSL keygen code did not just use a random number seed but was grasping for entropy from other sources too (PIDs, perhaps? I can't recall the detail). That unknown made initial attempts to recreate the problem difficult.

Plus the failing test in question was not directly testing the SSL, merely making use of it. There were other SSL-specific tests run separately but they missed this strange corner case (in 'normal' use, it wasn't like 1 in 256 transactions would fail, otherwise that would have been much more obvious and we'd have had spurious-seeming failures all over the place).

That brings up another test pain: You can write lots of specific unit tests for every individual feature your code has, and they can all pass just fine. But when feature A, D and H happen to all be in use at once, you hit a separate problem. Onwards to 100% coverage...

bmm6o6y ago

> Well, yes. It's much easier to diagnose after the problem has been found :)

That's kind of unfair, it's pretty much the logical deduction if you think about randomness in your unit tests. I.e., suppose the test fails, then what? You want to reproduce, but really all you know is that the test indicates the possibility of a bug. In order to investigate, you really need to follow the same code path that the test did. How do you do that? Capture the seed.

jon8896y ago

I'm a bit confused, that seems like the test did its job, it found the faulty code elsewhere. (unless by elsewhere you mean in the test code) It just appeared to be flaky and was treated as such until someone looked into it.

munk-a6y ago

I would say that your non-deterministic components here weren't a good thing - instead the test was poorly written and didn't cover the assumptions of the code under test well. The fact that this bug was revealed by a test is useful, but I should hope that now the tests have cemented that case in a regression suite in a deterministic manner.

guelo6y ago

> the test was poorly written and didn't cover the assumptions of the code under test well

That's a cop out, you rarely know all the assumptions of all the code. The point of tests is to hopefully suss out those unknowns. Just saying "you should think harder and write better tests" doesn't result in better code.

munk-a6y ago

I don't disagree that it's hard to know all these shortfalls ahead of time, that's sort of why software has bugs and why all us devs stare in amazement anytime it's suggested "Could you folks just... write less bugs" but when the builds start to fail it should be acknowledged that your tests are incomplete, and it should (ideally) be a high priority task to either revert out the feature or fix the feature to resolve the issue. (There are all sorts of reasons this could be difficult, but I'd urge you to advocate strongly for fixing bugs as a higher priority task than feature work - it ends up saving businesses money)

tinus_hn6y ago

Sometimes you can store the random seed so you get the best of both worlds. Not with crypto though.

pytester6y ago

What I found to be the major reasons for flaky tests:

* Non-determinism in the code - e.g. select without an order by, random number generators, hashmaps turned into lists, etc. - Fixed by turning non-deterministic code into deterministic code, testing for properties rather than outcomes or isolating and mocking the non-deterministic code.

* Lack of control over the environment - e.g. calling a third party service that goes down occasionally, use of a locally run database that gets periodically upgraded by the package manager - fixed by gradually bringing everything required to run your software under control (e.g. installing specific versions without package manager, mocking 3rd party services, intercepting syscalls that get time and replacing them with consistent values).

* Race conditions - in this case the test should really repeat the same actions so that it consistently catches the flakiness.
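The first bullet's "select without an order by" case can be shown with Python's built-in `sqlite3` module. The table and rows are made up for the example. The point is only that the assertion should either ask the database for an order or not depend on one.

```python
import sqlite3
from typing import List


def make_db() -> sqlite3.Connection:
    """Build a small in-memory table; the table and rows are made up."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users (name) VALUES (?)",
        [("carol",), ("alice",), ("bob",)],
    )
    return conn


def names_unordered(conn: sqlite3.Connection) -> List[str]:
    # No ORDER BY: the database is free to return rows in any order.
    return [row[0] for row in conn.execute("SELECT name FROM users")]


def names_ordered(conn: sqlite3.Connection) -> List[str]:
    # Explicit ORDER BY: the result order is part of the query's contract.
    query = "SELECT name FROM users ORDER BY name"
    return [row[0] for row in conn.execute(query)]


if __name__ == "__main__":
    conn = make_db()
    # Fragile: names_unordered(conn) == ["carol", "alice", "bob"]
    # Robust option 1: make the code under test deterministic.
    assert names_ordered(conn) == ["alice", "bob", "carol"]
    # Robust option 2: test a property that ignores ordering.
    assert sorted(names_unordered(conn)) == ["alice", "bob", "carol"]
    print("ordering-independent assertions passed")
```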

roland356y ago

Some other funny causes I've seen:

- The temperature is much hotter/colder than normal

- Someone is inadvertently holding down a button or key on the machine under test

- The wrong version of software is loaded onto the machine

cpeterso6y ago

> - The temperature is much hotter/colder than normal

I like this 1999 story about a flaky test at Be:

> Two test engineers were in a crunch. The floppy drive they were currently testing would work all day while they ran a variety of stress tests, but the exact same tests would run for only eight hours at night. After a few days of double-checking the hardware, the testing procedure, and the recording devices, they decided to stay the night and watch what happened. For eight hours they stared at the floppy drive and drank espresso. The long dark night slowly turned into day and the sun shone in the window. The angled sunlight triggered the write-protection mechanism, which caused a write failure. A new casing was designed and the problem was solved. Who knew?

https://www.haiku-os.org/legacy-docs/benewsletter/Issue4-22....

skohan6y ago

Those first two seem like inadequate insulation of the test suite (no pun intended).

taneq6y ago

> e.g. calling a third party service that goes down occasionally

I thought tests weren't meant to have external dependencies (or at least, ones outside the control of the test harness)?

pytester6y ago

In the past I've had the external dependencies included until it started to cause issues. Some dependencies in some projects (e.g. hard coded CDN links, time) haven't actually caused any problems.

For very complex dependencies I would build a mock that could run in a passthrough / mock mode where I could test realistically (in passthrough mode) and test deterministically (in mock mode, using a recording of the passthrough mode).

This would be helpful in getting rid of flaky tests (mock mode), ensure 3rd party services don't get hammered (mock mode) and being able to isolate and detect breakages caused by external service changes (passthrough mode).
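One way to picture the passthrough / mock arrangement is a thin wrapper that records real responses in one mode and replays them in the other. The class and method names below are hypothetical, and `fetch` stands in for whatever call reaches the real service. It is a sketch of the general record-and-replay idea, not the commenter's implementation.

```python
from typing import Callable, Dict, Optional


class RecordReplayClient:
    """Records responses in passthrough mode and replays them in mock mode."""

    def __init__(
        self,
        fetch: Callable[[str], dict],
        recording: Optional[Dict[str, dict]] = None,
        mode: str = "passthrough",
    ) -> None:
        if mode not in ("passthrough", "mock"):
            raise ValueError(f"unknown mode: {mode}")
        self._fetch = fetch
        self.recording: Dict[str, dict] = {} if recording is None else recording
        self.mode = mode

    def get(self, key: str) -> dict:
        if self.mode == "mock":
            # Deterministic: never touches the real service.
            # Raises KeyError for requests that were never recorded.
            return self.recording[key]
        response = self._fetch(key)
        self.recording[key] = response  # keep it for later mock-mode runs
        return response


if __name__ == "__main__":
    def fake_remote(key: str) -> dict:
        """Stand-in for the real third-party call."""
        return {"key": key, "status": "ok"}

    live = RecordReplayClient(fake_remote, mode="passthrough")
    live.get("order/1")
    replay = RecordReplayClient(fake_remote, live.recording, mode="mock")
    print(replay.get("order/1"))
```

In practice the recording would be saved to a file and checked in, so mock-mode runs do not depend on the service being up.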

DougBTX6y ago

In this context, yes, tests shouldn't require external dependencies. By "tests" we're really talking about tests like, "is this particular build consistent with its spec?"

There could be other types of test where a remote call would make sense, for example, "was the deployment successful?" tests might try to verify that the deployed version of the software can communicate with external dependencies correctly.

yebyen6y ago

There are also cases that are less justified that you might have, especially once you start going down the road of "my dev environment should be a clone of production"

If you have an Employee model and it returns certain attributes of an employee like Salary, you might have tests that depend on the structure of an employee. You might have, say, Job and Position models which define an employee-job and the base definition of the particular job. Say Position has a salary range associated, and Job has validation rules which check that the salary is in range.

You could define factories for all those things, or you could use real examples that are served by a live Employee API.

The canonical way to address this is with factories and mocks; if you have time, do that! (It will probably save you in the long-run, when that complexity has grown a bit.)

If you just grab the example person whose salary is out of the range for their position and quickly test that the behavior in nearby modules matches your expectations, well, those are still tests, and you could be forgiven for writing them this way.

I think they call these the "London" and "Detroit" styles of mocking, but the short version IMHO is that the mistake was making dev a clone of production, and any errors in judgement that came after that were merely coping mechanisms.

If you want your tests to tell you when something has changed that requires your attention, you need a test that hits this Employee API and will fail if the structure of the employees returned is no longer conforming to your expectations, even though it's external. The design of such a thing is something I won't profess to know how to do well.

(It's better to version your API and write a changelog that tells what you need to know if the old version has been replaced by a new version, but if you're writing these microservices all for yourself it can seem pedantic to explicitly version your API, too. There are also coping mechanisms you'll need to embrace once you get to "we're not incrementing the API version" and surprise, many of them are the same ones...)

marcosdumay6y ago

Each thing you remove from your tests reduces the results value by some amount.

For some programs, testing without external dependencies is basically useless. Other times, you can remove them without much loss. But it's always better if you can keep them.

dmitriid6y ago

In theory, yes. In practice it's sometimes inconvenient, or hard, or impossible to set up all the mocks and proxies. Especially in integration tests.

roland356y ago

There was one weird bug reported to me in a microcontroller based project I was recently working on which shut off half the LCD screen. I wrote a test which blasted the LCD screen with random characters and commands and did not see the same error for a while... but it finally happened during a test! I was able to then see that when I was checking the LCD state between commands I would only toggle the chip select for the first half of the LCD (there were 2 driver chips built into the screen and you had to read each chip individually). There would be no way I could have recreated the bug without automated tests.

I have had to deal with non-deterministic tests with my embedded systems and robotic test suites and have found a few solutions to deal with them:

- Do a full power reset between tests if possible, or do it between test suites when you can combine tests together in suites that don't require a complete clean slate

- Reset all settings and parameters between tests. A lot of embedded systems have settings saved in Flash or EEPROM which can affect all sorts of behaviors, so make sure it always starts at the default setting.

- Have test commands for all system inputs and initialize all inputs to known values.

- Have test modes for all system outputs such as motors. If there is a motor which has a speed encoder, you can make the test mode drive the speed encoder input to match the commanded motor value, or trigger error inputs such as a stalled motor.

- Use a user input/dialog option to have user feedback as part of the test (for things like the LCD bug).

Robot Framework is a great tool which can do all these things with a custom Python library! I think testing embedded systems is generally much harder so people rarely do it, but I think it is a great tool which can oftentimes uncover these flaky errors.

darekkay6y ago

Related stories: "unit tests fail when run in Australia" [1] and "the case of the 500-mile email" [2]. There is a whole GitHub repository dedicated to some very interesting debugging stories [3].

[1] https://github.com/angular/angular.js/issues/5017

[2] http://www.ibiblio.org/harris/500milemail.html

[3] https://github.com/danluu/debugging-stories

zubspace6y ago

We call them Flip Floppers.

We do a lot of integration testing, more so than unit testing, and those tests, which randomly fail, are a real headache.

One thing I learned is that setting up tests correctly, independent of each other, is hard. It is even harder if databases, local and remote services are involved or if your software communicates with other software. You need to start those dependencies and take care of resetting their state, but there's always something: Services sometimes take longer to start, file handles not closing on time, code or applications which keep running when another test fails... etc, etc...

There are obvious solutions: Mocking everything, removing global state, writing more robust test setup code... But who has time for this? Fixing things correctly can even take more time and usually does not guarantee that some new change in the future disregards your correct code...

pytester6y ago

>There are obvious solutions: Mocking everything, removing global state, writing more robust test setup code... But who has time for this?

I find that doing all of this tends to actually save time overall; it's just that the up front investment is high and the payoff is realized over a long time.

Most software teams seem to prefer higher ongoing costs if it comes with quick wins to up front investment.

c0vfefe6y ago

Those are the age-old arguments against TDD. Every team will have to analyze the value proposition in their context to see if the return is worth the investment.

lm284696y ago

>There are obvious solutions: Mocking everything, removing global state, writing more robust test setup code... But who has time for this?

If you do it from the beginning and structure your code in a testable way it doesn't take much time. It has saved me a few times at my current company: make a small change -> turns out it breaks a feature from 3-4 years ago that no one even remembers -> look at the tests -> understand the feature as well as why what you did broke it.

If you try to do it after X years of coding without thinking about tests you're doomed though.

lukego6y ago

I have learned to love non-deterministic tests.

The world is non-deterministic. A test suite that can represent non-determinism is much more powerful than one that cannot. To paraphrase Dijkstra, "Determinism is just a special case of non-determinism, and not a very interesting one at that."

If a test is non-deterministic then a test framework needs to characterize the distribution of results for that test. For example "Branch A fails 11% (+/- 2%) of the time and Branch B fails 64% (+/- 2%) of the time." Once you are able to measure non-determinism then you can also effectively optimize it away, and you start looking for ways to introduce more of it into your test suites e.g. to run each test on a random CPU/distro/kernel.
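
As a rough illustration of characterizing that distribution, here is a sketch that estimates a failure rate with a 95% margin from repeated runs. The normal approximation to the binomial and the simulated 11% flaky test are assumptions made for the example; `failure_rate` is a hypothetical helper, not a feature of any test framework.

```ruby
# Estimate a test's failure rate from repeated runs, with a rough 95% margin
# (normal approximation to the binomial, reasonable for large run counts).
def failure_rate(results)
  n = results.size
  raise ArgumentError, "need at least one run" if n.zero?

  failures = results.count(false)   # each result is true (pass) or false (fail)
  rate = failures.to_f / n
  margin = 1.96 * Math.sqrt(rate * (1 - rate) / n)
  { rate: rate, margin: margin, runs: n }
end

# Hypothetical flaky test that fails about 11% of the time.
rng = Random.new(42)
flaky_test = -> { rng.rand >= 0.11 }   # true means the run passed

stats = failure_rate(Array.new(1000) { flaky_test.call })
puts format("fails %.1f%% (+/- %.1f%%) over %d runs",
            stats[:rate] * 100, stats[:margin] * 100, stats[:runs])
```

Comparing two branches then comes down to checking whether their intervals overlap, which is crude but often enough to spot a regression.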

muro6y ago

But you pay the cost of retrying the failing tests and lack of clear signal. And if the application code is flaky, users get to experience the breakage too.

lukego6y ago

If an application is flaky then I want to know: How frequently does it fail? How does this depend on combinations of configuration parameters? How does this compare between the stable, master, and next branches? etc.

The best way that I know for doing this is to write tests that are flaky because they expose the underlying flakiness in the application.

If an application is flaky and its test suite always runs 100% then I'd be pretty suspicious about that test suite being adequate.

mrkeen6y ago

> And if the application code is flaky

This is the only relevant factor. Forget the rest. Users don't experience your flaky tests just like they don't experience your messy Jira boards or your bad office coffee.

AstralStorm6y ago

How do you know which is failing without exhaustive analysis?

See, once you know why the test fails and it's not the tested application, which is exceedingly rare in practice, you can just disable it or fix it. But only if you're actually sure, not before.

mrkeen6y ago

Yes! To paraphrase John Hughes, "Every time you run your test suite, you should become more confident in your software."

throwaway57526y ago

Call it a pet peeve, but if we call it "chaos engineering" it costs a ton and gets people conference talks when a sporadic system integration issue is found. But if you have the same thing happen in a plain old CI half the time it will be ignored or flagged flaky.

dllthomas6y ago

IIUC, Chaos Engineering is about moving things out of "eh, it won't happen in production, I'll ignore it" into "it will happen in production, I'd better handle it", and making sure mitigation and recovery code is actually exercised in a realistic setting. "Periodic errors in CI that go unmitigated and produce test failures" seems very meaningfully distinct from chaos engineering.

throwaway57526y ago

If I use spinnaker and chaos monkey at prod scale (or even in a prod experiment) to create a circumstance where I can't perform a write of a resource because I couldn't achieve quorum in a replica set and that led to an inconsistency between two data stores... is that meaningfully different than observing the same issue but caused by incorrect vpc routing or insufficiently resourced test instances leading to slow node startup times/race conditions in a CI test environment?

I think there is overlap and that it does not have to be a choice between either approach.

dllthomas6y ago

The meaningful question isn't in observing the issue, but in what happens after. Chaos engineering is about making sure you still have enough of a chance of success in the face of failures. For CI, success means an error report that correctly captures whether the PR in question is breaking anything (... at least, anything we're testing). If your process means you can be sloppy about isolation and still get that, then I'd be okay with calling that an example of "chaos engineering". If being sloppy about isolation means you have failing tests in many CI runs that have nothing to do with the changes under consideration, that's not "chaos engineering" - it's just bad CI.

mekane86y ago

As soon as I saw that whole section on database-related flakiness my mind went from "flaky unit tests" to "tests called unit tests that are actually integration tests". I worked on a team where we labored under that misconception for a long, long time. By the time we finally realized that many of the tests in our suite were integration tests and not unit tests it was too late to change (due to budget and timeline pressure).

I really like the different approaches to dealing with these flaky tests, that is a good list.

mceachen6y ago

I think it's important that engineers can distinguish between testing code in isolation versus "integration" or "system" testing, but I've seen a sophomoric stigma around integration tests that leads to mocking hell, and a hatred towards testing in general.

Unit tests are great. You want them. Craft your interfaces to enable them.

Integration and system tests are important too. Again, crafting higher level interfaces that allow for testing will, in general, lead to a more ergonomic API.

Analogously: unit tests ensure each of your LEGO blocks are individually well-formed. Integration tests ensure that the build instructions actually result in something reasonable.

why-el6y ago

I think the definition has evolved. Since usually the DB is reset between tests (and such reset is snappy), it's transparent enough to appear as unit testing. That we do this in a buggy manner or do not understand how to properly reset the DB does not negate that fact in my opinion. You could mock it or inject it, therefore doing unit testing in the traditional sense, but again you could introduce even nastier bugs and a whole lot of indirection overhead.

jonthepirate6y ago

Hi - I'm Jon, creator of "Flaptastic" (https://www.flaptastic.com/) and passionate advocate for unit test health.

Having coded at both Lyft and at DoorDash, I noticed both companies had the exact same unit test health problems and I was forced to manually come up with ways to make the CI/CD reliable in both settings.

In my experience, most people want a turnkey solution to get them to a healthier place with their unit testing. "Flaptastic" is a flaky unit tests recognition engine written in a way that anybody can use it to clean up their flaky unit tests no matter what CI/CD or test suite you're already using.

Flaptastic is a test suite plugin that works with a SAAS backend that is able to differentiate between a unit test that failed due to broken application code versus tests that are failing with no merit and only because the tests are not written well. Our killer feature is that you get a "kill switch" to instantly disable any unit test that you know is unhealthy with an option to unkill it later when you've fixed the problem. The reason this is so powerful is that when you kill an unhealthy test, you are able to immediately unblock the whole team.

We're now working on a way to accept the junit.xml file from your test suite. We can run it through the flap recognition engine allowing you to make decisions on what you will do next if you know all of the tests that failed did fail due to known flaky test patterns.

If Flaptastic seems interesting, contact us on our chat widget and we'll let you use it for free indefinitely (for trial purposes) to decide if this makes your life easier.

andrey_utkin6y ago

At Undo we develop a "software flight recorder technology" - basically think of the `rr` reversible debugger, which is our open source competitor.

One particular usecase for Undo (besides obviously recording software bugs per se) is recording execution of tests. Huge time saver. We do this ourselves - when a test fails in CI, engineers can download a recording file of a failing test and investigate it with our reversible debugger.

roca6y ago

Yeah, this is huge. rr also has "chaos mode" to randomize things to make test failures easier to reproduce. (I understand Undo has something similar.)

I think that's one message that is completely lost in the article and in the rest of the comments here: it is possible to improve technology so that flaky tests are more debuggable.

With enough investment (hardware and OS support for low-impact always-on recording) we could make every flaky test debuggable.

bhaak6y ago

At our place, we call them "peuteterli" (loosely translated: "could-be-ish", constructed from the French "peut être" with the local German diminutive -li slapped on).

For the ID issue I have a monkey patch for Activerecord:

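      require "securerandom"

      # Test environments only: give each new record a random ID instead of a
      # sequential one, so IDs from different tables rarely coincide.
      if ["test", "cucumber"].include? Rails.env
        class ActiveRecord::Base
          before_create :set_id

          def set_id
            self.id ||= SecureRandom.random_number(999_999_999)
          end
        end
      end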

Unique IDs are also helpful when scanning for specific objects during test development. When all objects of different classes start with 1, it is hard to follow the connections.

notacoward6y ago

I deal with this issue a lot in my current job, and did in my last job too. IMX timing issues are by far the most common culprit. Usually it's because a test has to guess how long a background repair or garbage-collection activity will take, when in fact that duration can be highly variable. Shorter timeouts mean tests are unreliable. Longer timeouts mean greater reliability but tests that sometimes take forever. Speeding up the background processes can create CPU contention if tests are being run in parallel, making other tests seem flaky. Various kinds of race conditions in tests are also a problem, but not one I personally encounter that often. Probably has to do with the type of software I work on (storage) and the type of developers I consequently work with.
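
One common way to soften the timeout tradeoff is to poll for the condition up to a generous deadline instead of sleeping for a guessed duration, so the fast case stays fast. A sketch follows; `wait_until` is a hypothetical helper written for this example, and the thread stands in for a background repair job.

```ruby
# Poll for a condition until a deadline instead of sleeping a fixed, guessed time.
# Returns true as soon as the block is truthy, false if the deadline passes first.
def wait_until(timeout:, interval: 0.01)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep interval
  end
end

# Stand-in for a background job whose duration varies from run to run.
done = false
worker = Thread.new do
  sleep 0.05
  done = true
end

finished = wait_until(timeout: 2) { done }
worker.join
puts finished ? "background work finished" : "timed out"
```

This only helps when the test can observe completion. It does nothing for the CPU contention problem described above.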

No matter what, developers complain and try to avoid running the tests at all. I'd love to force their hand by making a successful test run an absolute requirement for committing code, but the very fact that tests have been slow and flaky since long before I got here means that would bring development to a standstill for weeks and I lack the authority (real or moral) for something that drastic. Failing that, I lean toward re-running tests a few times for those that are merely flaky (especially because of timing issues), and quarantine for those that are fully broken. Then there's still a challenge getting people to fix their broken tests, but life is full of tradeoffs like that.

Slartie6y ago

We usually call them "blinker tests" in our integration test suite. Reasons for blinker tests vary, but most are in line with what others here have already stated: concurrency, especially correct synchronization of test execution with stuff happening in asynchronous parts of the (distributed) system under test, is by far the biggest cause for problematic tests. This one is often exaggerated by the difference in concurrent execution on developer machines with maybe 4-6 cores and the CI server with 50-80, which often leads to "blinking" behavior that never happens locally, but every few builds on the CI server.

Second biggest is database transaction management and incorrect assumptions over when database changes become visible to other processes (which are in some way also concurrency problems, so it basically comes down to that). Third biggest is unintentional nondeterminism in the software, like people assuming that a certain collection implementation has deterministic order, but actually it doesn't, someone was just lucky to get the same order all the time while testing on the dev machine.

jonatron6y ago

"Making bad assumptions about DB ordering" That's caught me out before. Postgres is just weird, I had to run the same test in a loop for an hour before it'd randomly change the order.

anarazel6y ago

There are several reasons for potential ordering changes:

- the order of items on the page is different, due to the way tuples have been inserted (different external scheduling, different postgres internal scheduling)

- concurrent sequential scans can coordinate relation scans, which is quite helpful for relations that are larger than the cache

- different query plans, e.g. sequential vs index scans

Unless you specify the ORDER BY, there really isn't any guarantee by postgres. We could make it consistent, but that'd add overhead for everyone.
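
On the test side, the same point can be sketched without a database. Here `fetch_rows` is a made-up stand-in for a query with no ORDER BY, and the shuffle simulates the lack of any ordering guarantee.

```ruby
require "set"

# Stand-in for a query without ORDER BY: rows come back in no guaranteed order.
def fetch_rows(rng)
  [{ id: 1, name: "a" }, { id: 2, name: "b" }, { id: 3, name: "c" }].shuffle(random: rng)
end

rows = fetch_rows(Random.new)

# Brittle: assumes an order the data source never promised.
#   rows.map { |r| r[:id] } == [1, 2, 3]

# Option 1: impose the order yourself (the equivalent of adding ORDER BY id).
ids_sorted = rows.sort_by { |r| r[:id] }.map { |r| r[:id] }

# Option 2: compare as sets when the order genuinely does not matter.
same_members = rows.map { |r| r[:id] }.to_set == Set[1, 2, 3]

puts ids_sorted.inspect
puts same_members
```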

adamb6y ago

If anyone is looking for ideas for how to build tooling that fights flaky tests, I consolidated a number of lessons into a tool I open sourced a while ago.

https://github.com/ajbouh/qa

It will do things like separate out different kinds of test failures (by error message and stacktrace) and then measure their individual rates of incidence.

You can also ask it to reproduce a specific failure in a tight loop and once it succeeds it will drop you into a debugger session so you can explore what's going on.

There are demo videos in the project highlighting these techniques. Here's one: https://asciinema.org/a/dhdetw07drgyz78yr66bm57va
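
To illustrate the idea of separating failures by kind, here is a rough sketch. It is not how the tool above is implemented; the `Failure` struct, the digit-normalizing heuristic, and the sample data are all assumptions for the example.

```ruby
Failure = Struct.new(:test, :message, :backtrace)

# Build a signature from the error message and the top stack frame. Digits are
# replaced so "timed out after 31s" and "timed out after 29s" share a bucket.
def signature(failure)
  [failure.message.gsub(/\d+/, "N"), failure.backtrace.first]
end

# Fraction of all runs in which each kind of failure appeared.
def incidence(failures, total_runs)
  failures.group_by { |f| signature(f) }
          .transform_values { |group| group.size.to_f / total_runs }
end

failures = [
  Failure.new("test_sync", "timed out after 31s", ["sync.rb:10"]),
  Failure.new("test_sync", "timed out after 29s", ["sync.rb:10"]),
  Failure.new("test_sync", "connection refused", ["net.rb:42"]),
]

rates = incidence(failures, 100)
rates.each do |(message, frame), rate|
  puts format("%-22s %-12s %.0f%%", message, frame, rate * 100)
end
```

Tracking each signature separately matters because one test can be flaky for two unrelated reasons, and fixing the common one can hide the rare one.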

pjc506y ago

The two big problems seem to be concurrency (always a problem) and state, which immediately suggest that making things as functional as possible would help a lot.

Ideally all state that's used in a test would be reset to a known value at or before the start of the test, but this is quite hard for external non-mocked databases, clocks and so on.

For integration tests, do you run in a controllable "safe" environment and risk false-passes, or an environment as close as possible to production and risk intermittent failure?

A variant I've seen is "compiled languages may re-order floating point calculations between builds resulting in different answers", which is extremely annoying to deal with especially when you can't just epsilon it away.
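
The floating point effect is easy to reproduce by hand, since addition is not associative. A small sketch follows; `close?` is a hypothetical helper, and the relative tolerance is an arbitrary choice that will not suit every calculation.

```ruby
a = 0.1
b = 0.2
c = 0.3

left = (a + b) + c    # => 0.6000000000000001
right = a + (b + c)   # => 0.6

puts left == right    # the grouping alone changes the result

# Compare with a relative tolerance instead of exact equality.
def close?(x, y, rel_tol: 1e-9)
  (x - y).abs <= rel_tol * [x.abs, y.abs].max
end

puts close?(left, right)
```

When the error is amplified downstream, for example by a branch on the result, a tolerance no longer helps, which is the "can't just epsilon it away" case.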

AstralStorm6y ago

Why not both? Test suite too slow? Live test too dangerous or inconsistent?

rrnewton6y ago

Both this article and this comment thread include a number of different ideas regarding controlling (or randomizing) environmental factors: test ordering, system time, etc.

But why do all of this piecemeal? Our philosophy is to create a controlled test sandbox environment that makes all these aspects (including concurrency) reproducible:

https://www.cloudseal.io/blog/2018-04-06-intro-to-fixing-fla...

The idea is to guarantee that any flake is easy to reproduce. If people have objections to that approach, we'd love to hear them. Conversely, if you would be willing to test out our early prototype, get in touch.

invertednz6y ago

I used to work at a company with over 10,000 tests where we weren't able to get more than an 80% pass rate due to flaky tests. This article is great and covers a lot of the options for handling flaky tests. I founded Appsurify to make it easy for companies to handle flaky tests, with minimal effort.

First, don't delete them, flaky tests are still valuable and can still find bugs. We also had the challenge where a lot of the 'flakiness' was not the test or the application's fault but was caused by 3rd party providers. Even at Google "Almost 16% of our tests have some level of flakiness associated with them!" - John Micco, so just writing tests that aren't flaky isn't always possible.

Appsurify automatically raises defects when tests fail, and if the failure reason looks to be 'flakiness' (based on failure type, when the failure occurred, the change being made, previous known flaky failures) then we raise the defect as a "flaky" defect. Teams can then have the build fail based only on new defects and prevent it from failing when there are flaky test results.

We also prioritize the tests, so fewer tests are run and the ones that do run are more likely to fail due to a real defect, which also reduces the number of flaky test results.

pure-awesome6y ago

> A few months back we introduced a game.

> We created a topic on our development Discourse instance. Each time the test suite failed due to a flaky test we would assign the topic to the developer who originally wrote the test. Once fixed the developer who sorted it out would post a quick post mortem.

What's the game here? It just seems like a process. Useful, sure, but not particularly fun...

boothby6y ago

I'm the primary developer for a heuristic, nondeterministic algorithm. It's both production software, and also a neverending research project. Specifically, I can't guarantee that a particular random seed will always produce identical results because that hobbles my ability to make future improvements to the heuristic. I've got reasonable coverage of my base classes and subroutines, but minor changes to the heuristic can have significant impact on the "power" of the heuristic.

My solution was to add a calibrated set of benchmarks. For each problem in the test suite, I measure the probability of failure. From that probability, I can compute the probability of n repeated failures. Small regressions are ignored, but large regressions (p < .001) splat on CI. It's fast enough, accurate enough, and brings peace of mind.

I understand that, and why, engineers hate this. But it's greatly superior to nothing.
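
One possible reading of the arithmetic above, as a sketch: if a benchmark's calibrated failure rate is p and runs are independent, n failures in a row have probability p^n, and CI fails once that drops below the threshold. The function names and the independence assumption are mine, not necessarily the actual implementation.

```ruby
ALPHA = 0.001

# Probability of `failures` consecutive failures if the calibrated rate still
# holds (assumes independent runs).
def repeated_failure_probability(baseline_failure_rate, failures)
  baseline_failure_rate**failures
end

# Smallest number of consecutive failures that is too unlikely to be chance.
def failures_needed(baseline_failure_rate, alpha: ALPHA)
  unless baseline_failure_rate > 0 && baseline_failure_rate < 1
    raise ArgumentError, "rate must be strictly between 0 and 1"
  end

  n = 1
  n += 1 until repeated_failure_probability(baseline_failure_rate, n) < alpha
  n
end

puts failures_needed(0.2)   # a benchmark that normally fails 20% of the time
puts failures_needed(0.5)   # a harder benchmark that fails half the time
```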

tom-jh6y ago

We run in-browser end to end tests for our browser extension. There were several reasons for flakiness:

* Puppeteer (browser automation) bugs or improper use. Certain sequence of events could deadlock it, causing timeouts relatively rarely. The fix was sometimes upgrading puppeteer, sometimes debugging and working around the issue.

* Vendor API, particularly their oauth screen. When they smell automation, they will want to block the requests on security grounds. We have routed all requests through one IP address and reuse browser cookies to minimize this.

* Vendor API again, this time hitting limits on rare situations. We could have fewer parallel tests, but then you waste more time waiting.

Eventually, we will have to mock up this (fairly complex) API to progress. It's got to a point where I don't feel like adding more tests because they may cause further flakiness - not good.

mariefred6y ago

Flaky tests are indeed a big issue, the main concern being loss of confidence in the results.

The otherwise good advice for randomization has its drawbacks-

- it complicates issue reproduction, especially if the test flow itself is randomized and not just the data

- the same way it catches more issues, it might as well skip some

Something else that was mentioned but not stressed enough is the importance of clean environment as the basis for the test infrastructure.

A cleanup function is nice but using a virtual environment, Docker or a clean VM will save you a lot of debugging time finding environmental issues. The same goes for mocked or simplified elements if they contribute to the reproducibility of the system: for example, a simpler in-memory database makes it possible to re-create a clean database for each test instead of reverting.

AstralStorm6y ago

Sometimes it's the code that is flaky and not the test.

In case of concurrent execution there are only a few reasonably working tricks like Relacy and other exhaustive ordering checkers as well as formal proofs. Neither is cheap to use, so you will always get flaky tests there - or rather tests that do not always fail.

mariefred6y ago

If the code is flaky then I have earned my pay honestly; this is a problem that should be solved.

Subtle concurrency issues are indeed very difficult to find, debug and reproduce, and randomization could help with that simply by covering more space.

roland356y ago

I agree. I think a large majority of flaky tests, for me at least, stem from some variability in the initial conditions of the test. It is good to uncover all the dependencies!

AstralStorm6y ago

If it's the test that is flaky. But it means that if production is not extremely consistent, you will see these effects live. Better handle them correctly.

notacoward6y ago

Here's a Google testing blog post about the same thing in 2016.

https://testing.googleblog.com/2016/05/flaky-tests-at-google...

pytester6y ago

I really don't like their series of blog posts on flaky tests. The first one literally used the phrase "fact of life" and implied that nothing could really be done about it (suggesting avoiding high level tests as a result, which was dangerously bad advice) while this one reports a rate that is staggeringly high (16%!) and assumes an intrinsically hard problem ("world class engineers did this!") rather than a fault in their approach.

They could do with being a little more humble and focusing on improving their engineering practices.

zellyn6y ago

If any Googlers are reading this and have the knowledge, I’m curious whether things have improved since that article. The numbers are sobering.

bhuga6y ago

Not a googler, but they posted an update in 2017 with some more information: https://testing.googleblog.com/2017/04/where-do-our-flaky-te...

rellui6y ago

Personally I've always called them flaky tests. I agree with the article that flaky tests shouldn't be ignored completely. But the issue is they take much more effort than usual test failures to debug. So it comes down to a balancing act of how much effort you're willing to spend debugging these vs the chance that it's an actual issue.

In my few years of automation experience, I've only seen 2 instances where the flaky tests pointed to actual issues, and one of them should've been found by performance testing. Almost all of the rest were environment related issues. It's tough testing across all of the different platforms without running into some environment instability.

mannykannot6y ago

Tests are part of the system too, and if you accept lower standards for your test suite than you think you hold the product to, you have actually lowered your standards for the product to those you accept for the tests.

ArturT6y ago

For annoying flaky features tests, I use rspec-retry gem to repeat the test a few times before marking it as failed. It helped for integration tests with external sandbox API.
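
For anyone without that gem, the underlying idea is small enough to sketch in plain Ruby. Note this `with_retries` helper is a hypothetical illustration and not the rspec-retry API.

```ruby
# Run the block up to `attempts` times; re-raise the last error if every
# attempt fails. The block receives the current attempt number.
def with_retries(attempts: 3)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue StandardError
    retry if tries < attempts
    raise
  end
end

calls = 0
result = with_retries(attempts: 3) do |attempt|
  calls += 1
  raise "sandbox API hiccup" if attempt < 3   # simulated: fails twice, then passes
  :passed
end

puts "#{result} after #{calls} attempts"
```

It is worth logging every retry somewhere, otherwise a test that quietly needs three attempts every time looks perfectly healthy.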

I noticed discourse had a lot of flaky tests while using their repo to test my knapsack_pro ruby gem to run test suite with CI parallelisation. A few articles with CI examples of parallelisation can be found here https://docs.knapsackpro.com

I need to try the latest version of discourse code, maybe now it will be more stable to run tests in parallel.

chippy6y ago

One recent test that was sometimes failing was ordering a list. It was due to how I made a sequence of my fixtures using numbers as an affix to a string, so it was ordering correctly except in cases like "string 8, string 9, string 10".

I fixed it for me by creating a random selection from /usr/share/dict/words to make a large array of sorted words to choose from. This made the fixtures have better and amusing names such as "string trapezoidal, string understudy"
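
The original failure is easy to reproduce, since strings sort character by character. Zero-padding is another fix besides switching to words; the fixture names below are just for illustration.

```ruby
names = (8..10).map { |i| "string #{i}" }
sorted_names = names.sort
# => ["string 10", "string 8", "string 9"]   ("1" sorts before "8")

# Alternative fix: zero-pad the number so string order matches numeric order.
padded = (8..10).map { |i| format("string %03d", i) }
sorted_padded = padded.sort
# => ["string 008", "string 009", "string 010"]

puts sorted_names.inspect
puts sorted_padded.inspect
```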

boyter6y ago

These sort of tests are perfect examples for me to add to https://boyter.org/posts/expert-excuses-for-not-writing-unit... Tongue in cheek it is but I’m always on the lookout for additional examples to flesh it out.

pavel_lishin6y ago

Flaky tests are one of the factors that led me to leave a previous job. Test coverage was already so bad (and honestly, so was the code) that it was difficult to do anything with confidence - add to this that tests only sometimes worked, and writing code was basically a dice-roll. I got tired of the stress.

piokoch6y ago

"Non-deterministic tests have two problems, firstly they are useless, secondly they are a virulent infection that can completely ruin your entire test suite."

"To this I would like to add that flaky tests are an incredible cost to businesses."

I think that the misconception here is that "tests should not fail", because they are "cost", "has to be analyzed and fixed", etc.

An integration or functional test that is guaranteed to never fail is kind of useless for me. Good test with a lot of assertions will fail occasionally since things are happening - unexpected data are provided, someone manually played with the database, ntp service was accidentally stopped so the date is not accurate and filtering by date might fail, someone plugged in some additional system that alters/locks data.

In case of unit tests, well, if everything is mocked and isolated then yes, such test probably should never fail, but unit tests are mostly useful only if there is some complicated logic involved.

notacoward6y ago

> An integration or functional test that is guaranteed to never fail is kind of useless for me.

I think that's an important distinction between functional and integration tests. Generally, a functional test is supposed to exercise a particular set of APIs or code paths - across components in a semi-realistic arrangement, so unlike a unit test where all but one would be mocked, but still pretty focused. It's OK for such a test to ignore concerns outside of its own scope. Data validation/sanitization should have its own tests, for example, and not be a part of every other functional test. That's just duplication of effort for very little benefit.

By contrast, it's reasonable for an integration test to fail due to something external like NTP failure ... once. After that, there should be a separate functional/regression test to ensure that the dependency is properly isolated, and integration tests should be expected to pass consistently unless there's a new kind of fault. That allows integration tests to capture all of those dependencies over time, until the full set approximates the set that exists in production.

Don't worry too much about the precise dividing line between functional and integration tests, though. The important thing is that they're not synonyms. Whatever one calls them, there are different classes of tests with different purposes. Statements like "tests should never fail" or "tests that fail are better" are too general to be useful across all kinds of tests.

YjSe2GMQ6y ago

You clearly have not worked on a codebase with thousands of tests. At my previous job the build system had an option to run a test N times concurrently in the cloud. I used this whenever I wanted to commit to some other project but some of their tests were garbage (to prove that test is flaky, and therefore to be ignored). You could even binary search (running 1000 times on each pivot point) to see who introduced the flakiness. Expensive but gets the job done.

In my projects I either fix the nondeterminism or delete such tests.
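A minimal sketch of that bisection, assuming a linear history in which the first commit is clean and the last one is flaky. `fails_once` is a hypothetical callback that runs the test once at a given commit and reports whether it failed; nothing here reflects that build system's actual interface.

```python
def first_flaky_commit(commits, fails_once, runs=1000):
    """Return the earliest commit at which the test is observed to fail.

    Assumes commits[0] is clean, commits[-1] is flaky, and that flakiness,
    once introduced, stays in every later commit.
    """
    def is_flaky(commit):
        # A commit counts as flaky if any of `runs` executions fails.
        return any(fails_once(commit) for _ in range(runs))

    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] clean, commits[hi] flaky
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_flaky(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]
```

A pivot is only declared clean after `runs` passes in a row, so a test that fails with probability p slips through with probability (1 - p)^runs. That is negligible for p = 5% and 1000 runs, but rarer flakes need more runs per pivot.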

AstralStorm6y ago

Pseudorandom deterministic tests have their value, presuming you store faulty input and/or seed.

These are not exactly nondeterministic but sometimes people end up with that instead of pseudorandom ones.

rgoulter6y ago

"You won't have code like this obviously contrived example, but you might have code which is equivalent."

Ha, yes! The problem sounds super dumb and obvious once you explain it, but can be a PITA to track down or recognise in the code.

revskill6y ago

To me, unit tests only make sense for pure code.

For impure code, it makes no sense to write a unit test.

The ability to separate pure from impure code determines your test suites: what should go into unit tests and what should go into integration tests.

stagas6y ago

It looks like you are coupling your unit tests with your integration tests. At integration level, we test if the integration paths work under various conditions, that is, only the code that deals with whether our unit has been called correctly, with the right parameters, etc. At unit level, we mock all of the dependencies and test the branches of the effective code under various conditions. And at the acceptance level we should be testing our business logic requirements, to make sure all of our features are working the way they should, especially during refactoring (where integration and unit tests are subject to change).

AstralStorm6y ago

Not even close. Functionally pure code can be proven correct instead of tested. Or it can be tested exhaustively. It's the exact case where typical tests are worthless.

That is a small piece of actual software; everything else works with IO or state like databases, each of which comes with ordering and concurrency assumptions. Every time you have a variable that is changed, you have more state to test.

Almost all code is impure.

jdlshore6y ago

This is a great article. Grounded in experience, detailed, actionable. Nicely done.


matharmin6y ago

Some things we've implemented that helps a lot:

justinpombrio6y ago

> If a test has a failure rate of < 3%, it is likely not worth your time fixing it.

paulddraper6y ago

It's possible, but after fixing lots of these, my experience says we're usually talking about stuff like clicking a button before a modal animates out of the way.

It's sort of a "bug" in that yes, clicking here and then here 1ms later doesn't do the best thing, but it's basically irrelevant.

Testing is inherently a probabilistic endeavor.

"What can I do that is most likely to prevent the largest amount of bugginess?"

Fixing tests that rarely fail is -- in my experience -- a poor answer to such a question.

henrikschroder6y ago

> Testing is inherently a probabilistic endeavor.

That's a pretty powerful insight!

matharmin6y ago

The most common places these flaky tests occur are with integration/browser-based tests, where there are multiple layers of tools that each fail a small percentage of the time.

humanrebar6y ago

> Most of the time it's the test itself that is flaky

I have always understood that unit tests must inherently be deterministic for the reason you explain.

Not that unit tests are perfect. Unit testing a concurrent data structure without threads (which are inherently nondeterministic) is not especially useful.

mikepurvis6y ago

bluGill6y ago

I have come to conclude that excessive mocks are a symptom of poor architecture.

mikekchar6y ago

> Classic TDD dogma says to mock super close to the unit under test, so that the only logic in play is the logic within that unit.

I'm not sure why there has been this idea that mocking was always a part of TDD, but it definitely is a popular notion.

2 more replies

humanrebar6y ago

I wasn't arguing against less deterministic tests. I was just saying "unit test" isn't the name for them. Call them "small tests" or "smoke tests" or make up a new term.

dnautics6y ago

munk-a6y ago

I find

viraptor6y ago

Sometimes you can't just point at one thing and say reject this or revert that without a long investigation.

munk-a6y ago

It may take a while... but while that flaky test exists in your codebase it will levy a constant cost on all of your developers.

1 more reply

corndoge6y ago

I'm jealous of your workplace's attitude and latitude towards testing

munk-a6y ago

dahart6y ago

> Most of the time it's the test itself that is flaky

Spending the time actually de-flaking the tests was quite enlightening and led to some new best practices, both for writing tests and for spinning up Selenium instances.

scaryclam6y ago

dahart6y ago

1 more reply

jdlshore6y ago

I fix flaky tests, and a 3% failure rate would drive me crazy. But I don't automatically rerun tests. I was skeptical of the idea that rerunning tests would help, so I did a bit of math:

A test suite with 1 test that fails 3% of the time will succeed 97% of the time. (1-.03)

A test suite with 10 tests that fail 3% of the time will succeed 74% of the time. (.97^10)

100 flaky tests? Now half your test runs fail. (.97^100)

matharmin6y ago

Your math is mostly correct, except that for 100 flaky tests your test run will only pass in 4.8% of cases.

That's why it's very important to retry individual tests, and not the entire test run.

jdlshore6y ago

Oops, you're right. I was moving too quick and misread .04755 (4.8%) as .4755 (48%).
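The arithmetic in this exchange, and the difference between retrying individual tests and rerunning the whole suite, can be checked with a short sketch. It assumes every flaky test fails independently with the same probability, which real suites only approximate; the function names are made up for illustration.

```python
def suite_pass_rate(n_flaky, fail_rate, attempts_per_test=1):
    """Chance that one full run passes when each of n_flaky tests fails
    independently with fail_rate and is retried up to attempts_per_test times."""
    per_test_pass = 1 - fail_rate ** attempts_per_test
    return per_test_pass ** n_flaky


def pass_rate_with_full_reruns(n_flaky, fail_rate, reruns):
    """Chance that at least one of `reruns` complete runs passes (no per-test retry)."""
    single_run = suite_pass_rate(n_flaky, fail_rate)
    return 1 - (1 - single_run) ** reruns


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, "flaky tests, no retries:", suite_pass_rate(n, 0.03))
    print("100 tests, 3 tries per test:", suite_pass_rate(100, 0.03, 3))
    print("100 tests, 3 full reruns:   ", pass_rate_with_full_reruns(100, 0.03, 3))
```

With 100 tests at 3%, a single run passes about 4.8% of the time. Three full reruns raise that only to roughly 14%, while three attempts per individual test push it above 99%.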

mceachen6y ago

Every company I've founded or worked for has struggled with flaky tests.

ajeet_dhaliwal6y ago

hexfran6y ago

Sorry for the OT, what is "TFA"?

mceachen6y ago

Sorry. The Fine Article. I didn't mean it in the disparaging connotation.

It's a reference to RTFM, Read The Fine Manual.

TIL: RTFM was a phrase from the 40s : "Read the field manual."

panopticon6y ago

I've never seen the F in RTFM mean “fine” before. I’ve always seen it used as the more vulgar “read the f*ing manual”.

1 more reply

OJFord6y ago

It's the same as OP, except it only means the Post, not the Poster. (The F* Article.)

ludwigvan6y ago

https://www.urbandictionary.com/define.php?term=TFA

wtetzner6y ago

I always read it as "the featured article".

joosters6y ago

(Much effort was spent in making the test repeatable during debugging, but of course the crypto code elsewhere was deliberately trying to get as much randomness as it could source...)

kenha6y ago

It doesn't seem to be a strong argument to have non-deterministic tests.

This anecdote, however, does bring up a good point: Don't shrug off intermittently failed tests - Dig in and understand the root cause of it.

mcv6y ago

I'm not at all surprised that nobody considered the possibility that code might fail if a key starts with zero. It's often hard to identify all the edge cases.

Now that this edge case has been found, of course it should be covered by deterministic tests that exercise the consuming code with different kinds of keys, including one with a leading zero.
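As an illustration of pinning such an edge case down, here is a sketch with a hypothetical bug of the same flavour: a byte-to-integer round trip that silently drops leading zero bytes. The thread does not describe the actual code, so both functions and the key values are invented for the example.

```python
def roundtrip_naive(key: bytes) -> bytes:
    """Buggy: sizes the output from the integer's bit length, so leading
    zero bytes disappear."""
    n = int.from_bytes(key, "big")
    return n.to_bytes((n.bit_length() + 7) // 8, "big")


def roundtrip_fixed(key: bytes) -> bytes:
    """Keeps the original length, so leading zero bytes survive."""
    n = int.from_bytes(key, "big")
    return n.to_bytes(len(key), "big")


# Fixed inputs chosen to cover the edge cases explicitly rather than by luck.
EDGE_CASE_KEYS = [
    bytes.fromhex("00112233"),  # leading zero byte
    bytes.fromhex("ff112233"),  # high bit set
    bytes(4),                   # all zero bytes
]


def test_roundtrip_preserves_every_key():
    for key in EDGE_CASE_KEYS:
        assert roundtrip_fixed(key) == key, key.hex()
```

With uniformly random 4-byte keys the naive version would fail only about once in 256 runs, which is exactly the kind of rare failure that gets shrugged off as flakiness.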

rakoo6y ago

AstralStorm6y ago

The only other solution is exhaustive property testing. And even that is not workable when concurrency is in play.

dllthomas6y ago

Good luck exhaustively testing something with a cryptographic key as input. Non-exhaustive property testing is also pretty cool, though.

grogers6y ago

There do exist frameworks that allow exhaustive testing of concurrent code. They never really became mainstream though.

https://www.microsoft.com/en-us/research/publication/chess-a...

http://www.1024cores.net/home/relacy-race-detector

toast06y ago

joosters6y ago

But it’s interesting to hear about an identical sounding bug in similar code/routines. I’d say ‘what are the chances?’ but crypto code is always painfully hard.

skybrian6y ago

Non-deterministic testing is related to fuzzing, which is a well-known way to find security bugs. The problem is that it's often too expensive to do all the time.

aidenn06y ago

Have a single source of randomness and record the seed as part of the test run. Then you have repeatable failures combined with finding obscure bugs.
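A minimal sketch of that idea, assuming the seed is passed through an environment variable. `TEST_SEED` and the helper name are made up for the example, and this only covers randomness drawn from the returned generator, not OS entropy used by crypto code.

```python
import os
import random
import time


def seeded_rng(env_var="TEST_SEED"):
    """Return (seed, rng). Reuses the seed from the environment if set,
    otherwise picks a fresh one. The seed is always printed so a failing
    run can be replayed."""
    raw = os.environ.get(env_var)
    seed = int(raw) if raw else time.time_ns() % 2**32
    print(f"{env_var}={seed}")
    return seed, random.Random(seed)


def test_sorting_is_idempotent():
    seed, rng = seeded_rng()
    data = [rng.randint(-1000, 1000) for _ in range(50)]
    once = sorted(data)
    # Include the seed in the failure message so it ends up in the CI log.
    assert sorted(once) == once, f"failed with seed {seed}"
```

To replay a failure, rerun the same test with `TEST_SEED` set to the value printed by the failing run.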

joosters6y ago

Well, yes. It's much easier to diagnose after the problem has been found :)

bmm6o6y ago

> Well, yes. It's much easier to diagnose after the problem has been found :)

1 more reply

jon8896y ago

munk-a6y ago

guelo6y ago

> the test was poorly written and didn't cover the assumptions of the code under test well

munk-a6y ago

tinus_hn6y ago

Sometimes you can store the random seed so you get the best of both worlds. Not with crypto though.

pytester6y ago

What I found to be the major reasons for flaky tests:

* Race conditions - in this case the test should really repeat the same actions so that it consistently catches the flakiness.

roland356y ago

Some other funny causes I've seen:

- The temperature is much hotter/colder than normal

- Someone is inadvertently holding down a button or key on the machine under test

- The wrong version of software is loaded onto the machine

cpeterso6y ago

> - The temperature is much hotter/colder than normal

I like this 1999 story about a flaky test at Be:

https://www.haiku-os.org/legacy-docs/benewsletter/Issue4-22....

skohan6y ago

Those first two seem like inadequate insulation of the test suite (no pun intended).

taneq6y ago

> e.g. calling a third party service that goes down occasionally

I thought tests weren't meant to have external dependencies (or at least, ones outside the control of the test harness)?

pytester6y ago

In the past I've had the external dependencies included until it started to cause issues. Some dependencies in some projects (e.g. hard coded CDN links, time) haven't actually caused any problems.

DougBTX6y ago

In this context, yes, tests shouldn't require external dependencies. By "tests" we're really talking about tests like, "is this particular build consistent with its spec?"

yebyen6y ago

There are also cases that are less justified that you might have, especially once you start going down the road of "my dev environment should be a clone of production"

You could define factories for all those things, or you could use real examples that are served by a live Employee API.

The canonical way to address this is with factories and mocks, if you have time do that! (It will probably save you in the long-run, when that complexity has grown a bit.)

marcosdumay6y ago

Each thing you remove from your tests reduces the results value by some amount.

For some programs, testing without external dependencies is basically useless. Other times, you can remove them without much loss. But it's always better if you can keep them.

dmitriid6y ago

In theory, yes. In practice it's sometimes inconvenient, or hard, or impossible to set up all the mocks and proxies. Especially in integration tests.

roland356y ago

I have had to deal with non-deterministic tests with my embedded systems and robotic test suites and have found a few solutions to deal with them:

- Do a full power reset between tests if possible, or do it between test suites when you can combine tests together in suites that don't require a complete clean slate

- Have test commands for all system inputs and initialize all inputs to known values.

- Use a user input/dialog option to have user feedback as part of the test (for things like the LCD bug).

darekkay6y ago

Related stories: "unit tests fail when run in Australia" [1] and "the case of the 500-mile email" [2]. There is a whole GitHub repository dedicated to some very interesting debugging stories [3].

[1] https://github.com/angular/angular.js/issues/5017

[2] http://www.ibiblio.org/harris/500milemail.html

[3] https://github.com/danluu/debugging-stories

zubspace6y ago

We call them Flip Floppers.

We do a lot of integration testing, more so than unit testing, and those tests, which randomly fail, are a real headache.

pytester6y ago

>There are obvious solutions: Mocking everything, removing global state, writing more robust test setup code... But who has time for this?

I find that doing all of this tends to actually save time overall; it's just that the up front investment is high and the payoff is realized over a long time.

Most software teams seem to prefer higher ongoing costs if it comes with quick wins to up front investment.

c0vfefe6y ago

Those are the age-old arguments against TDD. Every team will have to analyze the value proposition in their context to see if the return is worth the investment.

lm284696y ago

>There are obvious solutions: Mocking everything, removing global state, writing more robust test setup code... But who has time for this?

If you try to do it after X years of coding without thinking about tests you're doomed though.

lukego6y ago

I have learned to love non-deterministic tests.

muro6y ago

But you pay the cost of retrying the failing tests and lack of clear signal. And if the application code is flaky, users get to experience the breakage too.

lukego6y ago

The best way that I know for doing this is to write tests that are flaky because they expose the underlying flakiness in the application.

If an application is flaky and its test suite always runs 100% then I'd be pretty suspicious about that test suite being adequate.

mrkeen6y ago

> And if the application code is flaky

This is the only relevant factor. Forget the rest. Users don't experience your flaky tests just like they don't experience your messy Jira boards or your bad office coffee.

AstralStorm6y ago

How do you know which is failing without exhaustive analysis?

See, once you know why the test fails and it's not the tested application, which is exceedingly rare in practice, you can just disable it or fix it. But only if you're actually sure, not before.

1 more reply

mrkeen6y ago

Yes! To paraphrase John Hughes, "Every time you run your test suite, you should become more confident in your software."

throwaway57526y ago

dllthomas6y ago

throwaway57526y ago

I think there is overlap and that it does not have to be a choice between either approach.

dllthomas6y ago

mekane86y ago

I really like the different approaches to dealing with these flaky tests, that is a good list.

mceachen6y ago

Unit tests are great. You want them. Craft your interfaces to enable them.

Integration and system tests are important too. Again, crafting higher level interfaces that allow for testing will, in general, lead to a more ergonomic API.

Analogously: unit tests ensure each of your LEGO blocks are individually well-formed. Integration tests ensure that the build instructions actually result in something reasonable.

why-el6y ago

jonthepirate6y ago

Hi - I'm Jon, creator of "Flaptastic" (https://www.flaptastic.com/) and passionate advocate for unit test health.

If Flaptastic seems interesting, contact us on our chat widget and we'll let you use it for free indefinitely (for trial purposes) to decide if this makes your life easier.

1 more reply

andrey_utkin6y ago

At Undo we develop a "software flight recorder technology" - basically think of `rr` reversible debugger, it is our open source competitor.

roca6y ago

Yeah, this is huge. rr also has "chaos mode" to randomize things to make test failures easier to reproduce. (I understand Undo has something similar.)

I think that's one message that is completely lost in the article and in the rest of the comments here: it is possible to improve technology so that flaky tests are more debuggable.

With enough investment (hardware and OS support for low-impact always-on recording) we could make every flaky test debuggable.

bhaak6y ago

At our place, we call them "peuteterli" (loosely translated: "could-be-ish", constructed from the French "peut être" with the local German diminutive -li slapped on).

For the ID issue I have a monkey patch for ActiveRecord:

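      require "securerandom"

      # Only patch in test-like environments; other environments keep normal IDs.
      if ["test", "cucumber"].include? Rails.env
        class ActiveRecord::Base
          before_create :set_id

          # Assign a random ID unless one was set explicitly, so tests
          # cannot quietly rely on IDs starting at 1.
          def set_id
            self.id ||= SecureRandom.random_number(999_999_999)
          end
        end
      end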

Unique IDs are also helpful when scanning for specific objects during test development. When the objects of every class start at ID 1, it is hard to follow the connections.

notacoward6y ago

Slartie6y ago

jonatron6y ago

"Making bad assumptions about DB ordering" That's caught me out before. Postgres is just weird, I had to run the same test in a loop for an hour before it'd randomly change the order.

anarazel6y ago

There's several reasons for potential ordering changes:

Unless you specify the ORDER BY, there really isn't any guarantee by postgres. We could make it consistent, but that'd add overhead for everyone.
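On the test side, the usual remedy is to either ask for an order explicitly or compare results without regard to order. A small sketch using an in-memory SQLite database as a stand-in for Postgres; the table and helper names are invented for the example.

```python
import sqlite3


def make_db():
    """Build a throwaway in-memory database with three users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users (name) VALUES (?)",
        [("carol",), ("alice",), ("bob",)],
    )
    return conn


def fetch_names(conn, ordered):
    sql = "SELECT name FROM users"
    if ordered:
        sql += " ORDER BY name"  # the only way to get a guaranteed order
    return [row[0] for row in conn.execute(sql)]


def test_when_order_is_part_of_the_contract():
    assert fetch_names(make_db(), ordered=True) == ["alice", "bob", "carol"]


def test_when_order_is_irrelevant():
    # No ORDER BY, so compare as a sorted list instead of assuming row order.
    assert sorted(fetch_names(make_db(), ordered=False)) == ["alice", "bob", "carol"]
```

An assertion on the raw unordered result might pass for a long time and then fail without any code change, which matches the behaviour described above.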

adamb6y ago

If anyone is looking for ideas for how to build tooling that fights flaky tests, I consolidated a number of lessons into a tool I open sourced a while ago.

https://github.com/ajbouh/qa

It will do things like separate out different kinds of test failures (by error message and stacktrace) and then measure their individual rates of incidence.

You can also ask it to reproduce a specific failure in a tight loop and once it succeeds it will drop you into a debugger session so you can explore what's going on.

There are demo videos in the project highlighting these techniques. Here's one: https://asciinema.org/a/dhdetw07drgyz78yr66bm57va
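The idea of separating failures by kind and measuring each rate can be sketched in a few lines. This is not how the `qa` tool is implemented; it is a rough illustration that assumes failures arrive as plain message strings and that replacing numbers and hex addresses is enough to group them.

```python
import re
from collections import Counter


def signature(message):
    """Collapse run-specific details so similar failures share one key."""
    sig = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)  # addresses first
    return re.sub(r"\d+", "<n>", sig)                  # then remaining numbers


def failure_rates(failure_messages, total_runs):
    """Map each failure signature to its share of all runs."""
    counts = Counter(signature(m) for m in failure_messages)
    return {sig: count / total_runs for sig, count in counts.items()}


if __name__ == "__main__":
    observed = [
        "timeout after 31 ms",
        "timeout after 45 ms",
        "null pointer at 0x7ffe12",
    ]
    for sig, rate in failure_rates(observed, total_runs=100).items():
        print(f"{rate:.1%}  {sig}")
```

Grouping first matters because a test that fails 3% of the time may really be two unrelated problems at 2% and 1%, each with a different cause.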

pjc506y ago

The two big problems seem to be concurrency (always a problem) and state, which immediately suggest that making things as functional as possible would help a lot.

Ideally all state that's used in a test would be reset to a known value at or before the start of the test, but this is quite hard for external non-mocked databases, clocks and so on.

For integration tests, do you run in a controllable "safe" environment and risk false-passes, or an environment as close as possible to production and risk intermittent failure?
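For the clock part of that state problem, one common approach is to pass the current time in as a parameter so a test can fix it. A sketch under that assumption; the function and record layout are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def created_within(records, window, now=None):
    """Return records created during the last `window`.

    `now` defaults to the real clock in production; tests pass a fixed value.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return [r for r in records if timedelta(0) <= now - r["created"] <= window]


def test_filter_uses_injected_clock():
    fixed_now = datetime(2019, 5, 1, 12, 0, tzinfo=timezone.utc)
    records = [
        {"id": 1, "created": fixed_now - timedelta(hours=1)},
        {"id": 2, "created": fixed_now - timedelta(days=3)},
    ]
    kept = created_within(records, timedelta(days=1), now=fixed_now)
    assert [r["id"] for r in kept] == [1]
```

The test no longer depends on when or where it runs, at the price of not exercising the real clock; that trade-off is the "safe environment versus production-like environment" question in miniature.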

AstralStorm6y ago

Why not both? Test suite too slow? Live test too dangerous or inconsistent?

rrnewton6y ago

Both this article and this comment thread include a number of different ideas regarding controlling (or randomizing) environmental factors: test ordering, system time, etc.

But why do all of this piecemeal? Our philosophy is to create a controlled test sandbox environment that makes all these aspects (including concurrency) reproducible:

https://www.cloudseal.io/blog/2018-04-06-intro-to-fixing-fla...

invertednz6y ago

We also prioritize the tests, so fewer tests are run and the ones that do run are more likely to fail due to a real defect, which also reduces the number of flaky test results.

pure-awesome6y ago

> A few months back we introduced a game.

What's the game here? It just seems like a process. Useful, sure, but not particularly fun...

boothby6y ago

I understand that, and why, engineers hate this. But it's greatly superior to nothing.

tom-jh6y ago

We run in-browser end to end tests for our browser extension. There were several reasons for flakiness:

* Vendor API again, this time hitting limits on rare situations. We could have less parallel tests, but then you waste more time waiting.

Eventually, we will have to mock up this (fairly complex) API to progress. It's got to a point where I don't feel like adding more tests because they may cause further flakiness - not good.

mariefred6y ago

Flaky tests are indeed a big issue, the main concern being loss of confidence in the results.

The otherwise good advice for randomization has its drawbacks:

- it complicates issue reproduction, especially if the test flow itself is randomized and not just the data

- the same way it catches more issues, it might as well skip some

Something else that was mentioned but not stressed enough is the importance of clean environment as the basis for the test infrastructure.

AstralStorm6y ago

Sometimes it's the code that is flaky and not the test.

mariefred6y ago

If the code is flaky then I have earned my pay honestly; this is a problem that should be solved.

Subtle concurrency issues are indeed very difficult to find, debug, and reproduce, and randomization could help with that simply by covering more space.

roland356y ago

I agree. I think a large majority of flaky tests, for me at least, stem from some variability in the initial conditions of the test. It is good to uncover all the dependencies!

AstralStorm6y ago

If it's the test that is flaky. But it means that if production is not extremely consistent, you will see these effects live. Better handle them correctly.

notacoward6y ago

Here's a Google testing blog post about the same thing in 2016.

https://testing.googleblog.com/2016/05/flaky-tests-at-google...

pytester6y ago

They could do with being a little more humble and focusing on improving their engineering practices.

zellyn6y ago

If any Googlers are reading this and have the knowledge, I’m curious whether things have improved since that article. The numbers are sobering.

bhuga6y ago

Not a googler, but they posted an update in 2017 with some more information: https://testing.googleblog.com/2017/04/where-do-our-flaky-te...

rellui6y ago

mannykannot6y ago

ArturT6y ago

For annoying flaky feature tests, I use the rspec-retry gem to repeat the test a few times before marking it as failed. It helped for integration tests with an external sandbox API.

I need to try the latest version of discourse code, maybe now it will be more stable to run tests in parallel.
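For frameworks without a retry plugin, the per-test retry described above can be approximated with a decorator. This is a generic Python sketch, not the rspec-retry API; the decorator name and its parameters are made up for the example.

```python
import functools


def retry(times=3, exceptions=(AssertionError,)):
    """Re-run a test function up to `times` times before letting it fail."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return test_fn(*args, **kwargs)
                except exceptions:
                    if attempt == times:
                        raise  # out of attempts: report the real failure
                    # Log every retry so flaky tests stay visible.
                    print(f"{test_fn.__name__}: attempt {attempt} failed, retrying")
        return wrapper
    return decorator
```

Applying it only to the tests known to be flaky (for example the browser-based ones) keeps a genuinely broken test from being retried into a slow, noisy failure.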

chippy6y ago

boyter6y ago

pavel_lishin6y ago

piokoch6y ago

"Non-deterministic tests have two problems, firstly they are useless, secondly they are a virulent infection that can completely ruin your entire test suite."

"To this I would like to add that flaky tests are an incredible cost to businesses."

I think that the misconception here is that "tests should not fail", because they are "cost", "has to be analyzed and fixed", etc.
