Their testing system depends on tests printing "OK" after every test. This means that in many cases, tests failing are indicated by the _absence_ of "OK" being printed.
(We've attempted to isolate those parts and write our own stuff testing against upstream in pytest. We once presented a proposal to move them to pytest, offering to do any work and even wrote pytest plugins to seamlessly integrate with their current system. We got a - literal - "Thanks, but no thanks.")
Oof. If they'd instead put "ok" before every test, they might have been accidentally compatible with TAP! https://testanything.org
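For anyone unfamiliar: TAP is just a plan line (`1..N`) followed by one `ok`/`not ok` line per test. A minimal producer sketch (the `run_tap` helper and test names are made up for illustration, not part of any library):

```python
def run_tap(tests):
    """Run (name, fn) pairs and emit Test Anything Protocol output."""
    print(f"1..{len(tests)}")  # the TAP "plan" line
    for i, (name, fn) in enumerate(tests, 1):
        try:
            fn()
            print(f"ok {i} - {name}")
        except Exception as exc:
            print(f"not ok {i} - {name}: {exc}")

def addition_works():
    assert 1 + 1 == 2

def math_is_broken():
    assert 1 + 1 == 3, "expected 3"

run_tap([("addition works", addition_works),
         ("math is broken", math_is_broken)])
# Output:
# 1..2
# ok 1 - addition works
# not ok 2 - math is broken: expected 3
```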
It is nice to not have to depend on the language runtime to do the test.
You could even be a victim of "Embrace, Extend, Extinguish".
My advice is to consider forking it, and poaching contributors, in the interests of common good.
I’d love to investigate this further.
Experience has taught me that the “right” testing framework for a project is whatever the developers are happy and productive with.
If so, that mistake would appear to be common, since it came up right away.
Maybe unit tests need unit tests? (There’s probably a lint rule to catch what I describe above)
Yep - meta-testing (ensuring that every unit test that exists in a project adds unique coverage, remains valid, runs as expected, and I'm sure many other properties) could (and should!) definitely be automated.
Some more advanced meta-testing could involve tracking changes to a project's source history over time (in other words: tests that run with commit history). By that I'm thinking of situations like: "does this test genuinely still test what it used to, after the test and/or application code was modified?"
But yeah, that would be a good thing for a linter to catch. I'm not aware of any that do.
A reviewer should catch this error easily. I kind of think many don't give much attention to unittests when reviewing. Which is bad. Good unittests are far harder to write than good code.
There are much subtler errors of this class (false negatives / tests that always pass).
The fix isn't to blame people for making mistakes. It's to figure out a design that doesn't allow this mistake to happen in the first place.
For example, the method could (today) require the second argument to be a keyword argument. This is also something a good linter should be able to warn on.
edit: rikatee and I wrote essentially the same reply at the same time. :-)
We do code review because we expect human error when code is written by a human, but then we expect no human error when that code is read (reviewed) by a human? Any process that expects zero human error will always fail.
That's where linters add value: they allow devs to do what humans are good at (the creative complex and interesting stuff) while the bots do what bots are good at (the boring repetitive stuff)
Instead popular culture has decided that at best, this is what Schrödinger believed (ha, those crazy scientists), and at worst that somehow the cat being dead and not-dead at the same time is the core idea of quantum physics :/
> assertTrue also accepts a second argument, which is the custom error message to show if the first argument is not truthy. This call signature allows the mistake to be made and the test to pass and therefore possibly fail silently.
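To make that failure mode concrete, here's a minimal sketch: the test below was presumably meant to assert `1 + 1 == 3`, but the comma turns the `3` into the custom message, and since `2` is truthy the test passes:

```python
import unittest

class SilentPass(unittest.TestCase):
    def test_arithmetic(self):
        # Intended: self.assertTrue(1 + 1 == 3)  -- would fail.
        # Actual: the 3 becomes the msg argument; 2 is truthy, so this passes.
        self.assertTrue(1 + 1, 3)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(SilentPass)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True -- the broken test "passes" silently
```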
With modern features in Python you could change the signature to

    assertTrue(expr, *, msg=None)

which would prevent that issue (compare the plain assert statement: `assert expr, "custom message"`). Though given the verbose API, it is OK to require the explicit msg kwarg; duplication in the tests is fine if it makes them more robust.
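A sketch of what a keyword-only `msg` buys you. The `assert_true` wrapper here is hypothetical, not the real unittest method; it just demonstrates how `*` turns the silent pass into a loud error:

```python
def assert_true(expr, *, msg=None):
    """Like assertTrue, but msg must be passed by keyword."""
    if not expr:
        raise AssertionError(msg or f"{expr!r} is not truthy")

assert_true(1 + 1 == 2)                   # fine
assert_true(2 > 1, msg="custom message")  # fine

try:
    assert_true(1 + 1, 3)  # the classic mistake -- now a TypeError
except TypeError as exc:
    print("caught:", exc)
```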
Searching for announced breaking changes to the arguments of functions included with Python...
https://bugs.python.org/issue25628
https://bugs.python.org/issue29193
Bear in mind only 28% of codebases actually use the built-in unittest package that this gotcha affects, so really it's 20 of the 28% of 666, aka 10% ... but that claim would be hard to justify for folks that dig into stats.