Honestly, as long as the GUI tip remains as small as possible I am mostly fine with whatever shape it takes below there. For modern web applications with a lot of APIs it does make sense to use a trophy. For other applications without such a communication layer a more traditional pyramid does make more sense.
What a lot of people often seem to completely overlook in discussions like this is that the pyramid isn't a goal in itself. It is intended as a way to think about where you place your tests. More specifically place tests where they make sense, provide most value and are least fragile.
Which is why the GUI should be avoided for any tests that test logic, hence it being the smallest section of whatever shape you come up with. Everything else highly depends on what sort of infrastructure you are dealing with, the scope of your application, etc.
Yes, the pyramid was set out as a goal in its original incarnation. That was deeply wrong. The shape ought to be emergent, determined by the nature of the app being tested (I went into detail on what should determine that here https://news.ycombinator.com/item?id=42709404)
Some of the most useful tests I've worked with HAVE had a large GUI tip. The GUI was the most stable surface: its behavior was clearly defined and everybody agreed upon it. All the code got tested. GUI tests provided the greatest freedom to refactor, covered the most bugs and provided the most value by far on that project.
GUI tests are not inherently fragile or inherently too slow either. This is just a tendency that is highly context specific, and as the "pyramid" demonstrates - if you build a rule out of a tendency that is context specific it's going to be a shit rule.
This might be true, but it might also say something about the layers below and actually be a symptom of a larger issue within the development organisation.
> GUI tests are not inherently fragile or inherently too slow either.
Compared to API tests or unit tests they are, though. Not only do you need a machine to navigate an interface that is actually intended for humans, you also need to deal with the additional overhead.
Our libraries with a lot of logic and calculations are dominated by unit tests, our libraries that talk to external APIs are dominated by integration tests. That's just good testing; I'm not sure you need to imagine a pyramid or a vase to decide which tests you do.
My personal rule of thumb is something like: If it makes you go slower, you're doing too little/much testing, if it makes you go faster, you're doing the right amount of testing.
If you find yourself having to rewrite 10% of your test suite every time the code changes, you're probably doing too much testing (or not treating your test code as production code). If you find yourself breaking stuff all over the place when doing other things, you're doing too little testing.
As with most things, it's a balance; going too far in either direction will hurt.
Especially if there's something not entirely straightforward about it, like needing to figure out how to instrument/harness something for the first time so that you can actually test against it. (Arguably that inherently doesn't happen with unit tests, though, I guess.)
I think the problem is that some devs want hard and fast rules when, a lot of the time, the right answer is "it depends" and actions dictated by experience.
There is the dogmatic rules crowd (which triggers my self-diagnosed oppositional defiant disorder), and the "it depends" crowd, which left me screaming, "ON WHAT?!"
When I was in this position, I found Kent Beck and Martin Fowler's notion of "Code Smells" [0] really helpful. Though admittedly, the comprehensive enumeration with associated Refactorings was probably a bridge too far.
"Code Smells" lean toward the "it depends" vibe, but with just enough structure to aid in decision making. It also bypasses my inflexible opposition to stupid rules in stupid places.
I try to frame too much or too little testing as a Code Smell and discussing it that way often (not always) leads to reasonably easy consensus related to what we should (or shouldn't) do about it.
> Computers were slower, testing and debugging tools were rudimentary, and developer infrastructure was bloated, cumbersome, and inefficient.
What AMD giveth, Electron taketh away.
No matter how fast computers get, developers will figure out a way to use that extra compute to make the build and the test cycle slower again.
Of course it is all relative - it is hard to define what a "unit" test is when you are building on top of enormous abstractions to begin with.
No matter what you call the test, it should be fast. I feel productive when I can iterate on a piece of code with 2 second feedback time.
This is actually true, but the moralistic negative tone and the lack of explanation make me think the writer did not understand why this is happening and why it has both pros and cons. It's similar to another statement I've heard on this subject: "It's pointless to add/widen roads, there will always be traffic." It's true there will always be traffic, but it's not pointless. There will always be traffic because moving cars becomes faster, so more people drive. Consider, though, that traffic on a single lane serves 100 people, while traffic on a two-lane street serves many more.

The same is true for software development. Computers get faster, but programs tend to stay at roughly the same perceived speed, just as roads widen and yet there is still the same amount of traffic. When computers get faster, developers can write code faster, and so they can write more code and/or cheaper code. Writing programs also becomes cheaper, so developers need less expertise and training.

The computer that brought astronauts to the moon was probably less powerful than today's smart thermostat. Yet landing on the moon with that computer required a team of people likely at PhD level, intensely focused and dedicated, all socially and culturally adjacent to the inventors of the computer. By comparison, today's programs do trivial things using immense resources. And yet, because many more developers can code, there are also immensely more programs covering millions of use cases, developed all over the world, in some cases by people who do not even speak English.
So programs did become less efficient because the true bottleneck was not the efficiency of the program. The true bottleneck was developer hours and skills.
This doesn't mean that it's okay for all programs to be slow or that you should be satisfied using programs that you perceive as slow. The relationship between a program's speed/efficiency and its UX is a curve of diminishing returns: at first, the faster the program gets, the better the UX; after a certain speed, though, the UX improves only marginally. If the final user cannot distinguish between the speeds of two different programs, the bottleneck is no longer speed, and another characteristic becomes the bottleneck. This said, there will always be work for efficiency engineers and low-level developers writing more performant code. But not all code needs to be written as efficiently as possible.
The time it takes to boot an operating system, start a program, compile a program, or run a test suite has seemed to remain roughly constant over my career.
It indicates that the determining factor is not the clock speed of the underlying system but instead the pain tolerance of the users or developers.
In addition to that, I think a major point is that the testing pyramid was conceived in a world where desktop GUI applications ruled. Testing a desktop GUI was incredibly expensive and automation extremely fragile. That is in my opinion where the pointy tip of the pyramid came from in the first place.
"But the majority of tests are of the entire service, via its API [..]"
I think this is where you get the best bang for your buck because your goal to keep your tests robust is well aligned with the goal to keep the API stable. This is not the case above and below, where the goal of robust tests is always at odds with change, quick adaption and rapid iteration.
Sometimes that requires E2E tests, sometimes that's integration or unit tests.
My preference is to use something like functional core/imperative shell as much as possible, but the more external dependencies you have the more work you have to do to create an isolated environment free of IO. Not saying it isn't worthwhile, but sometimes it's easier to simply just accept that the tests will be slower due to relying on real endpoints and move on. After all, tests should support velocity, not be an end in and of themselves.
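To make the functional core/imperative shell idea concrete, here is a minimal Python sketch (all names hypothetical): the pure core is trivially unit-testable, and the shell at the edges takes its IO as injected callables so even it can be exercised with fakes.

```python
# Functional core: pure, deterministic, no IO -- trivially unit-testable.
def apply_discount(order_total: float, loyalty_years: int) -> float:
    """Pricing logic: 5% off per loyalty year, capped at 25%."""
    discount = min(0.05 * loyalty_years, 0.25)
    return round(order_total * (1 - discount), 2)

# Imperative shell: thin orchestration; IO lives only at the edges.
def charge_customer(customer_id: str, fetch_order, payment_gateway) -> float:
    order_total, loyalty_years = fetch_order(customer_id)  # IO at the edge
    amount = apply_discount(order_total, loyalty_years)    # pure core
    payment_gateway(customer_id, amount)                   # IO at the edge
    return amount
```

The point is that most of the test weight lands on `apply_discount`, which needs no environment at all, while the shell stays small enough that a couple of slower tests against real endpoints suffice.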
We like units because they are fast, deterministic, parallelisable... all the good stuff. Relative to that ideal, integrations are slower, flakier, more sequential, etc.
While I've never gone full-TDD, those guys have it absolutely right that testability is a design/coding activity, not a testing activity. TDD will tell you if you're writing unit-testable or not, but it won't tell you how. Dependency-inversion / Ports-and-adapters / Hexagonal-architecture are the topics to read on how to write testable code.
What's my personal stake in this? Firstly, our bugfix-to-prod-release window is about four hours. Way too long. Secondly, as someone relatively new to this codebase, when I stumble across some suspicious logic, I can't just spit out a new unit test to see what it does, since it's so intermingled with MS-SQL and partner integrations. Our methods pass around handles to the DB like candy.
So what I think has happened here, is that we generally don't think about writing testable code as an industry. Therefore our code is all integrations, and no units. So when we go to test it, of course the classic testing pyramid is unachievable.
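As an illustration of the dependency-inversion point, here is a hedged Python sketch (hypothetical names, a dict standing in for the real database): instead of passing DB handles around like candy, the domain logic depends on a small port, and an adapter supplies the real storage.

```python
from typing import Protocol

# Port: the interface the domain logic depends on, instead of a raw DB handle.
class AccountRepo(Protocol):
    def balance(self, account_id: str) -> int: ...
    def set_balance(self, account_id: str, cents: int) -> None: ...

# Domain logic depends only on the port, so it is unit-testable in isolation.
def withdraw(repo: AccountRepo, account_id: str, cents: int) -> bool:
    current = repo.balance(account_id)
    if cents > current:
        return False  # insufficient funds
    repo.set_balance(account_id, current - cents)
    return True

# In production an adapter would wrap the real database; in tests a dict suffices.
class InMemoryRepo:
    def __init__(self, balances):
        self._balances = dict(balances)

    def balance(self, account_id):
        return self._balances[account_id]

    def set_balance(self, account_id, cents):
        self._balances[account_id] = cents
```

With this shape, a new unit test for suspicious logic is a three-line affair rather than a fight with SQL fixtures.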
I have never seen a formal definition. Without that we cannot have any discussion.
To some a unit is a function. To some it is a module (generally someone else's module). To some it is an entire application. To some it is the entire computer in your embedded device. To some it is the entire device... Most people have no clue what we are talking about and don't care (and also should not care).
writeFile(fname,"Hello, World") is only one thing, but its behavior will depend on the state of the filesystem, so it's not a unit.
parseComplicatedObject(bytes) could be a unit (even if it calls out to many other sub-parsers - as long as they are also units).
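One way to make that definition concrete, as a Python sketch (hypothetical functions): a unit's output depends only on its arguments, while the file-reading wrapper depends on filesystem state and so falls outside the definition.

```python
# A "unit" in this sense: the result is a pure function of the input.
def parse_version(text: str) -> tuple:
    """Parse 'major.minor.patch' into a tuple of ints."""
    major, minor, patch = text.strip().split(".")
    return int(major), int(minor), int(patch)

# Not a unit by the same definition: the result also depends on
# filesystem state (file contents, permissions, existence, ...).
def read_version(path: str) -> tuple:
    with open(path) as f:
        return parse_version(f.read())
```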
One thing I see in a lot of companies is efforts to reduce test flakiness. Devs will attempt this work in src/test. But if the code is flaky, and you change the test from flaky to reliable, then you have just decreased the realism of your test. Like I mentioned with my comment above, you reduce test flakiness by doing the hard work in src/main. The src/test changes should be easy after that.
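A typical instance of "do the hard work in src/main", sketched in Python (hypothetical cache, assuming time-dependent flakiness): instead of tests sleeping real seconds, the production code accepts an injectable clock, so the test controls time deterministically without making the test less realistic.

```python
import time

class ExpiringCache:
    """TTL cache; the clock is injected so tests need not sleep."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = {}  # key -> (value, stored_at)

    def put(self, key, value):
        self._items[key] = (value, self._clock())

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self._clock() - stored_at > self._ttl:
            del self._items[key]  # expired
            return None
        return value
```

Production passes nothing and gets the real monotonic clock; the test passes a fake clock it can advance instantly, which is a src/main change, not a src/test workaround.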
Since then, significant progress in both technology and development practices has transformed testing in three key ways:
1) It’s now possible to run a wide range of tests on an application very quickly through its public interface, enabling a broader scope of testing without excessive time or resource constraints.
2) Improved test frameworks have brought down the cost and effort required to write robust, maintainable integration tests, offering accessible, scalable ways of validating the interplay between components.
3) The development of sophisticated debugging tools and enhanced observability platforms has made it much easier to identify the root causes of failures. This improvement reduces the reliance on narrowly focused unit tests.
These assertions are simply made, not argued or justified. Maybe they apply to the code they're writing? I don't think they apply to my code. It is very specific to the business we are in: "move fast and occasionally break things" is acceptable.
OTOH we have focused a lot on adding type safety to avoid many basic mistakes: exceptions to result types, no more implicit nulls, replace JS with Elm, replace Java with Kotlin, replace SQL-in-strings with jOOQ, and a culture of trying to write code that does not allow bad states to be expressed. This took precedence over writing extensive test suites.
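The exceptions-to-result-types move might look like this Python analogue of the Kotlin approach (hypothetical example): the return type forces every caller to handle the failure case, so a class of basic mistakes never survives type-checking.

```python
from dataclasses import dataclass
from typing import Union

# Result type instead of exceptions: failure is part of the signature.
@dataclass(frozen=True)
class Ok:
    value: int

@dataclass(frozen=True)
class Err:
    reason: str

ParseResult = Union[Ok, Err]

def parse_port(text: str) -> ParseResult:
    """Validate a TCP port; bad states return Err rather than raising."""
    if not text.isdigit():
        return Err(f"not a number: {text!r}")
    port = int(text)
    if not 1 <= port <= 65535:
        return Err(f"out of range: {port}")
    return Ok(port)
```

A caller who forgets the `Err` branch gets flagged by the type checker, which is exactly the "bad states cannot be expressed" culture, traded against some of the need for tests.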
> 2) Improved test frameworks have brought down the cost and effort required to write robust, maintainable integration tests, offering accessible, scalable ways of validating the interplay between components.
> 3) The development of sophisticated debugging tools and enhanced observability platforms has made it much easier to identify the root causes of failures. This improvement reduces the reliance on narrowly focused unit tests.
Citation needed on all of these. Where are the specific tools that make running all of these magically fast and reliable (read: not flakey) integration tests possible?
There is value in every level.
Let's keep the pyramid but rename the segments!
I like to design my applications so all slow components can be mocked by faster alternatives, and have the HTTP stack as thin as possible so I can basically call a function and assert the output, while the output closely resembles the final HTTP response, either rendering a template with a blob of data, or rendering the blob of data as JSON.
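That thin-HTTP-stack style can be sketched in Python like this (all names hypothetical): the handler is a plain function returning a status and a data blob, so tests just call it and assert on the output, while the real framework only serializes.

```python
import json

# Handler is a plain function: status code plus a data blob.
def get_user(user_id: int, fetch_user) -> tuple:
    user = fetch_user(user_id)
    if user is None:
        return 404, {"error": "user not found"}
    return 200, {"id": user_id, "name": user}

# The HTTP layer stays trivial: it only turns the blob into JSON.
def to_json_response(status: int, data: dict) -> tuple:
    return status, json.dumps(data)
```

Since the blob closely resembles the final response, asserting on it gives nearly the coverage of an HTTP test at the speed of a function call.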
If you're testing large, complex services that involve many different behaviours, you're still going to have a test pyramid. If you've implemented microservices, what you used to call an integration test has now become an e2e test in your new architecture. And you still don't want to have mostly that.
With all the AI-generated code being pushed, as a leader I wonder which is better: enforcing a ton of e2e tests so that no code gets past CI unless the solution is really well thought through in all its aspects, or does that just enable the AI to go even crazier and break all sorts of best practices merely to pass the tests?
AI capability drops sharply once the context gets too big. Iterating with an AI against an E2E means involving enough context that you're likely to run into problems with the AI's capability, but even if not it means that there's a lot more space for creativity before you get the signal that you've gone too far.
It's too easy to forget that you've omitted a crucial file from the context and instead be iterating on increasingly desperate prompts--it's the kind of mistake you want to catch and correct early, so again: as small boxes.
For these reasons, I think lots of E2Es is the wrong play, because it creates big boxes.
If I were leading a team of AI-using devs I'd be looking for ways to create higher fidelity constraints which can then form part of that box, or which interrupt the cases where we get lazy and let it be unconstrained by any requirement except that nobody has screamed about it yet.
This would be stuff like having teams communicate their needs to one another by creating ignored failing tests in the other team's repo such that they can be un-ignored once they pass. Or ensuring that the designs aren't just user focused but include the kind of things that end up getting added directly to the context without being re-interpreted by the dev (e.g. files defining interfaces, or terse behavioral descriptions), such that devs on different teams are including the same design artifacts in the context while they build adjacent components.
It's like AI generated code is a gas that will fill the available space, so it's the boundaries that require human focus. For this reason I disagree with the article. E2Es and ITs are too slow/expensive to run often enough to be useful constraints for AI. Small tests are way better when you can get away with them.
Integration tests are better replaced by something like contract testing IMO to still retain the test parallelism.
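A toy sketch of the contract-testing idea in Python (hypothetical, not a real Pact setup): consumer and provider are each verified against a shared contract in their own suites, so the two sides never need to run in the same process or pipeline.

```python
# Shared contract: the one artifact both suites agree on.
CONTRACT = {
    "request": {"method": "GET", "path": "/users/1"},
    "response": {"status": 200, "body": {"id": 1, "name": "ada"}},
}

def consumer_parse(response_body: dict) -> str:
    # Consumer side: relies only on fields the contract promises.
    return response_body["name"]

def provider_handle(method: str, path: str) -> tuple:
    # Stand-in for the provider's real request handler.
    if method == "GET" and path == "/users/1":
        return 200, {"id": 1, "name": "ada"}
    return 404, {}

def verify_provider(contract) -> bool:
    # Provider side: replay the contract's request, compare the response.
    req, expected = contract["request"], contract["response"]
    status, body = provider_handle(req["method"], req["path"])
    return status == expected["status"] and body == expected["body"]
```

Both halves run in parallel and independently; the integration risk is concentrated in keeping the contract honest rather than in a slow shared environment.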
They have a specific definition of "E2E" (apparently the UI is not considered part of it) and it works only on the Docker platform (so not for e.g. Windows binaries). It can be good, but it does not speak about the testing pyramid in general.
- Integration tests are expensive to run and take time to write; therefore it is important to maximize their value. The ultimate integration test is an end-to-end test, because it maximizes the scope of what is under test and the potential for weird feature interactions to trigger exactly the kind of failures you want to find with such a test.
- Unit tests are orders of magnitude cheaper to run; so have lots of them but make sure they are easy to maintain and simple so they minimize time spent on them.
- Anything in between is a compromise between shooting for realism vs. execution speed. Still expensive to run and maintain but it just does not deliver a lot of value.
- Test coverage becomes exponentially harder with the size of the unit you are testing. Test coverage for integration tests is a meaningless notion. With end to end integration tests you shoot for realism, not coverage. They should cover things that users of your system would use in ways that they would use them.
- Mocking and faking is needed to unit test code that violates the SOLID principles and is otherwise hard to test. So they have the development overhead of an integration test but they deliver the value of a unit test. This is not ideal. It's better to unit test code that is very testable and cover the rest with integration tests that deliver more valuable insight. Lots of very complex unit tests are hard to develop and limited in value.
I just removed the one remaining test that used mockk in my Kotlin code base. I have hundreds of API integration tests. And lots of simple unit tests. I focus my unit tests on algorithms, parameters, and those sort of things. My integration tests ensure the system does what it needs to.
I run integration tests concurrently so they complete quickly. This increases their value because it proves the code still works if there's more than one user in the system.
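The "concurrency as a free correctness check" point can be illustrated with a small Python sketch (a hypothetical counter standing in for the real system under test): running the same scenario from several workers at once fails loudly if the code isn't safe for more than one user.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

class Counter:
    """Stand-in for shared server state exercised by the tests."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def increment(self):
        with self._lock:  # without this lock, the concurrent run loses updates
            self.value += 1

def scenario(counter, times):
    # One "integration test" worth of work against the shared state.
    for _ in range(times):
        counter.increment()

counter = Counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    for _ in range(8):
        pool.submit(scenario, counter, 1000)
```

Eight workers times a thousand increments must total exactly 8000; if the lock is removed, the concurrent suite catches the race that a sequential suite never would.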
Having said that, I think this mostly means that people find the term "unit" in "unit tests" ambiguous and they're just cargo culting it to mean "a single class" or whatever. That's the fundamental flaw that should be addressed. Basically that's what the article is saying as well, I guess, by implying that the API is the contract you should be testing, etc. But that is essentially just a long winded way of saying "The API is the unit".
https://www.youtube.com/watch?v=k-t4OiEHCiA (at 5:25)
... which seems appropriate.