Running three hours of Ruby tests in under three minutes

(stripe.com)

287 points · nelhage · 10y ago · 106 comments

dankohn110y ago

We're not nearly at Stripe's scale, but my startup (Spreemo) has achieved pretty amazing parallelism using the commercial SaaS CircleCI. We have 3907 expects across 372 RSpec and Cucumber files. Our tests complete in ~14 minutes when run across 8 containers.

One of the great strengths of CircleCI is that they auto-discover our test types, calculate how long each file takes to run, and then auto-allocate the files in future runs to try to equalize the run times across containers. The only work we had to do was split up our slowest test file when we found that it was taking longer to complete than a combination of files on the other machines.
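
A minimal sketch of timing-based allocation, assuming a simple greedy "slowest file first, least-loaded container next" heuristic. This is not CircleCI's actual algorithm, and the file names and timings are made up for illustration.

```python
import heapq

def allocate_files(durations, containers):
    """Greedy split: hand each file to the currently least-loaded container."""
    # Heap of (total_seconds, container_index); the lightest container is on top.
    heap = [(0.0, i) for i in range(containers)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(containers)]
    # Place the slowest files first so the small ones can even things out.
    for path, seconds in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(heap)
        buckets[idx].append(path)
        heapq.heappush(heap, (load + seconds, idx))
    return buckets

# Hypothetical timings (seconds) recorded from a previous run.
timings = {"a_spec.rb": 90, "b_spec.rb": 60, "c_spec.rb": 45, "d_spec.rb": 30, "e_spec.rb": 15}
print(allocate_files(timings, 2))
```

With these numbers both containers end up with 120 seconds of work. If a single file took longer than the total divided by the container count, no allocation could balance the run, which is why splitting the slowest file helps.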

I also like that I can run pronto https://github.com/mmozuras/pronto to post Rubocop, Rails Best Practices, and Brakeman errors as comments on GitHub.

andreasklinger10y ago

i like pronto's approach

we simply added the linters/code analysis to the CI itself

reasoning: we try to have as little as possible "code style" discussion in PRs

clayallsopp10y ago

I'm super curious how Stripe approaches end-to-end testing (like Selenium/browser testing, but maybe something more bespoke too)

My understanding is that they have a large external dependency (my term: "the money system"), and running integration tests against it might be tricky or even undependable. Do they have a mock banking infrastructure they integrate against?

nelhageOP10y ago

This is a great question, and it's definitely a problem we have.

We don't have a single answer we use for every system we work on, but we employ a few common patterns, ranging from just keeping hard-coded strings containing the expected output, up to and including implementing our own fake versions of external infrastructure. We have, for example, our own faked ISO-8583 [1] authorization service, which some of our tests run against to get a degree of end-to-end testing.

Back-testing is also incredibly valuable: We have repositories of every conversation or transaction we've ever exchanged with the banking networks, and when making changes to parsers or interpreters, we can compare their output against the old version on all of that historical data.

[1] https://en.wikipedia.org/wiki/ISO_8583
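
A rough sketch of the back-testing idea, assuming parsers are plain functions and the history is a list of raw messages. The message format and both parser versions here are hypothetical, not Stripe's.

```python
def backtest(old_parser, new_parser, recorded_messages):
    """Return the recorded inputs on which the two parser versions disagree."""
    mismatches = []
    for raw in recorded_messages:
        expected = old_parser(raw)
        actual = new_parser(raw)
        if expected != actual:
            mismatches.append((raw, expected, actual))
    return mismatches

# Hypothetical parsers for a made-up "key=value;key=value" message format.
def parse_v1(raw):
    return dict(field.split("=") for field in raw.split(";") if field)

def parse_v2(raw):
    # The rewrite also strips whitespace, which changes behavior on some inputs.
    pairs = (field.split("=") for field in raw.split(";") if field)
    return {key.strip(): value.strip() for key, value in pairs}

history = ["amount=100;currency=usd", "amount= 250;currency=eur"]
for raw, old, new in backtest(parse_v1, parse_v2, history):
    print(f"{raw!r}: old={old} new={new}")
```

Each mismatch is either a regression or an intended behavior change, and a human still has to decide which.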

heywire10y ago

>Back-testing is also incredibly valuable: We have repositories of every conversation or transaction we've ever exchanged with the banking networks, and when making changes to parsers or interpreters, we can compare their output against the old version on all of that historical data.

Are you referring to test data or actual live transaction data? The latter would seem like a huge liability and target for hackers.

kevan10y ago

> ranging from just keeping hard-coded strings containing the expected output, up to and including implementing our own fake versions of external infrastructure.

This sounds very familiar, we rely on external credit systems pretty heavily. We started by mocking service responses and including the response XML in our unit tests. Now we have a service simulator that returns expected values and has record/playback capability. It's not ideal and responses get outdated occasionally but we haven't found a more elegant way to handle it yet.
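
A small sketch of the record/playback idea, assuming the external service can be wrapped as a single function from request string to response string. The class and the fake credit service are hypothetical names, not kevan's simulator.

```python
import json

class RecordPlaybackClient:
    """Wraps a real client; records each response once, replays it afterwards."""

    def __init__(self, real_call, cassette=None):
        self.real_call = real_call            # function(request: str) -> str
        self.cassette = dict(cassette or {})  # request -> recorded response

    def call(self, request):
        if request not in self.cassette:
            # Record mode: hit the real service once and remember the answer.
            self.cassette[request] = self.real_call(request)
        return self.cassette[request]

    def dump(self):
        # Serialize recordings so they can be stored next to the tests.
        return json.dumps(self.cassette, sort_keys=True)

calls = []

def fake_credit_service(request):
    # Stand-in for the slow external system.
    calls.append(request)
    return "<score>700</score>"

client = RecordPlaybackClient(fake_credit_service)
client.call("applicant=42")
client.call("applicant=42")  # replayed; the service is not called again
print(len(calls), client.dump())
```

Recordings going stale, as mentioned above, is the cost of this design: nothing here notices when the real service changes its responses.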

ngoede10y ago

What percentage of the tests are full system, integration, and unit tests?

sanderjd10y ago

I'm very curious about that as well. I worked on a big project that had a (perhaps analogous) large external dependency on networks of embedded devices in homes and businesses, and integration testing it was …difficult. I'd love to hear how Stripe solves that problem.

crdoconnor10y ago

Could you not create mock embedded devices?

andreasklinger10y ago

Not a Stripe member but i would assume that anything that involves intense security auditing, PCI, etc would be separate codebases that rarely change.

(eg cc handling could anon the CCs in a service before they reach the main app)

The integration with 3rd parties is a separate issue that exists no matter if it is banks or not - i would guess they abstracted that as well as services or libs and decide case by case.

com2kid10y ago

I am tired of this technology having to be re-invented time and time again.

The best I ever saw was an internal tool at Microsoft. It could run tests on devices (Windows Mobile phones, but it really didn't care), had a nice reservation and pool system and a nice USB-->Ethernet-->USB system that let you route any device to any of the test benches.

This was great because it was a heterogeneous pool of devices, with different sets of tests that executed appropriately.

The test recovery was the best I've ever seen. The back end was wonky as anything, every single function returned a BOOL indicating if it had run correctly or not, every function call was wrapped in an IF statement. That was silly, but the end result was that every layer of the app could be restarted independently, and after so many failures either a device would be auto removed from the pool and the tests rerun on another device, or a host machine could be pulled out, and the test package sent down to another host machine.

The nice part was the simplicity of this. All similar tools I've used since have involved really stupid setup and configuration steps with some sort of crappy UI that was hard to use en-masse.

In comparison, this test system just took a path to a set of source files on a machine, the compilation and execution command line, and then if the program returned 0 the test was marked as pass, if it returned anything else it was marked as fail.
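
The exit-code convention is easy to sketch. The version below assumes the test is just a command line; the timeout is an added assumption, and the two one-liner "test programs" are stand-ins for real compiled tests.

```python
import subprocess
import sys

def run_test(command, timeout=60):
    """Exit status 0 means pass; any other status (or a hang) means fail."""
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "fail"
    return "pass" if result.returncode == 0 else "fail"

# Stand-in "test programs": tiny Python one-liners with chosen exit codes.
print(run_test([sys.executable, "-c", "import sys; sys.exit(0)"]))
print(run_test([sys.executable, "-c", "import sys; sys.exit(3)"]))
```

Because the contract is only "run this, look at the status", the runner does not care what language or framework the test was written in.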

All of this (except for copying the source files over) was done through an AJAX Web UI back in 2006 or so.

Everything I've used since then has either been watching people poorly reimplementing this system (frequently with not as good error recovery) or just downright inferior tools.

(For reference a full test pass was ~3 million tests over about 2 days, and there were opportunities for improvement, network bandwidth alone was a huge bottleneck)

All that said, the test system in the link sounds pretty sweet.

speedkills10y ago

I agree. We already have projects like http://test-load-balancer.github.io but I have a feeling I will see five more posts on HN in the next year about re-inventing this wheel and yet not see a single contribution to existing solutions like tlb.

It must be a little depressing to build a really useful product you know many people need, give it away only hoping people will use it and be happy, then find out everyone would rather build their own.

But we do like to build things, it is in our nature. Plus, what looks better on your resume: 1) I migrated my team's test suite to using test load balancer in two days, saving hours every test run. 2) I contributed improvements to the open source test load balancer project. 3) I designed and implemented my own distributed test load balancing tool!

com2kid10y ago

> 3) I designed and implemented my own distributed test load balancing tool!

This, so many times over.

It doesn't help that when interviewing, I kinda-sorta want to know that the people I hire are capable of understanding systems from the ground up. The best way to demonstrate that is to go and build a system from the ground up...

TLB looks cool, nice to see that such a tool exists at least within one eco-system!

hueving10y ago

>I am tired of this technology having to be re-invented time and time again.

So did Microsoft open source this? If not, quit complaining. Just because you saw a massive software engineering company doing something better doesn't mean everyone else who doesn't have access to it sucks for not reaching parity.

com2kid10y ago

> So did Microsoft open source this? If not, quit complaining.

Microsoft alone has re-invented this at least a half dozen times. At least one version of it, more limited in some regards, more powerful in others, is sold as part of Visual Studio.

Of course the VS one is both much more "enterprisey" and less flexible in numerous ways.

(That said it does have nice charts.)

The industry as a whole though keeps remaking test frameworks again and again.

I admit that a custom made framework to solve a team's problems is going to be easier to use than an infinitely configurable framework that is designed to solve everyone's problems, Microsoft used to have that tool as well, and it was widely disliked for how little it did out of the box and how much work it required to get it up and running. (Also in its early days it had serious scaling problems, and its configuration + use required a lot of mental gymnastics)

I'm just annoyed that we haven't found a nice simple compromise solution, or at least created some fundamental building blocks.

On top of that so few testing systems pay attention to the user interface, if it takes me 5 minutes to add a single test, damned if I am going to be adding 50 tests.

Lots of test systems go with simple annotations, but then the instant I want something more powerful I am boned. MSTest was restricted like this for years, finally in VS2013 they made it much more extensible, but there is minimal C++ support. Other ecosystems are not a lot better, developers are really good at creating test systems that run on their local dev box, zippity do-da.

Then again I have spent most of my developer life in the devices area, which means test results need to at the very least get sent across the wire to a host machine of some type (depending on the intelligence of one's device under test).

I want my devs to be able to annotate a source file, have IPC code generated on both sides (device, and PC side library), and then have the test auto added to my test management system.

Bah humbug, I think I'll just write a parsing system with Perl and RegExs.

The manually adding tests to the test system part still sucks though. (There is an API for it, but again, mental gymnastics create a barrier to entry).

ryanong10y ago

If you want to implement this locally without using minitest, check out test-queue by Aman Gupta on GitHub.

https://github.com/tmm1/test-queue

One thing that really sped up our test suite was by creating an NGINX proxy that served up all the static files instead of making rails do it. This saved us about 10 minutes off our 30 minute tests.

tmm110y ago

test-queue supports minitest too, and follows the same basic design outlined in this article: a central queue sorted by slowest tests first, with test runners forked off either locally or on other machines to consume off the queue.
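
A sketch of that design, with two loud assumptions: it uses threads in one process for brevity, whereas test-queue forks worker processes, and `run_one` is a hypothetical callback that runs a single test and returns whether it passed.

```python
import queue
import threading

def run_from_queue(tests, run_one, workers=4):
    """tests: {name: expected_seconds}; run_one: callable(name) -> bool."""
    work = queue.Queue()
    # Enqueue slowest tests first so no long test is left for the very end.
    for name in sorted(tests, key=tests.get, reverse=True):
        work.put(name)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                name = work.get_nowait()  # pull the next test when idle
            except queue.Empty:
                return
            passed = run_one(name)
            with lock:
                results[name] = passed

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results

# Hypothetical suite: expected durations in seconds, and a fake runner.
suite = {"slow_test": 3, "quick_test": 1, "medium_test": 2}
print(run_from_queue(suite, lambda name: name != "quick_test", workers=2))
```

Since workers pull work only when they are idle, a slow machine simply takes fewer tests and nothing has to be balanced up front.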

We use test-queue in a dozen different projects at GitHub and most CI runs finish in ~30s. The largest suite is for the github/github rails app which runs 30 minutes of tests in 90s.

ryanong10y ago

Thanks for the awesome work you have done on test-queue. We have really appreciated it at SchoolKeep.

beilabs10y ago

Really interested in this approach, can you point somewhere that talks further about this nginx proxy strategy?

bbuchalter10y ago

Could you share a bit more about this nginx proxy setup for static assets? Basically mimicking production env?

bittersweet10y ago

Seconded, I have not thought about this before or come across this idea at all and certainly sounds interesting!

sytse10y ago

Very cool stuff. For reference at GitLab we use a less impressive and simpler solution. We split the jobs off in https://gitlab.com/gitlab-org/gitlab-ce/blob/master/.gitlab-... These jobs will be done by separate runners, this brought our time down from 1+ hours to 23 minutes https://ci.gitlab.com/projects/1/refs/respect_filters/commit...

teacup5010y ago

How much cheaper (in time, code, effort, complexity) would it be if:

- Their language runtime supported thread-based concurrency, which would drastically reduce implementation complexity and actual per-task overhead, thus improving machine usage efficiency AND eliminating the concerns about managing process trees that introduces a requirement for things like Docker.

- Their language runtime was AOT or JIT compiled, simply making everything faster to a degree that test execution could be reasonably performed on one (potentially large) machine.

- They used a language with a decent type system, significantly reducing the number of tests that had to be both written and run?

100k10y ago

Only if the early engineers could write in a language that they were as productive in as Ruby. Getting Stripe launched was the key thing Stripe needed to accomplish. Everything else follows from that.

teacup5010y ago

There are certainly enough languages to choose from.

pekk10y ago

Thread-based concurrency based on shared mutable state doesn't reduce complexity.

teacup5010y ago

Thread-based concurrency doesn't require shared mutable state at the application implementation level.

someone7x10y ago

Is it fair to assume that time, code, effort, and complexity would be some degree of cheaper? May very well be more expensive. Language choice isn't a silver bullet.

ryanong10y ago

Would it be cheap enough to encourage a re-write, re-engineer the server stack, and re-train employees? I doubt it.

I think it is an interesting question but a bad one most of the time unless you take into account all the other external factors that don't include the language itself.

amalag10y ago

JRuby will do the first and part of the second

brianwawok10y ago

You mean something like Scala?

It would run the tests 10-100x faster per test, and also require fewer tests (due to having a real type system).

I do giggle a little when I see huge engineering hurdles people have to overcome because of the language that was chosen. Building an app that is going to scale to millions of users? May not want to use Ruby...

(Nothing against Stripe, I am a paying customer - love the product. I do suspect it would be easier to engineer on a better platform than RoR though).

yjgyhj10y ago

One thing I've noticed since coding with immutable data structures & functions (rather than mutable OOP programs) is how tests run really fast, and are easy to run in parallel.

I/O only happens in a few functions, and most other code just takes data in -> transforms -> returns data out. This means I only have few functions that need to 'wait' on something outside of itself to finish, and much lesser delays in the code.

This is coding in Clojure for me, but you can do that in any language that has functions (preferably with efficient persistent data structures, like the tree-based PersistentVector in Clojure).

schneems10y ago

Immutable data structures give you easy parallelism, however there's a hidden runtime cost: you have to allocate way more objects. For example, I was able to save a ton of object allocations here: https://github.com/mime-types/ruby-mime-types/pull/93 mostly by mutating. For tasks that are not easily parallelizable it may be slower to use immutable structures.

I mostly only ever hear about how fast FP languages are, so maybe they use some tricks to avoid allocations somehow. I would be interested in hearing more about it.

yjgyhj10y ago

Yes, mutating something at one place in memory is more efficient, because you don't need to allocate new memory.

I don't know all too much about other functional languages, as I learned perl -> ruby -> javascript -> little bit C & Java & Go -> now doing Clojure. But I find Clojure's collection data structures interesting. The vector type (a collection like a list or array) looks immutable, but under the hood it is a tree. When you append to a vector, you seem to get a new vector returned.

In reality, you added the new element as a node in a tree, then just modified pointers so that the new and old versions share almost all of the pointers & allocations. With simple arrays or lists, you would allocate every element anew.

Idk if I can properly explain. Found this blog post very interesting and easier to follow: http://hypirion.com/musings/understanding-persistent-vector-...
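
The structural sharing described above can be sketched in a few lines. This is a toy, append-only version with 4 children per node (Clojure uses 32) built from Python tuples; `PVec` and `_insert` are made-up names, and it leaves out everything else the real PersistentVector does.

```python
BITS = 2
WIDTH = 1 << BITS  # 4 slots per node in this toy version

class PVec:
    """Append-only persistent vector: old versions stay valid after append."""

    def __init__(self, count=0, shift=0, root=()):
        self.count, self.shift, self.root = count, shift, root

    def get(self, i):
        node = self.root
        # Walk down from the root, using BITS bits of the index per level.
        for level in range(self.shift, -1, -BITS):
            node = node[(i >> level) & (WIDTH - 1)]
        return node

    def append(self, value):
        i = self.count
        if i == WIDTH << self.shift:
            # Tree is full: the old root becomes the first child of a new root.
            root, shift = (self.root,), self.shift + BITS
        else:
            root, shift = self.root, self.shift
        return PVec(i + 1, shift, _insert(root, shift, i, value))

def _insert(node, level, i, value):
    if level == 0:
        return node + (value,)  # new leaf tuple with one more element
    idx = (i >> level) & (WIDTH - 1)
    child = node[idx] if idx < len(node) else ()
    # Copy only the nodes on the path to the new element; siblings are reused.
    return node[:idx] + (_insert(child, level - BITS, i, value),)

v = PVec()
for n in range(5):
    v = v.append(n)
w = v.append(99)
print(v.count, w.count, w.root[0] is v.root[0])
```

Appending to `v` leaves `v` untouched, and the full first leaf is the same object in both versions, so only the short path to the new element is allocated.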

Personally, most things in business I find easily parallelisable. You mostly decouple the I/O parts of something with the logic parts of it. But yeah I still have much to learn. Thanks for the link! Interesting :)

JBiserkov10y ago

You are of course correct. I recommend you check out these 5 posts on persistent vectors in Clojure http://hypirion.com/musings/understanding-persistent-vector-... "Spoiler from part 4":

>Transients are an optimisation on persistent data structures, which decreases the amount of memory allocations needed. This makes a huge difference for performance critical code.

http://hypirion.com/musings/understanding-clojure-transients

birdsbolt10y ago

Allocations don't have to be expensive if your GC is smart. Smart as C++ destructors positioning :D

Ono-Sendai10y ago

You can allocate stuff on the stack.

jtchang10y ago

Love this. Sometimes testing can be a huge pain in the ass. I know more than one project I work on where getting them to run is a lot of effort in itself.

There is something to be said about code quality and having tests run in under a few seconds. The ideal situation is when you can have a barrage of tests run as fast as you are making changes to code. If we ever got to the point of instant feedback that didn't suck I'd think we'd change a lot about how we think about tests.

sigil10y ago

We opted for an alternate, dynamic approach, which allocates work in real-time using a work queue. We manage all coordination between workers using an nsqd instance... In order to get maximum parallel performance out of our build servers, we run tests in separate processes, allowing each process to make maximum use of the machine's CPU and I/O capability. (We run builds on Amazon's c4.8xlarge instances, which give us 36 cores each.)

This made me long for a unit test framework as simple as:

    $ make -j36 test

Where you've got something like the following:

    $ find tests/

      tests/bin/A
      tests/bin/B
      ...
      tests/input/A
      tests/input/B
      ...
      tests/expected/A
      tests/expected/B
      ...
      tests/output/

    $ cat Makefile

      test : $(shell find tests/bin -type f | sed -e 's@/bin/@/output/@')
      
      tests/output/% : tests/bin/% tests/input/% tests/expected/%
              @ printf "testing [%s] ... " $@
              @ sh -c 'exec $$0 < $$1' $^ > $@
              @ # ...runs tests/bin/% < tests/input/% > tests/output/%
              @ sh -c 'exec cmp -s $$3 $$0' $@ $^ && echo pass || echo fail
              @ # ...runs cmp -s tests/expected/% tests/output/%
     
      clean :
              rm -f tests/output/*

You get test parallelism and efficient use of compute resources "for free" (well, from make -j, because it already has a job queue implementation internally). This setup closely resembles the "rts" unit test approach you'll find in a number of djb-derivative projects.

The defining obstacle for Stripe seems like Ruby interpreter startup time though. I'm not sure how to elegantly handle preforked execution in a Makefile-based approach. Drop me a line if you have ideas or have tackled this in the past, I've got a couple projects stalled out on it.

atonse10y ago

On a previous project, I had built a shell script that essentially created n mysql databases and just distributed the test files under n rails processes.

We were able to run tests that took an hour in about 3 minutes. It was good enough for us. Nothing sophisticated for evenly balancing the test files, but it was pretty good for 1-2 days of work.
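
For contrast with timing-based balancing, a static split like this one needs no timing data at all. A sketch, assuming round-robin dealing of sorted file names; the database naming scheme is hypothetical.

```python
def shard_round_robin(test_files, n):
    """Deal sorted files out to n workers like cards."""
    files = sorted(test_files)
    return [files[i::n] for i in range(n)]

files = [f"spec/models/m{i}_spec.rb" for i in range(7)]
for worker, shard in enumerate(shard_round_robin(files, 3)):
    # Hypothetical naming scheme: one database per worker process.
    print(f"app_test_{worker}: {len(shard)} files")
```

The shards have nearly equal file counts, not equal run times, which matches the "nothing sophisticated" caveat above.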

vkjv10y ago

"This second round of forking provides a layer of isolation between tests: If a test makes changes to global state, running the test inside a throwaway process will clean everything up once that process exits."

But, then how do you catch bugs where shared mutable state is not compatible with multiple changes?

praxulus10y ago

You write tests specifically for testing multiple changes. You shouldn't be testing changes to global state by seeing how multiple supposedly independent tests interact.

ambicapter10y ago

Is there any value in designing tests in such a way that they test multiple things at once while still being able to isolate which specific thing is responsible for breaking? I'm thinking of something like JMP.

arturhoo10y ago

Congratulations on what looks like a very challenging task. I'm assuming a part of those tests hit a database. How have you dealt with it? I assume that a single instance, even on a powerful bare server could be a road blocker in this situation. A few insights on the Docker/Containerization part of it would also be nice!

nelhageOP10y ago

Our test-running infrastructure spins up a pool of database instances on each worker machine, one for each worker process. The test spinup and teardown code handles schema management, hooking into our DB access layer to create and clean up database tables only if they're used by a given test.
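
One way to picture "create tables only if they're used" is a lazy schema helper. The sketch below uses an in-memory SQLite database purely for illustration; the `LazySchema` class and the table definitions are hypothetical and say nothing about how Stripe's DB access layer actually hooks in.

```python
import sqlite3

class LazySchema:
    """Creates a table the first time a test touches it; drops it on teardown."""

    def __init__(self, conn, ddl):
        self.conn = conn
        self.ddl = ddl        # table name -> CREATE TABLE statement
        self.created = set()

    def table(self, name):
        if name not in self.created:
            self.conn.execute(self.ddl[name])
            self.created.add(name)
        return name

    def teardown(self):
        # Only the tables this test actually used need cleaning up.
        for name in self.created:
            self.conn.execute(f"DROP TABLE {name}")
        self.created.clear()

# Hypothetical schema; each worker process would open its own database.
DDL = {
    "charges": "CREATE TABLE charges (id INTEGER PRIMARY KEY, amount INTEGER)",
    "customers": "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)",
}
conn = sqlite3.connect(":memory:")
schema = LazySchema(conn, DDL)
conn.execute(f"INSERT INTO {schema.table('charges')} (amount) VALUES (500)")
print(sorted(schema.created))
```

A test that never touches `customers` never pays for creating or dropping it.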

Ono-Sendai10y ago

This is an interesting and possibly overlooked problem with using slow languages like Ruby - your unit tests take forever to run. (unless you spend a lot of engineering effort on making them run faster, in which case they may run somewhat acceptably fast)

Aqua_Geek10y ago

This isn't just a problem with Ruby. Our ObjC test suite for a project I work on takes about 10 min to run, too.

Ono-Sendai10y ago

Our main product has approximately 5400 unit tests, over ~150 files. The test suite runs on a single computer in about 14s. This is one of the advantages of using a fast language (C++) with multithreading :)

raverbashing10y ago

I guess a lot of problems come from the stupidly brain dead way people usually write tests (because it's the "recommended TDD way")

Things like using the same setup function for every test and setting up/tearing down for every test regardless of dependencies

Also tests like

    def test1():
      do_a() 
      check_condition_X()

then

    def test2():
      do_a() 
      check_condition_Y()

    def test1():
      do_a() 
      check_condition_X()

    def test2():
      do_a()
      do_b()
      check_condition_Y()

When it could have been consolidated into 1 test

Then people wonder why it takes so much time?

Also helpful is if you can skip database setup for tests that don't need it

aianus10y ago

The time it saves me when I see 'test2' failed instead of 'test_enormous:137' failed is worth more than the marginal computation required.

These are embarrassingly parallel problems, we just need better tools to fully saturate every core on every node in the test cluster.

hinkley10y ago

My last project had a mean run time of <9ms per test. We were not at all worried about parallelization. Nobody even mentioned it until we hit 1100 (eleven hundred) tests, and we ended up optimizing the build phase to reduce the code/build/test cycle time instead.

raverbashing10y ago

Though I certainly don't advocate a test that goes to 137 lines, I think the point of having to guess what the test is doing only by the name/messages is moot, you'll end up checking the test source code to see what it is doing exactly

hinkley10y ago

mocha and jasmine (in the node/javascript space) support nested setup and teardown methods and it's been really challenging for me to go back to using other frameworks, languages.

Not only does the nesting help limit the amount of setup and teardown you do, but when broad-reaching functional changes hit you in version 2, 3, it's so much easier to reorganize your tests to get the pre- and post-conditions right when they are already grouped that way.

The sad thing is that it takes a few release cycles before you feel any difference at all, and a couple more before you're absolutely sure that there are qualitative differences between the conventions. So it seems like a pretty arbitrary selection process instead of an obvious choice.

twerquie10y ago

I'm not sure if you're talking about the pain of going back to testing ruby / rails after using mocha / node, but I feel that specific pain, especially on projects with old-school Rails purists who insist on Test::Unit style. Switching to rspec gives you nested describe blocks with shared setup and teardown steps, as nice as mocha. Minitest has this BDD style built in too, but somehow the way Rails ties it in makes it difficult or impossible to take advantage of.

falsedan10y ago

Oh hey, we have the same sort of system here. It's 60,000 Python tests which take ~28 hours if run serially, but we keep it around 30-40 minutes. We wrote a UI & scheduler & artifact distribution system (which we're probably going to replace with S3). We run selenium & unit tests as well as the integration tests.

We've noticed that starting and stopping a ton of docker containers in rapid succession really hoses dockerd, also that Jenkins' API is a lot slower than we expected for mostly-read-only operations.

Have you considered mesos?

badmadrad10y ago

Have you considered another containerization solution like LXD. I feel like testing like this fits the "container hyper-visor" use case and this is what LXD is designed to do.

falsedan10y ago

We tried docker, then had to drop back to running the tests outside of a container (some old technical decisions in the project under test made it hard to run in a container). It's been improved since then, and we're close to running in containers again.

Each executor gets a non-shared prod-like environment thanks to a handful of docker containers. The same setup is used for dev, so switching the testing environment to LXC would mean switching devs as well.

akoumjian10y ago

Anything significantly different in your Python implementation?

falsedan10y ago

Hard to say anything more than what I posted without more details from Stripe.

lawrencewu10y ago

hi frei

falsedan10y ago

Other Dan…

cthyon10y ago

Not sure if this has already been answered, but would Stripe's methods only work with unit tests where tests are not dependent on each other?

How would one go about building a similar distributed testing setup for end-to-end tests where a sequence of tests have to be run in particular order. Finding the optimal ordering / distribution of tests between workloads would certainly be more complicated. Maybe they could be calculated with directed graph algorithms?
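
The directed-graph idea can be sketched with a topological sort that groups tests into "waves". The test names and dependencies below are hypothetical, and the sketch ignores run times, so it answers ordering but not optimal distribution. It needs Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical end-to-end steps: each test maps to the tests it depends on.
depends_on = {
    "create_account": set(),
    "add_card": {"create_account"},
    "charge_card": {"add_card"},
    "refund": {"charge_card"},
    "update_email": {"create_account"},
}

def schedule_in_waves(graph):
    """Group tests into waves; everything in one wave can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

print(schedule_in_waves(depends_on))
```

Here `add_card` and `update_email` land in the same wave because both only need `create_account` to have finished.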

matthewmacleod10y ago

> How would one go about building a similar distributed testing setup for end-to-end tests where a sequence of tests have to be run in particular order.

I reckon that would be solving the wrong problem. End-to-end tests should be independent of each other, and tests should never be dependent on the order in which they are run. End-to-end tests might be longer as a result, but managing the complexity of test dependencies will quickly cripple any system that uses this approach, I imagine.

givehimagun10y ago

I'd love to know if their integration tests use a database or reference external services of any sort.

We ended up making a compromise where each test can never expect another test to have run...but some tests expect certain test data to be present and in a known state. To handle that, every test cleans up the data of the previous run (Entity Framework has a nice change tracker where we can keep track of the unit of work before it is persisted). We wouldn't be able to parallelize everything though...we can only accept a single test to be active on the DB at a single point in time.

notduncansmith10y ago

I think those would not be considered "unit" tests. Often the definition of unit tests includes the ability to run those tests in any order. Any tests that have to be run in a particular order (i.e. "stateful" tests) should be considered a single test, and likely an integration test at that.

hinkley10y ago

needle scratching on record

They have an average of 9 assertions per test case. I think I may see part of their problem.

junto10y ago

I'm not sure if you are talking from a performance perspective or a conceptual perspective, but this provides a useful discussion on multiple assertions:

http://programmers.stackexchange.com/questions/7823/is-it-ok...

My 2 cents is that multiple assertions are legitimate, as long as they prove a singular assumption. Hence (as per the test on that page), this is a valid use of multiple assertions:

    [Test]
    public void ValueIsInRange()
    {
        int value = GetValueToTest();

        Assert.That(value, Is.GreaterThan(10), "value is too small");
        Assert.That(value, Is.LessThan(100), "value is too large");
    }

hinkley10y ago

[Edit] Thanks for the link. I have a whole bunch of comments in there to upvote. Guess my evening is planned :) [Edit]

I would also like to point out that ranges, like the one in your example, are almost always a symptom of an unstable test to begin with. I'd want to know why you're providing a range. Does the test blow up if another server is running tests at the same time? Let's fix that so the tests actually fail when there is a failure.

Now, there are lots of matchers that misbehave for corner case inputs, and an assert like "Make sure there's text, that it's a number, and that the number equals 10" may be necessary in order to prove that "10" appears, especially when you invert the test and say "Make sure the number isn't 10". And in this case I would say "write us a better matcher so that everyone can benefit from you figuring out how to do this".

This should go without saying, but I feel I have to repeat it every time there's an audience:

Green is not the end goal of testing. Red when there is an actual problem is the end goal of testing. Anything else is a very expensive way to consume resources.

hinkley10y ago

That's two asserts, and yes you are essentially testing the same concept, which I'm comfortable with as long as it's not a regular thing. People go through all sorts of gymnastics to convince themselves "it's one thing" and I find it exhausting, especially since fixing the problem is usually easier than the rationalizing.

If multiple asserts is a regular thing, you can either start breaking down your tests, or write a custom matcher. The custom matcher gives you better diagnostics when it breaks, so is probably the way to go.

    Assert.That(value, Is.Within(10, 100)); // matcher generates error message
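
The same custom-matcher idea, sketched as a plain Python helper. `assert_within` is a hypothetical name and is not part of any test framework; the bounds are exclusive, mirroring the two-assert version earlier in the thread.

```python
def assert_within(value, low, high):
    """Single assertion whose failure message says which bound was violated."""
    if not low < value < high:
        side = "too small" if value <= low else "too large"
        raise AssertionError(f"{value!r} is {side}: expected {low} < value < {high}")

assert_within(42, 10, 100)  # passes silently
try:
    assert_within(7, 10, 100)
except AssertionError as err:
    print(err)
```

The test body stays at one assertion, and the diagnostic detail lives in the matcher where every test can reuse it.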

chinathrow10y ago

Any reason why a financial infrastructure provider like Stripe would run CI tests on someone elses infrastructure? Isn't that a no go from a security point of view? Or - how do you trust the hosted CI company not to look at your code?

patio1110y ago

> how do you trust the hosted CI company not to look at your code?

Contracts, not firewalls, make the world go round.

inopinatus10y ago

Can't upvote this hard enough. It's a classic conceit of secops people that they are the only line of defence against unscrupulous behaviour. Systemic pathologies follow from this misbelief.

c.f. also: "Enterprise Architects", a group of people who think building IT systems qualifies you to redesign an entire organisation.

scrollaway10y ago

To be fair a contract does not guarantee the security framework of the company you are contracting, which means your code is only as safe as their weakest link.

brown9-210y ago

> how do you trust the hosted CI company not to look at your code

One can probably assume that they are not relying upon the secrecy of their code for security.

hawkice10y ago

There are other reasons to keep code proprietary than fearing a security failure in the event the code leaks.

jwatte10y ago

If their code is right, everyone in the world reading it wouldn't be a problem.

meesterdude10y ago

I wrote a rubygem called cloudspeq (http://github.com/meesterdude/cloudspeq) that distributes rails rspec specs across a bunch of digital ocean machines to reduce test execution time for slow test suites in dev.

one of the things I did that may be of interest is to break up spec files themselves to help reduce hotspots (or dedicate a machine to it specifically)

Not as complex or as robust as what they did, but it works!

grandalf10y ago

It's interesting to imagine, for a test suite that would take three hours, how much of the execution time is state management vs algorithm execution.

MrBra10y ago

No, they aren't going to switch to a pure functional language.

jwatte10y ago

http://engineering.imvu.com/2011/01/19/buildbot-and-intermit...

jwatte10y ago

Also, since 2011, we have added features and platforms under test, yet decreased test run time to < 4 minutes. So, yay progress!

rubiquity10y ago

Does this mean each process has its own database or are you able to use transactions with the selenium/capybara tests?

throwaway83297510y ago

Pull-based load balancing is a generally underrated technique.
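A minimal sketch of what "pull-based" means here, assuming nothing about Stripe's actual implementation: instead of a scheduler assigning jobs up front, each worker takes its next job from a shared queue whenever it becomes free, so slow jobs never leave other workers idle. The function name `run_pull_based` and the `(name, callable)` job shape are made up for the example, and threads are used only to keep it short.

```python
import queue
import threading


def run_pull_based(jobs, worker_count):
    """Run every job once; each worker pulls its next job when it is free."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = {}
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                name, func = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to pull, so this worker is done
            outcome = func()
            with lock:
                results[name] = (worker_id, outcome)

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(worker_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Which worker ends up running which job is not deterministic. That is the point: the split follows how long each job actually takes rather than a guess made in advance.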

smegel10y ago

> Initially, we experimented with using Ruby's threads instead of multiple processes

Why, to be cool? Tests are a classic case of things that should be run in isolation - you don't want tests interfering with each other or crashing the whole test suite. Using separate processes would have been the sensible approach to start with.
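To illustrate the isolation argument, here is a small sketch under simple assumptions: each "test" is a snippet of Python source run in a fresh interpreter, and a non-zero exit code counts as a failure. `run_isolated` is a hypothetical name, not a real test runner.

```python
import subprocess
import sys


def run_isolated(snippets):
    """Run each snippet in a fresh interpreter; report pass/fail by exit code."""
    outcomes = {}
    for name, code in snippets.items():
        # A crash or hard exit in the child cannot take down this runner.
        proc = subprocess.run([sys.executable, "-c", code], capture_output=True)
        outcomes[name] = "pass" if proc.returncode == 0 else "fail"
    return outcomes
```

A test that kills its own process just shows up as one failure, and any global state it touched disappears with the child process.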

edoloughlin10y ago

Was anyone else expecting the article to be about replacing Ruby with a compiled language?

werdnapk10y ago

No.


dankohn110y ago

I also like that I can run pronto https://github.com/mmozuras/pronto to post Rubocop, Rails Best Practices, and Brakeman errors as comments on Github.

andreasklinger10y ago

i like pronto's approach

we simply added the linters/code analysis to the CI itself

reasoning: we try to have as little as possible "code style" discussion in PRs

clayallsopp10y ago

I'm super curious how Stripe approaches end-to-end testing (like Selenium/browser testing, but maybe something more bespoke too)

nelhageOP10y ago

This is a great question, and it's definitely a problem we have.

[1] https://en.wikipedia.org/wiki/ISO_8583

heywire10y ago

Are you referring to test data or actual live transaction data? The latter would seem like a huge liability and target for hackers.

2 more replies

kevan10y ago

> ranging from just keeping hard-coded strings containing the expected output, up to and including implementing our own fake versions of external infrastructure.

ngoede10y ago

What percentage of the tests are full system, integration, and unit tests?

sanderjd10y ago

crdoconnor10y ago

Could you not create mock embedded devices?

1 more reply

andreasklinger10y ago

Not a Stripe member but i would assume that anything that involves intense security auditing, PCI, etc. would be separate codebases that rarely change.

(eg cc handling could anon the CCs in a service before they reach the main app)

The integration with 3rd parties is a separate issue that exists whether or not banks are involved - i would guess they abstracted that as well, as services or libs, and decide case by case.

com2kid10y ago

I am tired of this technology having to be re-invented time and time again.

This was great because it was a heterogeneous pool of devices, with different sets of tests that executed appropriately.

The nice part was the simplicity of this. All similar tools I've used since have involved really stupid setup and configuration steps with some sort of crappy UI that was hard to use en-masse.

All of this (except for copying the source files over) was done through an AJAX Web UI back in 2006 or so.

Everything I've used since then has either been people poorly reimplementing this system (frequently with worse error recovery) or just downright inferior tools.

(For reference a full test pass was ~3 million tests over about 2 days, and there were opportunities for improvement; network bandwidth alone was a huge bottleneck)

All that said, the test system in the link sounds pretty sweet.

speedkills10y ago

com2kid10y ago

> 3) I designed and implemented my own distributed test load balancing tool!

This, so many times over.

TLB looks cool, nice to see that such a tool exists at least within one eco-system!

hueving10y ago

>I am tired of this technology having to be re-invented time and time again.

com2kid10y ago

> So did Microsoft open source this? If not, quit complaining.

Microsoft alone has re-invented this at least a half dozen times. At least one version of it, more limited in some regards and more powerful in others, is sold as part of Visual Studio.

Of course the VS one is both much more "enterprisey" and less flexible in numerous ways.

(That said it does have nice charts.)

The industry as a whole though keeps remaking test frameworks again and again.

I'm just annoyed that we haven't found a nice simple compromise solution, or at least created some fundamental building blocks.

On top of that so few testing systems pay attention to the user interface, if it takes me 5 minutes to add a single test, damned if I am going to be adding 50 tests.

I want my devs to be able to annotate a source file, have IPC code generated on both sides (device, and PC side library), and then have the test auto added to my test management system.

Bah humbug, I think I'll just write a parsing system with Perl and RegExs.

The manually adding tests to the test system part still sucks though. (There is an API for it, but again, mental gymnastics create a barrier to entry).

ryanong10y ago

If you want to implement this locally without using mini-test, check out test-queue by Aman Gupta on GitHub.

https://github.com/tmm1/test-queue

One thing that really sped up our test suite was by creating an NGINX proxy that served up all the static files instead of making rails do it. This saved us about 10 minutes off our 30 minute tests.

tmm110y ago

We use test-queue in a dozen different projects at GitHub and most CI runs finish in ~30s. The largest suite is for the github/github rails app which runs 30 minutes of tests in 90s.

ryanong10y ago

Thanks for the awesome work you have done on test-queue. We have really appreciated it at SchoolKeep.

beilabs10y ago

Really interested in this approach, can you point somewhere that talks further about this nginx proxy strategy?

bbuchalter10y ago

Could you share a bit more about this nginx proxy setup for static assets? Basically mimicking production env?

bittersweet10y ago

Seconded, I have not thought about this before or come across this idea at all and certainly sounds interesting!

sytse10y ago

teacup5010y ago

How much cheaper (in time, code, effort, complexity) would it be if:

- Their language runtime was AOT or JIT compiled, simply making everything faster to a degree that test execution could be reasonably performed on one (potentially large) machine.

- They used a language with a decent type system, significantly reducing the number of tests that had to be both written and run?

100k10y ago

teacup5010y ago

There are certainly enough languages to choose from.

1 more reply

pekk10y ago

Thread-based concurrency based on shared mutable state doesn't reduce complexity.

teacup5010y ago

Thread-based concurrency doesn't require shared mutable state at the application implementation level.

someone7x10y ago

Is it fair to assume that time, code, effort, and complexity would be some degree of cheaper? May very well be more expensive. Language choice isn't a silver bullet.

ryanong10y ago

Would it be cheap enough to encourage a re-write, re-engineer the server stack, and re-train employees? I doubt it.

I think it is an interesting question but a bad one most of the time unless you take into account all the other external factors that don't include the language itself.

amalag10y ago

JRuby will do the first and part of the second

brianwawok10y ago

You mean something like Scala?

It would run the tests 10-100x faster per test, and also require fewer tests (due to having a real type system).

(Nothing against Stripe, I am a paying customer - love the product. I do suspect it would be easier to engineer on a better platform than RoR though).

yjgyhj10y ago

One thing I've noticed since coding with immutable data structures & functions (rather than mutable OOP programs) is how tests run really fast, and are easy to run in parallel.

This is coding in Clojure for me, but you can do that in any language that has functions (preferably with efficient persistent data structures, like the tree-based PersistentVector in Clojure).

schneems10y ago

I mostly only ever hear about how fast FP languages are, so maybe they use some tricks to avoid allocations somehow. I would be interested in hearing more about it.

yjgyhj10y ago

Yes, mutating something at one place in memory is more efficient, because you don't need to allocate new memory.

Idk if I can properly explain. Found this blog post very interesting and easier to follow: http://hypirion.com/musings/understanding-persistent-vector-...
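A toy sketch of the structural-sharing idea behind persistent vectors may help here. It is a simplification, not Clojure's implementation: the tree is fixed at two levels, the branching factor is 4 rather than Clojure's 32, and `build`, `get`, and `assoc` are names chosen for the example. An "update" copies only the one leaf on the path to the changed slot plus the root, and reuses every other leaf.

```python
BRANCH = 4  # Clojure uses 32; 4 keeps the example readable


def build(items):
    """Pack exactly BRANCH * BRANCH items into a two-level tuple tree."""
    assert len(items) == BRANCH * BRANCH
    return tuple(
        tuple(items[i:i + BRANCH]) for i in range(0, len(items), BRANCH)
    )


def get(tree, index):
    """Look up one element by walking root -> leaf -> slot."""
    return tree[index // BRANCH][index % BRANCH]


def assoc(tree, index, value):
    """Return a new tree with one slot changed; untouched leaves are shared."""
    leaf_no, slot = divmod(index, BRANCH)
    old_leaf = tree[leaf_no]
    new_leaf = old_leaf[:slot] + (value,) + old_leaf[slot + 1:]
    return tree[:leaf_no] + (new_leaf,) + tree[leaf_no + 1:]
```

The old version stays valid after `assoc`, and the new allocation is proportional to the path length rather than to the size of the whole vector.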

1 more reply

JBiserkov10y ago

You are of course correct. I recommend you check out these 5 posts on Persistent vectors in Clojure http://hypirion.com/musings/understanding-persistent-vector-... "Spoiler from part 4":

>Transients are an optimisation on persistent data structures, which decreases the amount of memory allocations needed. This makes a huge difference for performance critical code.

http://hypirion.com/musings/understanding-clojure-transients

birdsbolt10y ago

Allocations don't have to be expensive if your GC is smart. Smart as C++ destructors positioning :D

1 more reply

Ono-Sendai10y ago

You can allocate stuff on the stack.

1 more reply

jtchang10y ago

Love this. Sometimes testing can be a huge pain in the ass. I know more than one project I work on where getting them to run is a lot of effort in itself.

sigil10y ago

This made me long for a unit test framework as simple as:

    $ make -j36 test

Where you've got something like the following:

    $ find tests/

      tests/bin/A
      tests/bin/B
      ...
      tests/input/A
      tests/input/B
      ...
      tests/expected/A
      tests/expected/B
      ...
      tests/output/

    $ cat Makefile

      test : $(shell find tests/bin -type f | sed -e 's@/bin/@/output/@')
      
      tests/output/% : tests/bin/% tests/input/% tests/expected/%
              @ printf "testing [%s] ... " $@
              @ sh -c 'exec $$0 < $$1' $^ > $@
              @ # ...runs tests/bin/% < tests/input/% > tests/output/%
              @ sh -c 'exec cmp -s $$3 $$0' $@ $^ && echo pass || echo fail
              @ # ...runs cmp -s tests/expected/% tests/output/%
     
      clean :
              rm -f tests/output/*

atonse10y ago

On a previous project, I had built a shell script that essentially created n mysql databases and just distributed the test files under n rails processes.

We were able to run tests that took an hour in about 3 minutes. It was good enough for us. Nothing sophisticated for evenly balancing the test files, but it was pretty good for 1-2 days of work.
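A sketch of the "nothing sophisticated" split described above, under stated assumptions: files are dealt out round-robin with no attempt to balance by runtime, and each shard gets its own database so the processes cannot trample each other's rows. The function names and the `prefix_index` naming scheme are invented for the example. The original was a shell script and its details are not given.

```python
def shard_round_robin(test_files, n):
    """Split files into n groups by dealing them out like cards."""
    return [test_files[i::n] for i in range(n)]


def database_name(prefix, shard_index):
    """Name the private database for one shard (naming scheme is an assumption)."""
    return f"{prefix}_{shard_index}"
```

Each group would then be handed to its own test process pointed at the matching database. Balancing by recorded file durations is the obvious next step if one shard consistently finishes last.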

vkjv10y ago

But, then how do you catch bugs where shared mutable state is not compatible with multiple changes?

praxulus10y ago

You write tests specifically for testing multiple changes. You shouldn't be testing changes to global state by seeing how multiple supposedly independent tests interact.

ambicapter10y ago

arturhoo10y ago

nelhageOP10y ago

Ono-Sendai10y ago

Aqua_Geek10y ago

This isn't just a problem with Ruby. Our ObjC test suite for a project I work on takes about 10 min to run, too.

Ono-Sendai10y ago

1 more reply

raverbashing10y ago

I guess a lot of problems come from the stupidly brain dead way people usually write tests (because it's the "recommended TDD way")

Things like using the same setup function for every test and setting up/tearing down for every test regardless of dependencies

Also tests like

    def test1():
      do_a() 
      check_condition_X()

then

    def test2():
      do_a() 
      check_condition_Y()

    def test1():
      do_a() 
      check_condition_X()

    def test2():
      do_a()
      do_b()
      check_condition_Y()

When it could have been consolidated into 1 test

Then people wonder why it takes so much time?

Also helpful is if you can skip database setup for tests that don't need it

aianus10y ago

The time it saves me when I see 'test2' failed instead of 'test_enormous:137' failed is worth more than the marginal computation required.

These are embarrassingly parallel problems, we just need better tools to fully saturate every core on every node in the test cluster.

hinkley10y ago

raverbashing10y ago

hinkley10y ago

mocha and jasmine (in the node/javascript space) support nested setup and teardown methods and it's been really challenging for me to go back to using other frameworks, languages.

twerquie10y ago

falsedan10y ago

We've noticed that starting and stopping a ton of docker containers in rapid succession really hoses dockerd, also that Jenkins' API is a lot slower than we expected for mostly-read-only operations.

Have you considered mesos?

badmadrad10y ago

Have you considered another containerization solution like LXD? I feel like testing like this fits the "container hypervisor" use case, and this is what LXD is designed to do.

falsedan10y ago

akoumjian10y ago

Anything significantly different in your Python implementation?

falsedan10y ago

Hard to say anything more than what I posted without more details from Stripe.

lawrencewu10y ago

hi frei

falsedan10y ago

Other Dan…

cthyon10y ago

Not sure if this has already been answered, but would Stripe's methods only work with unit tests where tests are not dependent on each other?

matthewmacleod10y ago

How would one go about building a similar distributed testing setup for end-to-end tests where a sequence of tests has to be run in a particular order?

givehimagun10y ago

I'd love to know if their integration tests use a database or reference external services of any sort.

notduncansmith10y ago

hinkley10y ago

needle scratching on record

They have an average of 9 assertions per test case. I think I may see part of their problem.

junto10y ago

I'm not sure if you are talking from a performance perspective or a conceptual perspective, but this provides a useful discussion on multiple assertions:

http://programmers.stackexchange.com/questions/7823/is-it-ok...

My 2 cents is that multiple assertions are legitimate, as long as they prove a singular assumption. Hence (as per the test on that page), this is a valid use of multiple assertions:

  [Test]
  public void ValueIsInRange()
  {
    int value = GetValueToTest();

    Assert.That(value, Is.GreaterThan(10), "value is too small");
    Assert.That(value, Is.LessThan(100), "value is too large");
  }

hinkley10y ago

[Edit] Thanks for the link. I have a whole bunch of comments in there to upvote. Guess my evening is planned :) [Edit]

This should go without saying, but I feel I have to repeat it every time there's an audience:

Green is not the end goal of testing. Red when there is an actual problem is the end goal of testing. Anything else is a very expensive way to consume resources.
