I’ve been doing Ruby for a decade and the `matches?` method immediately stood out to me. Methods doing any kind of comparison/regex or looping usually end up causing problems. Nokogiri was walking a tree of child nodes and doing a string comparison. I’ve seen people write themselves into similar problems with methods like `any?`, `include?`, `exclude?`, etc. Par for the course.
Going from

    HANDLERS = [
      Text,
      List,
      ListItem,
      Code,
      # ...
    ].freeze

to

    HANDLERS_BY_NODE_NAMES = [
      Text,
      List,
      ListItem,
      Code,
      # ...
    ].each_with_object({}) do |handler, result|
      handler::NODE_NAMES.each { |node_name| result[node_name] = handler }
    end.freeze

Two shorter ways to build the same table:

    BY_NODE_NAMES = HANDLERS.map { |h|
      h::NODE_NAMES.map { |n| [n, h] }
    }.flatten(1).to_h

    HANDLERS.flat_map { _1.node_names.index_with(_1) }.inject(&:merge)

(nb: assuming there exists a `.node_names` to expose the constant... just because I like always using method calls)

It should be almost 1,000× faster for this kind of CSS lookup.
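To make the idea concrete, here is a self-contained sketch of the hash-keyed dispatch; the handler classes and their `NODE_NAMES` values are hypothetical stand-ins for the post's real ones:

```ruby
# Hypothetical handler classes, each declaring which node names it handles.
class Text;     NODE_NAMES = %w[text].freeze;     end
class List;     NODE_NAMES = %w[ul ol].freeze;    end
class ListItem; NODE_NAMES = %w[li].freeze;       end
class Code;     NODE_NAMES = %w[code pre].freeze; end

HANDLERS = [Text, List, ListItem, Code].freeze

# Build the lookup table once, at load time.
HANDLERS_BY_NODE_NAME = HANDLERS.each_with_object({}) do |handler, result|
  handler::NODE_NAMES.each { |node_name| result[node_name] = handler }
end.freeze

# Dispatch is now a single hash lookup instead of scanning HANDLERS
# and running a matcher against each candidate.
HANDLERS_BY_NODE_NAME["li"]  # => ListItem
HANDLERS_BY_NODE_NAME["pre"] # => Code
```

An unknown node name simply returns `nil`, which the caller can treat as "no handler".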
As in, they are matched right-to-left, which implies that a selector like “p a” first selects all the <a> nodes, and for each of them, it then traverses up the DOM tree until it encounters a <p> node (selector matches) or the root node (selector doesn’t match).
That said, the traversing shouldn’t happen for plain tag selectors like “h1”. There must be something wrong with the library they used.
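Right-to-left matching of a descendant selector like “p a” can be sketched with a toy node structure (not any real DOM API):

```ruby
# Toy DOM node: a tag name plus a parent pointer.
Node = Struct.new(:tag, :parent)

# Match "p a" right-to-left: the node itself must be an <a>,
# then walk up the ancestor chain looking for a <p>.
def matches_p_a?(node)
  return false unless node.tag == "a"
  ancestor = node.parent
  while ancestor
    return true if ancestor.tag == "p"
    ancestor = ancestor.parent
  end
  false # reached the root without seeing a <p>
end

root = Node.new("html", nil)
p_el = Node.new("p", root)
a1   = Node.new("a", p_el)  # <html><p><a> ... matches
a2   = Node.new("a", root)  # <html><a>    ... doesn't
matches_p_a?(a1) # => true
matches_p_a?(a2) # => false
```

Note how a bare tag selector like “h1” would only need the first comparison, with no ancestor walk at all.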
For sure, but that's somewhat reductive. This is not exactly common knowledge. Certainly not something you would expect to immediately spring to mind for any given engineer.
Intuition suggests to me that it wouldn’t start with CSS and then find all the matching DOM nodes. I would expect it to start at each DOM node and then find the CSS rules which might apply.
So “I’m adding an A to the tree; what are all the CSS rules with A or * as the rightmost token; which of that set applies to my current A; apply the rules in that subset”. Going depth-first into the DOM like this should result in skipping redundant CSS and (as my imagination draws it) reduce DOM traversals.
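That intuition (bucket the stylesheet's rules by their rightmost token, so inserting an <a> only considers the `a` and `*` buckets) might look like this; the flat rule representation here is entirely hypothetical:

```ruby
# Hypothetical flat rule list: selector text plus its rightmost token.
RULES = [
  { selector: "p a", key: "a" },
  { selector: "a",   key: "a" },
  { selector: "h1",  key: "h1" },
  { selector: "*",   key: "*" },
].freeze

# Index rules once by rightmost token.
RULES_BY_KEY = RULES.group_by { |r| r[:key] }.freeze

# When adding a node to the tree, only its own bucket and the "*"
# bucket are candidates; every other rule is skipped without matching.
def candidate_rules(tag)
  RULES_BY_KEY.fetch(tag, []) + RULES_BY_KEY.fetch("*", [])
end

candidate_rules("a").map  { |r| r[:selector] } # => ["p a", "a", "*"]
candidate_rules("h1").map { |r| r[:selector] } # => ["h1", "*"]
```

Each candidate still has to be fully matched (e.g. the ancestor walk for "p a"), but the bucket lookup prunes the rest of the stylesheet up front.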
(a) Element#matches
(b) Element#querySelector(All)
(c) By the engine for updating style and layout
The GP seems to be talking about (b), but even then browsers check each element one by one, not advancing through the selector state machine in parallel for every element. (There's one exception in the old Cobalt, which did advance the state machines, IIRC.)

(a) and (c) are conceptually very similar, except that when doing (c) you're checking many elements at the same time, so browsers will pay extra upfront costs like filling Bloom filters for ancestors or index maps for nth-child.
In TFA they're doing .matches(), which I would expect to be slower than a hash map lookup, but for a simple selector like they're using (just a tag name) it shouldn't do much more than:
(1) Parse the selector, hopefully cache that in an LRU
(2) Execute the selector state machine against the element
(2.1) Compare tagName in the selector
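For a bare tag selector, those steps reduce to almost nothing. A rough sketch, with a deliberately naive cache standing in for a real LRU:

```ruby
# Step (1): parse the selector once and cache the result.
# A real engine would use a bounded LRU; a plain Hash stands in here.
SELECTOR_CACHE = {}

def parse_selector(text)
  SELECTOR_CACHE[text] ||= { tag: text.strip.downcase }
end

# Steps (2)/(2.1): for a bare tag selector, "executing" the selector
# state machine is a single tagName comparison.
def element_matches?(element_tag, selector_text)
  parse_selector(selector_text)[:tag] == element_tag.downcase
end

element_matches?("CODE", "code") # => true
element_matches?("pre", "code")  # => false
```

The second call with the same selector text skips parsing entirely, which is the point of caching in step (1).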
Apparently Nokogiri implements CSS in a very inefficient way, though, by collecting ancestors and then converting the CSS into XPath and matching that:

https://github.com/sparklemotion/nokogiri/blob/e8d30a71d70b2...

https://github.com/sparklemotion/nokogiri/blob/e8d30a71d70b2...
I'd expect that to be an order of magnitude slower than what a browser does.
- 20% of the times you get lucky and find a very easy win that speeds up things 90%+. Similar to this post, usually when a single method/call takes a huge chunk of the work.
- 50% of the times you grind at it and can get 30-50% speed up. I usually try many things, and only some of them do make a difference.
- 30% of the time absolutely no luck! Many small calls where each is unavoidable, no repeated code, etc.
The code defines what the state should look like after it's done executing. It expresses your intent. But that code gets transformed several times on the way to being executed, and then the hardware can apply many different possible approaches to executing it when the time comes.
More so every year, many of those software transformations, as well as the hardware's execution techniques, are quite aggressive about revisiting your program's intent with optimizations that make sense within that context.
The upshot is that the farther you are from your hardware, the more of these layers there are between your code and its execution, the less influence and insight you have over what actually "physically" happens during execution.
When it comes to profiling and optimization of high-level programs like those written in JavaScript, this means it can be somewhere between hard and impossible to predict how your code changes will actually impact performance.
Radical algorithm redesign can often yield salient differences that feel largely predictable, but smaller "precision" changes are often going to be a crap shoot. All those layers between you and the hardware were making optimizations already anyway, and your "precision" change may just as easily confound those existing optimizations as trigger new ones. The results are tricky.
This is even true in lower-level code, where we're encouraged to do things like inspect compiler output on godbolt or in our compilation output, and to always confirm our expectations with a profiler (which often proves our guesses wrong). But it's all that much more pronounced in high-level languages.
So ultimately, yes, assuming your prevailing algorithms are generally optimal, profiling and optimization is almost always going to feel like a guess-and-test process. But that's okay, because you can test and those tests are usually (not always) telling you if you've made a meaningful difference or not.
0) algorithmic improvement. Obvious shit like do a quick sort instead of bubble sort (assuming N > 64, or whatever), not doing unnecessary work in a hot loop, etc
1) reduce memory footprint. The slowest part of your program is almost always just waiting for memory, unless you're doing something that's heavily CPU bound. Web applications are probably always memory bound. Reducing the amount of memory the function you're optimizing operates on reduces DCache misses, which are expensive.
2) Do batch operations. Once I've got something to a point where it's not completely braindead (which, honestly, is where I stop most of the time), I look to start batching things. I usually try to do 8 or 16 at a time in the hopes the compiler/runtime can make some use of SIMD. Use STATIC LOOPS, i.e. `for 0..8`, so the compiler can unroll the loop. That's extremely important.
3) probably unavailable (unless you want to/can drop into WASM), but the next step is usually SIMD. This is a rabbit hole, but if you want/need another ~8x perf improvement, this is how to get it
4) once all that's done, it's probably close to optimal in terms of cycles per element (unless I did something boneheaded, which is common). Last step is to multithread it if it needs even more juice. This can range from trivial to completely impossible depending on the algorithm. In JS land, you need to make sure you operate on SharedArrayBuffers when doing multithreading for performance, because web workers copy the input/output values by default.
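Point (1) above, shrinking the memory each element drags along, shows up even at the Ruby level. A CRuby-specific sketch using the stdlib `objspace` extension to compare per-record container sizes (the exact byte counts depend on the Ruby version, so treat the numbers as illustrative):

```ruby
require "objspace"

# One record as a Hash: every row carries hash-table overhead.
hash_rows = Array.new(1_000) { { x: 1.0, y: 2.0 } }

# Same data as a Struct: fixed slots, less per-row overhead,
# so the hot data is packed more densely in memory.
Point = Struct.new(:x, :y)
struct_rows = Array.new(1_000) { Point.new(1.0, 2.0) }

hash_bytes   = hash_rows.sum   { |r| ObjectSpace.memsize_of(r) }
struct_bytes = struct_rows.sum { |r| ObjectSpace.memsize_of(r) }

# On CRuby, struct_bytes is typically well below hash_bytes.
```

The same principle drives array-of-structs vs. struct-of-arrays rewrites in lower-level languages; here the measurable effect is simply fewer bytes per element.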
Anywhoo.. maybe that helps.
When I try to optimize something lightly, it's not uncommon for me to get 10x improvement fairly easily. When I optimize something to within an inch of its life, I can sometimes get three or even four orders of magnitude faster.
I also should mention that I hate flamegraphs. They only give you a bare minimum of information for doing performance work. I'm not sure of a good JS profiler, but what you want to be able to do is mark up the sections of code you want profiled, instead of the profiler taking random samples and squashing them all together. Look at the Tracy profiler for an example.
IO bound and particularly network bound code is common too. The first fix I'd try with network bound code is to either eliminate the network call (local cache? turn a microservice into a library?) or to batch operations.
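Batching a network-bound path just means turning N round-trips into one. A sketch against a hypothetical client class (the `UserClient` API below is invented for illustration; the in-memory bodies stand in for real network calls):

```ruby
# Hypothetical remote client; imagine each method costs a round-trip.
class UserClient
  def fetch(id)
    { id: id }                    # imagine: 1 round-trip per call
  end

  def fetch_many(ids)
    ids.map { |id| { id: id } }   # imagine: 1 round-trip total
  end
end

client = UserClient.new

# N round-trips:
slow = (1..100).map { |id| client.fetch(id) }

# 1 round-trip for the same data:
fast = client.fetch_many((1..100).to_a)
```

The results are identical; only the number of network waits changes, which is usually what dominates in network-bound code.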
> Last step is to multithread it if it needs even more juice
In web app land, this is fraught with peril if you're doing it on the server, because it means your code is now competing for n times the resources. Often it's better for one request to take a long time if it means it's using a more predictable amount of memory, not causing other requests to time out, not exhausting your DB connection pool, etc.
I imagine that systems programming is similar in some ways and that's why multithreading is the last resort, just mentioning it because it's easy to shoot oneself in the foot with parallelism.
    class HtmlTransform
      class Code < Base
        def markdown
          "`#{node.text}`"
        end
      end
    end

Also don't replace string comparison with CSS selector search and expect it to be fast.
What’s not clear is why these languages should expose more low-level performance tuning, such as multithreaded Python. This removes invariants which make the language friendly, so that experts can squeeze out a constant speed-up factor which is already an order of magnitude off from the proper tool.