I’ve been doing Ruby for a decade and the `matches?` method immediately stood out to me. Methods doing any kind of comparison/regex or looping usually end up causing problems. Nokogiri was walking a tree of child nodes and doing a string comparison. I’ve seen people write themselves into similar problems with methods like `any?`, `include?`, `exclude?`, etc. Par for the course.
Going from

    HANDLERS = [
      Text,
      List,
      ListItem,
      Code,
      # ...
    ].freeze

to

    HANDLERS_BY_NODE_NAMES = [
      Text,
      List,
      ListItem,
      Code,
      # ...
    ].each_with_object({}) do |handler, result|
      handler::NODE_NAMES.each { |node_name| result[node_name] = handler }
    end.freeze

Two shorter ways to build the same table:

    BY_NODE_NAMES = HANDLERS.map { |h|
      h::NODE_NAMES.map { |n| [n, h] }
    }.flatten(1).to_h

    HANDLERS.flat_map { _1.node_names.index_with(_1) }.inject(&:merge)

(nb: assuming there exists a `.node_names` to expose the constant... just because I like always using method calls)

It should be almost 1,000× faster for this kind of CSS lookup.
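To make the idea concrete, here is a self-contained sketch of the hash-keyed dispatch; the handler classes and their `NODE_NAMES` values are hypothetical stand-ins for the post's real ones:

```ruby
# Hypothetical handler classes, each declaring which node names it handles.
class Text;     NODE_NAMES = %w[text].freeze;     end
class List;     NODE_NAMES = %w[ul ol].freeze;    end
class ListItem; NODE_NAMES = %w[li].freeze;       end
class Code;     NODE_NAMES = %w[code pre].freeze; end

HANDLERS = [Text, List, ListItem, Code].freeze

# Build the lookup table once, at load time.
HANDLERS_BY_NODE_NAME = HANDLERS.each_with_object({}) do |handler, result|
  handler::NODE_NAMES.each { |node_name| result[node_name] = handler }
end.freeze

# Dispatch is now a single hash lookup instead of scanning HANDLERS
# and running a matcher against each candidate.
HANDLERS_BY_NODE_NAME["li"]  # => ListItem
HANDLERS_BY_NODE_NAME["pre"] # => Code
```

An unknown node name simply returns `nil`, which the caller can treat as "no handler".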
As in, they are matched right-to-left, which implies that a selector like “p a” first selects all the <a> nodes, and for each of them, it then traverses up the DOM tree until it encounters a <p> node (selector matches) or the root node (selector doesn’t match).
That said, the traversing shouldn’t happen for plain tag selectors like “h1”. There must be something wrong with the library they used.
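Right-to-left matching of a descendant selector like “p a” can be sketched with a toy node structure (not any real DOM API):

```ruby
# Toy DOM node: a tag name plus a parent pointer.
Node = Struct.new(:tag, :parent)

# Match "p a" right-to-left: the node itself must be an <a>,
# then walk up the ancestor chain looking for a <p>.
def matches_p_a?(node)
  return false unless node.tag == "a"
  ancestor = node.parent
  while ancestor
    return true if ancestor.tag == "p"
    ancestor = ancestor.parent
  end
  false # reached the root without seeing a <p>
end

root = Node.new("html", nil)
p_el = Node.new("p", root)
a1   = Node.new("a", p_el)  # <html><p><a> ... matches
a2   = Node.new("a", root)  # <html><a>    ... doesn't
matches_p_a?(a1) # => true
matches_p_a?(a2) # => false
```

Note how a bare tag selector like “h1” would only need the first comparison, with no ancestor walk at all.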
For sure, but that's somewhat reductive. This is not exactly common knowledge. Certainly not something you would expect to immediately spring to mind for any given engineer.
Intuition suggests to me that it wouldn’t start with CSS and then find all the matching DOM nodes. I would expect it to start at each DOM node and then find the CSS rules which might apply.
So “I’m adding an A to the tree; what are all the CSS rules with A or * as the rightmost token; which of that set applies to my current A; apply the rules in that subset”. Going depth-first into the DOM like this should result in skipping redundant CSS and (as my imagination draws it) reduce DOM traversals.
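That intuition (bucket the stylesheet's rules by their rightmost token, so inserting an <a> only considers the `a` and `*` buckets) might look like this; the flat rule representation here is entirely hypothetical:

```ruby
# Hypothetical flat rule list: selector text plus its rightmost token.
RULES = [
  { selector: "p a", key: "a" },
  { selector: "a",   key: "a" },
  { selector: "h1",  key: "h1" },
  { selector: "*",   key: "*" },
].freeze

# Index rules once by rightmost token.
RULES_BY_KEY = RULES.group_by { |r| r[:key] }.freeze

# When adding a node to the tree, only its own bucket and the "*"
# bucket are candidates; every other rule is skipped without matching.
def candidate_rules(tag)
  RULES_BY_KEY.fetch(tag, []) + RULES_BY_KEY.fetch("*", [])
end

candidate_rules("a").map  { |r| r[:selector] } # => ["p a", "a", "*"]
candidate_rules("h1").map { |r| r[:selector] } # => ["h1", "*"]
```

Each candidate still has to be fully matched (e.g. the ancestor walk for "p a"), but the bucket lookup prunes the rest of the stylesheet up front.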
(a) Element#matches
(b) Element#querySelector(All)
(c) By the engine for updating style and layout
The GP seems to be talking about (b), but even then browsers check each element one by one, not advancing through the selector state machine in parallel for every element. (There's one exception in the old Cobalt, which did advance the state machines, IIRC.)

(a) and (c) are conceptually very similar, except that when doing (c) you're checking many elements at the same time, so browsers will pay extra upfront costs like filling Bloom filters for ancestors or index maps for nth-child.
In TFA they're doing .matches(), which I would expect to be slower than a hash map lookup, but for a simple selector like they're using (just a tag name) it shouldn't do much more than:
(1) Parse the selector, hopefully cache that in an LRU
(2) Execute the selector state machine against the element
(2.1) Compare tagName in the selector
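For a bare tag selector, those steps reduce to almost nothing. A rough sketch, with a deliberately naive cache standing in for a real LRU:

```ruby
# Step (1): parse the selector once and cache the result.
# A real engine would use a bounded LRU; a plain Hash stands in here.
SELECTOR_CACHE = {}

def parse_selector(text)
  SELECTOR_CACHE[text] ||= { tag: text.strip.downcase }
end

# Steps (2)/(2.1): for a bare tag selector, "executing" the selector
# state machine is a single tagName comparison.
def element_matches?(element_tag, selector_text)
  parse_selector(selector_text)[:tag] == element_tag.downcase
end

element_matches?("CODE", "code") # => true
element_matches?("pre", "code")  # => false
```

The second call with the same selector text skips parsing entirely, which is the point of caching in step (1).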
Apparently Nokogiri implements CSS in a very inefficient way, though, by collecting ancestors and then converting the CSS into XPath and matching that:

https://github.com/sparklemotion/nokogiri/blob/e8d30a71d70b2...

https://github.com/sparklemotion/nokogiri/blob/e8d30a71d70b2...
I'd expect that to be an order of magnitude slower than what a browser does.
- 20% of the times you get lucky and find a very easy win that speeds up things 90%+. Similar to this post, usually when a single method/call takes a huge chunk of the work.
- 50% of the times you grind at it and can get 30-50% speed up. I usually try many things, and only some of them do make a difference.
- 30% of the time absolutely no luck! Many small calls where each is unavoidable, no repeated code, etc.
The code defines what the state should look like after it's done executing. It expresses your intent. But that code gets transformed several times on the way to being executed, and then the hardware can apply many different possible approaches to executing it when the time comes.
More so every year, many of those software transformations, as well as the hardware's execution techniques, are quite aggressive about revisiting your program's intent with optimizations that make sense within that context.
The upshot is that the farther you are from your hardware, the more of these layers there are between your code and its execution, the less influence and insight you have over what actually "physically" happens during execution.
When it comes to profiling and optimization of high-level programs like those written in JavaScript, this means it can be somewhere between hard and impossible to predict how your code changes will actually impact performance.
Radical algorithm redesign can often yield salient differences that feel largely predictable, but smaller "precision" changes are often going to be a crap shoot. All those layers between you and the hardware were making optimizations already anyway, and your "precision" change may just as easily confound those existing optimizations as trigger new ones. The results are tricky.
This is even true in lower-level code, where we're encouraged to do things like inspect compiler output on godbolt or in our compilation output, and to always confirm our expectations with a profiler (which often proves our guesses wrong). But it's all that much more pronounced in high-level languages.
So ultimately, yes, assuming your prevailing algorithms are generally optimal, profiling and optimization is almost always going to feel like a guess-and-test process. But that's okay, because you can test and those tests are usually (not always) telling you if you've made a meaningful difference or not.
0) algorithmic improvement. Obvious shit like do a quick sort instead of bubble sort (assuming N > 64, or whatever), not doing unnecessary work in a hot loop, etc
1) reduce memory footprint. The slowest part of your program is almost always just waiting for memory, unless you're doing something that's heavily CPU bound. Web applications are probably always memory bound. Reducing the amount of memory the function you're optimizing operates on reduces DCache misses, which are expensive.
2) Do batch operations. Once I've got something to a point where it's not completely braindead (which, honestly, is where I stop most of the time), I look to start batching things. I usually try to do 8 or 16 at a time in the hopes the compiler/runtime can make some use of SIMD. Use STATIC LOOPS, i.e. `for 0..8`, so the compiler can unroll the loop. That's extremely important.
3) probably unavailable (unless you want to/can drop into WASM), but the next step is usually SIMD. This is a rabbit hole, but if you want/need another ~8x perf improvement, this is how to get it
4) once all that's done, it's probably close to optimal in terms of cycles per element (unless I did something boneheaded, which is common). Last step is to multithread it if it needs even more juice. This can range from trivial to completely impossible depending on the algorithm. In JS land, you need to make sure you operate on SharedArrayBuffers when doing multithreading for performance, because web workers copy the input/output values by default.
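Point (1) above, shrinking the memory each element drags along, shows up even at the Ruby level. A CRuby-specific sketch using the stdlib `objspace` extension to compare per-record container sizes (the exact byte counts depend on the Ruby version, so treat the numbers as illustrative):

```ruby
require "objspace"

# One record as a Hash: every row carries hash-table overhead.
hash_rows = Array.new(1_000) { { x: 1.0, y: 2.0 } }

# Same data as a Struct: fixed slots, less per-row overhead,
# so the hot data is packed more densely in memory.
Point = Struct.new(:x, :y)
struct_rows = Array.new(1_000) { Point.new(1.0, 2.0) }

hash_bytes   = hash_rows.sum   { |r| ObjectSpace.memsize_of(r) }
struct_bytes = struct_rows.sum { |r| ObjectSpace.memsize_of(r) }

# On CRuby, struct_bytes is typically well below hash_bytes.
```

The same principle drives array-of-structs vs. struct-of-arrays rewrites in lower-level languages; here the measurable effect is simply fewer bytes per element.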
Anywhoo.. maybe that helps.
When I try to optimize something lightly, it's not uncommon for me to get 10x improvement fairly easily. When I optimize something to within an inch of its life, I can sometimes get three or even four orders of magnitude faster.
I also should mention that I hate flamegraphs. They only give you a bare minimum of information for doing performance work. I'm not sure of a good JS profiler, but what you want to be able to do is mark up the sections of code you want profiled, instead of the profiler taking random samples and squashing them all together. Look at the Tracy profiler for an example.
IO bound and particularly network bound code is common too. The first fix I'd try with network bound code is to either eliminate the network call (local cache? turn a microservice into a library?) or to batch operations.
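Batching a network-bound path just means turning N round-trips into one. A sketch against a hypothetical client class (the `UserClient` API below is invented for illustration; the in-memory bodies stand in for real network calls):

```ruby
# Hypothetical remote client; imagine each method costs a round-trip.
class UserClient
  def fetch(id)
    { id: id }                    # imagine: 1 round-trip per call
  end

  def fetch_many(ids)
    ids.map { |id| { id: id } }   # imagine: 1 round-trip total
  end
end

client = UserClient.new

# N round-trips:
slow = (1..100).map { |id| client.fetch(id) }

# 1 round-trip for the same data:
fast = client.fetch_many((1..100).to_a)
```

The results are identical; only the number of network waits changes, which is usually what dominates in network-bound code.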
> Last step is to multithread it if it needs even more juice
In web app land, this is fraught with peril if you're doing it on the server, because it means your code is now competing for n times the resources. Often it's better for one request to take a long time if it means it's using a more predictable amount of memory, not causing other requests to time out, not exhausting your DB connection pool, etc.
I imagine that systems programming is similar in some ways and that's why multithreading is the last resort, just mentioning it because it's easy to shoot oneself in the foot with parallelism.
    class HtmlTransform
      class Code < Base
        def markdown
          "`#{node.text}`"
        end
      end
    end

Also don't replace string comparison with CSS selector search and expect it to be fast.
What’s not clear is why these languages should expose more low-level performance tuning, such as multithreaded Python. This removes invariants which make the language friendly, so that experts can squeeze out a constant speed-up factor which is already an order of magnitude off from the proper tool.