undefined | Better HN

0 pointsburntsushi9y ago0 comments

Did you know Rust was quite a bit faster than C in regexdna merely a few months ago? It didn't get slower because of Rust. The algorithms employed are radically different. My hope is that the regex library has already regained performance, but until the benchmark game is updated (which is on us, not the benchmark game maintainer), I suppose we'll have to suffer the pedants!

Or perhaps, you might look at single threaded performance and wonder, maybe there is something more interesting going on than a naive surface analysis of C vs. Rust! :-) https://benchmarksgame.alioth.debian.org/u64/rust.php

And by the way, transliterating a regex library isn't trivial. I invite you to transliterate Tcl's regex library. Let me know how that goes. ;-) So I think your reasoning is specious at best.

0 comments

igouy9y ago

> It didn't get slower because of Rust.

Do you mean the program became relatively slower because of changes you've made to the regex crate?

Wasn't the program relatively faster because you wrote the regex crate to use Aho-Corasick for the matches required by the regex-dna task?

burntsushiOP9y ago

> Do you mean the program became relatively slower because of changes you've made to the regex crate?

Yes. The underlying reasoning is complex. When the regex crate got a lazy DFA (similar to the one used by RE2), the vast majority of regexes got significantly faster. Some got slower. This one in particular from regex-dna:

    >[^\n]*\n|\n

Before the lazy DFA, compile time analysis would notice that all matches either start with `>` or `\n` and do a fast prefix scan for them. Each match of `>` or `\n` represents a candidate for a match. Candidates were then verified using something similar to the Thompson NFA, which is generally pretty slow, but the prefix scanning reduced the amount of work required considerably.

Once the lazy DFA was added, the prefix scanning was still used, but the lazy DFA was used to verify candidates. It's faster in general by a lot, but, the lazy DFA requires two scans of the candidate: one to find the end position and another to find the start position. That extra scan made processing this regex (on the regex-dna input) slightly slower.

I've since fixed some of this by reducing a lot of the match overhead of the lazy DFA, so my hope is that it's back to par, but I haven't done any rigorous benchmarking to verify that.

> Wasn't the program relatively faster because you wrote the regex crate to use Aho-Corasick for the matches required by the regex-dna task?

Aho-Corasick is principally useful for the second phase of regex-dna, e.g., the regexes that look like `ag[act]gtaaa|tttac[agt]ct`. (In the last phase, all the regexes are just single byte literals, so neither Aho-Corasick nor the regex engine should ever be used.) Performance here should stay the same.

On that note, I have a new nightly-only algorithm called Teddy that uses SIMD[1] (which replaces the use of Aho-Corasick for those regexes) and is a bit faster. I got the algorithm from the Hyperscan[2] project, which also does extensive literal analysis to speed up regexes.

To clarify, this optimization is generally useful because a lot of regexes in the wild have prefix literals. Even something like `(?i:foo)\s+bar` can benefit from it, since `(?i:foo)` expands to FOO, FOo, FoO, Foo, fOO, fOo, foO, foo, which can of course be used with Aho-Corasick (and also my new SIMD algorithm).

One also must wonder how well a C program using PCRE2's JIT would fair on the benchmarks game. From my experience, it would probably be near the top. It's quite fast!

[1] - https://github.com/rust-lang-nursery/regex/blob/master/src/s...

[2] - https://github.com/01org/hyperscan

igouy9y ago

> One also must wonder how well a C program using PCRE2's JIT would fair on the benchmarks game.

Let's hope some C and C++ programmers take up the challenge ;-)

igouy9y ago

Please don't point people to u64 -- it's no longer updated. (Note the rustc version.)

j / k navigate · click thread line to collapse

0 comments

igouy9y ago

> It didn't get slower because of Rust.

Do you mean the program became relatively slower because of changes you've made to the regex crate?

Wasn't the program relatively faster because you wrote the regex crate to use Aho-Corasick for the matches required by the regex-dna task?

burntsushiOP9y ago

> Do you mean the program became relatively slower because of changes you've made to the regex crate?

    >[^\n]*\n|\n

I've since fixed some of this by reducing a lot of the match overhead of the lazy DFA, so my hope is that it's back to par, but I haven't done any rigorous benchmarking to verify that.

> Wasn't the program relatively faster because you wrote the regex crate to use Aho-Corasick for the matches required by the regex-dna task?

One also must wonder how well a C program using PCRE2's JIT would fair on the benchmarks game. From my experience, it would probably be near the top. It's quite fast!

[1] - https://github.com/rust-lang-nursery/regex/blob/master/src/s...

[2] - https://github.com/01org/hyperscan

igouy9y ago

> One also must wonder how well a C program using PCRE2's JIT would fair on the benchmarks game.

Let's hope some C and C++ programmers take up the challenge ;-)

igouy9y ago

Please don't point people to u64 -- it's no longer updated. (Note the rustc version.)

j / k navigate · click thread line to collapse