I played with making a regex library in rust. Which, as per RE2 design involves constructing graphs and glueing them together as the regex is traversed
This requires a cycle catching gc, or, just a preallocated arena... It was my first foray into rust and felt I would need to be hitting into unsafe, which I wasn't ready for. Array indexing might decompose into an arena, but syntactically just a bit messier (imho)
Would be interesting to see how the RE2 does it in rust (didn't know that)
I like how the article shows both sides of the fence, it makes me realize:
I get a lot of optimizations from ptr stuffing in c. But sometimes we should lay down the good, for the better
For reference, I am also the author of the regex crate. The only unsafe it uses specific to finite automata is to do explicit elimination of bounds checks in the core hybrid NFA/DFA loop.
As an old C programmer, the difference between an array index and a pointer caught me by surprise. In C a pointer is just an unchecked offset into memory. A real array index is just a unchecked offset into ... maybe a smaller chunk of raw memory.
But in rust, an array index is something that comes with additional bounds checking overheads with every use. And the memory it points to is also constrained - the entire array has to be initialised, so if the index passes the bounds check you are guaranteed rusts memory consistency invariants are preserved. Indexes also allow you to escape the borrow checker. If you own the slice, there is no need to prove you can access an element of the slice.
So yeah, you can use indexes instead of pointers, but for rust that's like saying you can use recursion instead of iteration. Indexing and pointers are two very different things in rust.
If bounds checks prove to be a problem, you can explicitly elide them. Indeed, Rust's regex does just that. :-)
You could instead go with a derivatives approach.
About large character classes: how are those harder than in approaches? If you build any FSM you have to deal with those, don't you?
One way to handle them that works well when the characters in your classes are mostly next to each other unicode, is to express your state transition function as an 'interval map'
What I mean is that eg a hash table or an array lets you build representations of mathematical functions that map points to values.
You want something that can model a step function.
You can either roll your own, or write something around a sorted-map data structure.
Eg in C++ you'd base the whole thing around https://en.cppreference.com/w/cpp/container/map/upper_bound (or https://hackage.haskell.org/package/containers-0.4.0.0/docs/... in Haskell.)
The keys in your sorted map are the 'edges' of your characters classes (eg where they start and end).
Does that make sense? Or am I misunderstanding the problem?
> I personally always get stuck at how to handle things like captures [...]
Let me think about that one for a while. Some Googling suggests https://github.com/elfsternberg/barre but they don't seem to support intersection, complement or stripping prefixes.
What do you want your capture groups to do? Do you eg just want to return pointers to where you captured them (if any)?
I have an inkling that something inspired by https://en.wikipedia.org/wiki/Viterbi_algorithm might work.
https://github.com/google/redgrep/blob/main/parser.yy mentions something about capture, but not sure if that has anything to do with capture groups.