undefined | Better HN

0 pointsburntsushi5y ago0 comments

To be clear, ripgrep has no runtime dependency on any LLVM or C++ library. rustc does.

0 comments

The interesting thing here is that rust has good threading and fantastic crates

I played with making a regex library in rust. Which, as per RE2 design involves constructing graphs and glueing them together as the regex is traversed

This requires a cycle catching gc, or, just a preallocated arena... It was my first foray into rust and felt I would need to be hitting into unsafe, which I wasn't ready for. Array indexing might decompose into an arena, but syntactically just a bit messier (imho)

Would be interesting to see how the RE2 does it in rust (didn't know that)

I like how the article shows both sides of the fence, it makes me realize:

I get a lot of optimizations from ptr stuffing in c. But sometimes we should lay down the good, for the better

burntsushiOP5y ago

You're overcomplicating it. When it comes to finite state machines at least, it's very easy to use an ID index instead of the raw pointer itself. That's exactly what the regex crate does.

For reference, I am also the author of the regex crate. The only unsafe it uses specific to finite automata is to do explicit elimination of bounds checks in the core hybrid NFA/DFA loop.

rstuart41335y ago

> When it comes to finite state machines at least, it's very easy to use an ID index instead of the raw pointer itself.

As an old C programmer, the difference between an array index and a pointer caught me by surprise. In C a pointer is just an unchecked offset into memory. A real array index is just a unchecked offset into ... maybe a smaller chunk of raw memory.

But in rust, an array index is something that comes with additional bounds checking overheads with every use. And the memory it points to is also constrained - the entire array has to be initialised, so if the index passes the bounds check you are guaranteed rusts memory consistency invariants are preserved. Indexes also allow you to escape the borrow checker. If you own the slice, there is no need to prove you can access an element of the slice.

So yeah, you can use indexes instead of pointers, but for rust that's like saying you can use recursion instead of iteration. Indexing and pointers are two very different things in rust.

burntsushiOP5y ago

I guess so. But note that I didn't equate them. I just said that you can use an ID index instead. For the particular program of FSMs, they work very well.

If bounds checks prove to be a problem, you can explicitly elide them. Indeed, Rust's regex does just that. :-)

eru5y ago

> I played with making a regex library in rust. Which, as per RE2 design involves constructing graphs and glueing them together as the regex is traversed

You could instead go with a derivatives approach.

https://en.wikipedia.org/wiki/Brzozowski_derivative

burntsushiOP5y ago

Has anyone built a production grade regex engine using derivatives? I don't think I've seen one. I personally always get stuck at how to handle things like captures or the very large Unicode character classes. Or hacking in look-around. (It's been a while since I've given this thought though, so I'm not sure I'll be able to elaborate much.)

eru5y ago

I've made some attempts, but nothing production grade.

About large character classes: how are those harder than in approaches? If you build any FSM you have to deal with those, don't you?

One way to handle them that works well when the characters in your classes are mostly next to each other unicode, is to express your state transition function as an 'interval map'

What I mean is that eg a hash table or an array lets you build representations of mathematical functions that map points to values.

You want something that can model a step function.

You can either roll your own, or write something around a sorted-map data structure.

Eg in C++ you'd base the whole thing around https://en.cppreference.com/w/cpp/container/map/upper_bound (or https://hackage.haskell.org/package/containers-0.4.0.0/docs/... in Haskell.)

The keys in your sorted map are the 'edges' of your characters classes (eg where they start and end).

Does that make sense? Or am I misunderstanding the problem?

> I personally always get stuck at how to handle things like captures [...]

Let me think about that one for a while. Some Googling suggests https://github.com/elfsternberg/barre but they don't seem to support intersection, complement or stripping prefixes.

What do you want your capture groups to do? Do you eg just want to return pointers to where you captured them (if any)?

I have an inkling that something inspired by https://en.wikipedia.org/wiki/Viterbi_algorithm might work.

https://github.com/google/redgrep/blob/main/parser.yy mentions something about capture, but not sure if that has anything to do with capture groups.

1 more reply

j / k navigate · click thread line to collapse

0 comments

tsegratis5y ago

The interesting thing here is that rust has good threading and fantastic crates

I played with making a regex library in rust. Which, as per RE2 design involves constructing graphs and glueing them together as the regex is traversed

Would be interesting to see how the RE2 does it in rust (didn't know that)

I like how the article shows both sides of the fence, it makes me realize:

I get a lot of optimizations from ptr stuffing in c. But sometimes we should lay down the good, for the better

burntsushiOP5y ago

You're overcomplicating it. When it comes to finite state machines at least, it's very easy to use an ID index instead of the raw pointer itself. That's exactly what the regex crate does.

For reference, I am also the author of the regex crate. The only unsafe it uses specific to finite automata is to do explicit elimination of bounds checks in the core hybrid NFA/DFA loop.

rstuart41335y ago

> When it comes to finite state machines at least, it's very easy to use an ID index instead of the raw pointer itself.

So yeah, you can use indexes instead of pointers, but for rust that's like saying you can use recursion instead of iteration. Indexing and pointers are two very different things in rust.

burntsushiOP5y ago

I guess so. But note that I didn't equate them. I just said that you can use an ID index instead. For the particular program of FSMs, they work very well.

If bounds checks prove to be a problem, you can explicitly elide them. Indeed, Rust's regex does just that. :-)

eru5y ago

> I played with making a regex library in rust. Which, as per RE2 design involves constructing graphs and glueing them together as the regex is traversed

You could instead go with a derivatives approach.

https://en.wikipedia.org/wiki/Brzozowski_derivative

burntsushiOP5y ago

eru5y ago

I've made some attempts, but nothing production grade.

About large character classes: how are those harder than in approaches? If you build any FSM you have to deal with those, don't you?

One way to handle them that works well when the characters in your classes are mostly next to each other unicode, is to express your state transition function as an 'interval map'

What I mean is that eg a hash table or an array lets you build representations of mathematical functions that map points to values.

You want something that can model a step function.

You can either roll your own, or write something around a sorted-map data structure.

Eg in C++ you'd base the whole thing around https://en.cppreference.com/w/cpp/container/map/upper_bound (or https://hackage.haskell.org/package/containers-0.4.0.0/docs/... in Haskell.)

The keys in your sorted map are the 'edges' of your characters classes (eg where they start and end).

Does that make sense? Or am I misunderstanding the problem?

> I personally always get stuck at how to handle things like captures [...]

Let me think about that one for a while. Some Googling suggests https://github.com/elfsternberg/barre but they don't seem to support intersection, complement or stripping prefixes.

What do you want your capture groups to do? Do you eg just want to return pointers to where you captured them (if any)?

I have an inkling that something inspired by https://en.wikipedia.org/wiki/Viterbi_algorithm might work.

https://github.com/google/redgrep/blob/main/parser.yy mentions something about capture, but not sure if that has anything to do with capture groups.

1 more reply

j / k navigate · click thread line to collapse