eru 5y ago

> No, it doesn't. It's only theoretically easy to implement.

Oh, I didn't say anything about easy! I am on and off working on a Haskell re-implementation (but with GADTs and in Oleg's tagless final interpreter style etc, so it's more about exploring the type system).

> In practice, they explode the size of the underlying FSM.

You may be right, but that's still better than the gymnastics you'd have to do by hand to get the same features out of a 'normal' regex.

> Moreover, in a command line tool, it's somewhat easy to work around that through the `-v` switch and shell pipelining.

Alas, that only works if your intersection or complement happens at the top level. You can't do something like

(A & not B) followed by (C & D)

that way.
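To make the nested case concrete, here is a minimal sketch of a Brzozowski-derivative matcher in Rust where intersection and complement are ordinary constructors, so they can sit anywhere in the expression. All names (`Re`, `deriv`, `matches`, `example`, ...) are made up for illustration, and the choices of A, B, C and D are assumptions. This is not how redgrep does it: redgrep compiles to an automaton, while this sketch just rewrites the term once per input character.

```rust
// Illustrative sketch only: names and structure are assumptions, not redgrep's API.
#[derive(Clone, Debug, PartialEq)]
enum Re {
    Empty,                 // matches no string at all
    Eps,                   // matches only the empty string
    Any,                   // matches any single character
    Chr(char),             // matches exactly this character
    Cat(Box<Re>, Box<Re>), // concatenation ("followed by")
    Alt(Box<Re>, Box<Re>), // union
    And(Box<Re>, Box<Re>), // intersection
    Not(Box<Re>),          // complement
    Star(Box<Re>),         // Kleene star
}
use Re::*;

// Smart constructors with a few cheap simplifications.
fn cat(a: Re, b: Re) -> Re {
    match (a, b) {
        (Empty, _) | (_, Empty) => Empty,
        (Eps, r) | (r, Eps) => r,
        (a, b) => Cat(Box::new(a), Box::new(b)),
    }
}
fn alt(a: Re, b: Re) -> Re {
    match (a, b) {
        (Empty, r) | (r, Empty) => r,
        (a, b) => Alt(Box::new(a), Box::new(b)),
    }
}
fn and(a: Re, b: Re) -> Re {
    match (a, b) {
        (Empty, _) | (_, Empty) => Empty,
        (a, b) => And(Box::new(a), Box::new(b)),
    }
}
fn not(r: Re) -> Re {
    match r {
        Not(inner) => *inner, // complement of the complement is the original
        other => Not(Box::new(other)),
    }
}
fn star(r: Re) -> Re { Star(Box::new(r)) }
fn lit(s: &str) -> Re { s.chars().fold(Eps, |acc, c| cat(acc, Chr(c))) }

// Does the expression accept the empty string?
fn nullable(r: &Re) -> bool {
    match r {
        Empty | Any | Chr(_) => false,
        Eps | Star(_) => true,
        Cat(a, b) | And(a, b) => nullable(a) && nullable(b),
        Alt(a, b) => nullable(a) || nullable(b),
        Not(a) => !nullable(a),
    }
}

// Derivative of `r` with respect to `c`: what may still follow after reading `c`.
fn deriv(r: &Re, c: char) -> Re {
    match r {
        Empty | Eps => Empty,
        Any => Eps,
        Chr(d) => if *d == c { Eps } else { Empty },
        Cat(a, b) => {
            let left = cat(deriv(a, c), (**b).clone());
            if nullable(a) { alt(left, deriv(b, c)) } else { left }
        }
        Alt(a, b) => alt(deriv(a, c), deriv(b, c)),
        And(a, b) => and(deriv(a, c), deriv(b, c)),
        Not(a) => not(deriv(a, c)),
        Star(a) => cat(deriv(a, c), r.clone()),
    }
}

fn matches(r: &Re, s: &str) -> bool {
    let mut cur = r.clone();
    for c in s.chars() {
        cur = deriv(&cur, c);
    }
    nullable(&cur)
}

// (A & not B) followed by (C & D), with assumed meanings:
// A = anything, B = contains 'x', C = starts with 'a', D = ends with 'z'.
fn example() -> Re {
    let contains_x = cat(star(Any), cat(lit("x"), star(Any)));
    let first = and(star(Any), not(contains_x));
    let second = and(cat(lit("a"), star(Any)), cat(star(Any), lit("z")));
    cat(first, second)
}
```

With these definitions, `matches(&example(), "bbaz")` should accept (split as `bb` + `az`), while `"bxaz"` should be rejected, because the only split that starts the second half with `a` leaves an `x` in the first half. Note that the sketch does not normalise terms, so they can grow with the input and the runtime is not linear; getting a finite automaton out of this needs a canonical form for derivatives.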

> Paul's talk introduced redgrep is amazing by the way. Give it a watch if you haven't yet: https://www.youtube.com/watch?v=Ukqb6nMjFyk

I have, and I agree!

Perhaps I'll try and implement a basic version of redgrep in Rust as an exercise. (I just want something that supports basically all the operations regular languages are closed under, and I don't care too much about speed, as long as the runtime complexity is linear.)

burntsushi 5y ago

Yeah sorry, I've gotten asked this question a lot. The issue is that building a production grade regex engine---even when it's restricted to regular languages---requires a lot more engineering than theory. And these particular features just don't really pull their weight IMO. They are performance footguns, and IMO, are also tricky to reason about inside of regex syntax.

If you get something working, I'd love to look at it though! Especially if you're building in a tagless final interpreter style. I find that approach extremely elegant.

eru (OP) 5y ago

For my current attempts, I bit off more than I could chew:

I tried to build a system that not only recognizes regular languages, but also serves as a parser for them (a la Parsec).

The latter approach pushes you to support something like fmap, but the whole derivatives-based approach needs more 'introspection', so supporting general mapping via fmap (ie arbitrary functions a->b) is out, and you can only support things that you have more control over than functions.

(And in general, I am doing bifunctors, because I want the complement of the complement to be the original thing.)

Sorry if that's a bit confused. If I were a better theoretician, I could probably work it out.

I haven't touched the code in a while. But recently I have thought about the theory some more. The Brzozowski derivative introduced the concept of multiplicative inverse of a string. I am working out the ramifications of extending that to the multiplicative inverse of arbitrary regular expressions. (The results might already be in the literature. I haven't looked much.)
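For reference, the 'inverse of a string' here is the usual left quotient. Written out in standard notation (the symbols L, R, S, u and a are my choice, nothing specific to my code), together with the two rules that make intersection and complement fit the derivative approach so neatly:

```latex
% Left quotient of a language L by a string u (the Brzozowski derivative):
u^{-1} L = \{\, v \mid uv \in L \,\}

% For a single character a, the derivative commutes with intersection and complement:
a^{-1}(R \cap S) = a^{-1}R \cap a^{-1}S
\qquad
a^{-1}(\overline{R}) = \overline{a^{-1}R}
```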

I don't expect anything groundbreaking to come out of that, but I hope my understanding will improve.

> And these particular features just don't really pull their weight IMO. They are performance footguns, and IMO, are also tricky to reason about inside of regex syntax.

Well, in theory I could 'just' write a preprocessor that takes my regex with intersection and complement and translates it to a more traditional one. I wouldn't care too much if that's not very efficient.

I'm interested in those features because of the beauty of the theory, but it would also help make production regular expressions more modular.

Eg suppose you have a regular expression that decides what's a valid username for someone signing up to your system. You decide to use email addresses as your usernames, so the main qualification is that users can receive an email on it. But because they will be visible to other users, you have some additional requirements:

'.{0,100} & [^@]*@[^@]* & not (.*(root|admin|<some offensive term>).*@.*) & not (.*<sql injection>.*)'

That's a silly example. I think in production, I would be more likely to see something as complicated as this in eg some ad-hoc log parsing.

> The issue is that building a production grade regex engine---even when it's restricted to regular languages---requires a lot more engineering than theory.

Amen to that!
