Calculate the difference and intersection of any two regexes

(phylactery.org)

353 points | posco | 2y ago | 117 comments

JoelJacobson2y ago

I created a similar regex web demo that shows how a regex is parsed -> NFA -> DFA -> minimal DFA, and finally outputs LLVM IR/JavaScript/WebAssembly from the minimal DFA:

http://compiler.org/reason-re-nfa/src/index.html

eru2y ago

Though going from NFA to explicit DFA isn't always a good idea.

Btw, you might also like looking into the Brzozowski derivative https://en.wikipedia.org/wiki/Brzozowski_derivative which can be used as an alternative way to match regular expressions.
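A minimal sketch of derivative-based matching, to make the idea concrete. The tuple encoding and every helper name (`char`, `seq`, `alt`, `star`, `nullable`, `deriv`, `matches`) are made up for this illustration, and no simplification of intermediate regexes is attempted, so it is only suitable for short inputs:

```python
# Sketch of regex matching with Brzozowski derivatives.
# Regexes are plain tuples; all names here are illustrative assumptions.
EMPTY = ("empty",)  # matches nothing
EPS = ("eps",)      # matches only the empty string

def char(c): return ("char", c)
def seq(r, s): return ("seq", r, s)
def alt(r, s): return ("alt", r, s)
def star(r): return ("star", r)

def nullable(r):
    """True if r matches the empty string."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "char"):
        return False
    if tag == "seq":
        return nullable(r[1]) and nullable(r[2])
    return nullable(r[1]) or nullable(r[2])  # alt

def deriv(r, c):
    """Regex matching every w such that r matches c + w."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "char":
        return EPS if r[1] == c else EMPTY
    if tag == "alt":
        return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == "star":
        return seq(deriv(r[1], c), r)
    # seq: consume c from the head, or skip the head if it is nullable
    head = seq(deriv(r[1], c), r[2])
    return alt(head, deriv(r[2], c)) if nullable(r[1]) else head

def matches(r, text):
    """Take one derivative per input character, then test for nullability."""
    for c in text:
        r = deriv(r, c)
    return nullable(r)

# (ab)*a
pattern = seq(star(seq(char("a"), char("b"))), char("a"))
```

Here `matches(pattern, "aba")` holds while `matches(pattern, "ab")` does not; no automaton is ever built.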

alphablended2y ago

I think it is also worth mentioning that the site linked at the top uses the Antimirov extension of Brzozowski's work on regex derivatives.


mikhailfranco2y ago

You could implement the NFA directly with concurrent exploration of all paths:

https://github.com/mike-french/myrex

oever2y ago

This library can be used to create string class hierarchies. That, in turn, can encourage wider use of typed strings.

For example, e-mail addresses and URLs each have a special syntax. Their value space is a subset of all non-empty strings, which in turn is a subset of all strings.

An e-mail address could be passed into a function that requires a non-empty string as input. When the type-system knows that an e-mail string is a subclass of non-empty string, it knows that an email address is valid.

This library can be used to check the definitions and hierarchy of such string types. The implementation of the hierarchy differs per programming language (subclassing, trait bounds, etc).

1-more2y ago

In languages with tagged union types you do this a lot! Some Haskell pseudocode for ya

    module Email (Address, fromText, toText) where -- note we do not export the constructor of Address, just the type

    import Data.Text (Text)
    import qualified Data.Text as T

    data Address = Address Text

    fromText :: Text -> Maybe Address
    fromText t
        -- you'd do your validation in here and return Nothing if it's a bad address.
        -- Signal validity out of band, not in band with the data.
        | T.any (== '@') t = Just (Address t) -- placeholder check only, not real validation
        | otherwise = Nothing

    toText :: Address -> Text
    toText (Address addr) = addr -- for when you need to output it somewhere

bradrn2y ago

Pedantic note: ‘Address’ should really be a ‘newtype’…


alexeldeib2y ago

> Signal validity out of band, not in band with the data.

Could you expand on this?


croes2y ago

>An e-mail address could be passed into a function that requires a non-empty string as input. When the type-system knows that an e-mail string is a subclass of non-empty string, it knows that an email address is valid.

Don't use regex for email address validation

https://news.ycombinator.com/item?id=31092912

usrusr2y ago

Nothing like a dive into the wondrous world of what is and isn't allowed in an email address left of the @ on a warm late-summer morning. It's one of the mysteries of the modern world. The simple heuristic that proposes that every regex trying to express "valid email address" is wrong is a sufficiently safe bet, but it ruins all the fun.


_a_a_a_2y ago

> Their value space...

wossis mean? TIA

Edit: instead of downvoting, try answering. I'd like to know. TIA{2}

umanwizard2y ago

People are downvoting you because quirky/jokey super-colloquial language like “wossis mean? TIA” is hard to understand, and also just doesn’t really mesh with the vibe of the site.


oever2y ago

Value space is the set of values a type can have. A boolean has only two values in its value space. An unsigned byte has 256 possible values, so does a signed byte.

A string enumeration has a limited number of values. E.g. type A ("Yes" | "No" | "Maybe") has three values and is a superset of type B ("Yes" | "No"). A function that accepts type A can also accept type B as valid input.

If the value space is defined by a regular expression, as is often the case, the mentioned library could be used to check, at compile time, which types are subsets of others.


brianpan2y ago

If I hadn't seen your edit, I might have downvoted the comment for not being intelligible.

klysm2y ago

Regular expressions are a great example of bundling up some really neat and complex mathematical theory into a valuable interface. Linear algebra feels similar to me.

dhosek2y ago

It always amazes me how given the appropriate field, so much math can be transformed into linear algebra. Even Möbius transformations on the complex plane w=(az+b)/(cz+d) can be turned into linear algebra.
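To spell out the Möbius example: writing $f_M$ for the transformation with coefficient matrix $M$ (a notation introduced here, not in the comment), the map acts linearly on homogeneous coordinates, and composing two transformations corresponds to multiplying their matrices:

```latex
\[
w = \frac{az+b}{cz+d}
\quad\longleftrightarrow\quad
M = \begin{pmatrix} a & b \\ c & d \end{pmatrix},
\qquad ad - bc \neq 0,
\]
\[
M \begin{pmatrix} z \\ 1 \end{pmatrix}
= \begin{pmatrix} az+b \\ cz+d \end{pmatrix}
\sim \begin{pmatrix} \dfrac{az+b}{cz+d} \\ 1 \end{pmatrix},
\qquad
f_{M_2} \circ f_{M_1} = f_{M_2 M_1}.
\]
```

Here $\sim$ means equality up to a nonzero scalar factor, which is why $M$ and $\lambda M$ describe the same transformation.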

pishpash2y ago

Linear transformations preserve the structure of the space so you can keep applying them. It's not surprising that you can always find some "space-preserving" part of a problem and fold the rest (the "non-linear" structure) into transformations or the definition of the space itself.


pishpash2y ago

That usually means the representation is getting close to the truth. Good interfaces have intrinsic value, which many result-focused people do not appreciate.

abecedarius2y ago

iirc connections with linear algebra come up in Conway's https://store.doverpublications.com/0486485838.html (which I only skimmed).

Jaxan2y ago

There is a whole field of “weighted automata” which combine linear algebra and automata theory.

poscoOP2y ago

The amazing page computes binary relations between pairs of regular expressions and shows a graphical representation of the DFA.

It’s a really incredible demonstration of some highly non-trivial operations on regular expressions.

vintermann2y ago

It's very cool, but also no wonder that it doesn't support all those features of regexes which technically make them not regular expressions anymore. Though, I would have thought ^ and $ anchors shouldn't be a problem?

rntz2y ago

^ and $ are a problem, although one with a workaround.

The standard theory of regular expressions focuses entirely on regex matching, rather than searching. For matching, ^ and $ don't really mean anything. In particular, regexp theory is defined in terms of the "language of" a regexp: the set of strings which match it. What's the set of strings that "^" matches? Well, it's the empty string, but only if it comes at the beginning of a line (or sometimes the beginning of the document). This beginning-of-line constraint doesn't fit nicely into the "a regexp is defined by its language/set of strings" theory, much the same way lookahead/lookbehind assertions don't quite fit the theory of regular expressions.

The standard workaround is to augment your alphabet with special beginning/end-of-line characters (or beginning/end-of-document), and say that "^" matches the beginning-of-line character.

teraflop2y ago

This page implements regex matching, not searching. So in effect, every pattern has an implicit ^ at the beginning and $ at the end.
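The matching/searching distinction can be seen with Python's standard `re` module. This is only an illustration of the two semantics; it says nothing about how the page itself is implemented, and the pattern `b+` is an arbitrary example:

```python
import re

pattern = r"b+"

# Searching: the pattern may match anywhere inside the input.
found = re.search(pattern, "abba")

# Matching: the whole input must belong to the pattern's language.
whole = re.fullmatch(pattern, "abba")

# Searching with explicit anchors behaves like matching.
anchored = re.search(r"^(?:b+)$", "abba")

# Padding with .* on both sides recovers search semantics under fullmatch.
padded = re.fullmatch(r".*(?:b+).*", "abba")
```

`found` and `padded` are match objects, while `whole` and `anchored` are `None`.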

o11c2y ago

A lack of `^` is equivalent to prepending `(.*)`, then trimming the match span to the end of that capture. And similarly for a lack of `$` (but suddenly I remember how nasty Python was before `.fullmatch` was added ...).

More interesting is word boundaries:

`\b` is just `\<|\>` though that should be bubbled up and usually only one side will actually produce a matchable regex.

`A\<B` is just `(A&\W)(\w&B)`, and similar for `\>`.


abareplace2y ago

The double quote (") is also broken. If you use it in the regex, then no DFA is displayed.

Sharlin2y ago

As ^ and $ are implicit, you can opt out of them simply by affixing `.*`.


est2y ago

Ha, trying to paste "regex filter numbers divisible by 3" and the page froze to death https://stackoverflow.com/q/10992279/41948

    ^(?:[0369]+|[147](?:[0369]*[147][0369]*[258])*(?:[0369]*[258]|[0369]*[147][0369]*[147])|[258](?:[0369]*[258][0369]*[147])*(?:[0369]*[147]|[0369]*[258][0369]*[258]))+$

    ^([0369]|[147][0369]*[258]|(([258]|[147][0369]*[147])([0369]|[258][0369]*[147])*([147]|[258][0369]*[258])))+$

I wonder if there's a shortest one.
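A quick brute-force check of the second pattern against ordinary arithmetic, using Python's `re` (a backtracking engine, so the anchors are fine here). The helper name `agrees_with_mod3` is made up, and every `*` in the pattern is read as a plain Kleene star:

```python
import re

# Second divisibility-by-3 pattern from the comment above,
# with every '*' taken as an ordinary Kleene star.
DIV3 = re.compile(
    r"^([0369]|[147][0369]*[258]|(([258]|[147][0369]*[147])"
    r"([0369]|[258][0369]*[147])*([147]|[258][0369]*[258])))+$"
)

def agrees_with_mod3(limit):
    """Check the regex against n % 3 == 0 for every 0 <= n < limit."""
    return all(
        (DIV3.match(str(n)) is not None) == (n % 3 == 0)
        for n in range(limit)
    )
```

The pattern is a hand-unrolled three-state automaton that tracks the digit sum modulo 3, which is why it grows so quickly.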

abareplace2y ago

The web page hangs on regular expressions that produce a DFA with a lot of states. For example, these ones:

(ab+c+)+

(abc){100}

a.*quick brown fox jumps over the lazy dog
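For a sense of how quickly determinization can blow up, here is a sketch of the subset construction on the textbook pattern `(a|b)*a(a|b){n}`. This is not one of the patterns listed above and not the page's own code; the function name and the NFA encoding are assumptions made for the example:

```python
def dfa_state_count(n, alphabet="ab"):
    """Count DFA states reachable by subset construction for (a|b)*a(a|b){n}."""
    # NFA states: 0 loops on every symbol and guesses which 'a' is the marked one;
    # states 1..n+1 count the n symbols after it; n+1 is accepting and has no exits.
    def step(state, sym):
        targets = set()
        if state == 0:
            targets.add(0)
            if sym == "a":
                targets.add(1)
        elif state <= n:
            targets.add(state + 1)
        return targets

    start = frozenset({0})
    seen = {start}
    todo = [start]
    while todo:
        subset = todo.pop()
        for sym in alphabet:
            nxt = frozenset(t for s in subset for t in step(s, sym))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)
```

The NFA has n + 2 states, but the count returned doubles with each increment of n, because the DFA has to remember which of the last n + 1 symbols were `a`.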

zamadatix2y ago

The page says it doesn't support anchors anyway.

layer82y ago

I wanted to see the intersection between syntactically valid URLs and email addresses, but just entering the URL regex (cf. below) already takes too long to process for the page.

[\-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([\-a-zA-Z0-9()@:%_+.~#?&//=]*)

(source: https://stackoverflow.com/a/3809435/623763)

d662y ago

expressions like (...){1,256} are very heavyweight and the scala JS code ends up timing out or crashing the browser.

if you replace that with (...)+ then it seems to work (at least for me). smaller expressions like (...){1,6} should be fine.

noduerme2y ago

Just wondering, what is it about testing repetition [a-z]{1,256} with an upper bound that's so heavy? Intuitively it feels like greedy testing [a-z]+ should actually be worse since it has to work back from the end of the input.


jepler2y ago

This is neat!

I was surprised then not surprised that the union & intersection REs it comes up with are not particularly concise. For example the two expressions "y.+" and ".+z" have a very simple intersection: "y.*z" (equality verified by the page, assuming I haven't typo'd anything). But the tool gives

    yz([^z][^z]*z|z)*|y[^z](zz*[^z]|[^z])*zz*

instead. I think there are reasons it gives the answer it does, and giving a minimal (by RE length in characters or whatever) regular expression is probably a lot harder.

ufo2y ago

I think one of the reasons is that ".+z" gets bigger and uglier after you convert it to a deterministic automaton.

daveFNbuck2y ago

They show the DFA for it on the site, it's 3 states. There's a starting state for the first . and then two states that transition back and forth between whether z was the last character or not.

I think what's actually happening here is that they're doing the intersection on the DFAs and then producing a regex from the resulting DFA. The construction of a regex from a DFA is where things get ugly and weird.
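A sketch of intersecting two DFAs with the product construction, applied to "y.+" and ".+z" from this subthread. The three-letter alphabet, the dictionary encoding, and all names are assumptions for the demo; whether the page works exactly this way is the guess made in the comment above, not something confirmed here:

```python
from itertools import product
import re

ALPHABET = "xyz"  # tiny alphabet chosen for the demo

def make_dfa(start, accepting, delta):
    return {"start": start, "accepting": accepting, "delta": delta}

# y.+ : a 'y', then at least one more symbol
Y_DOT_PLUS = make_dfa(0, {2}, {
    (0, "y"): 1,
    **{(1, c): 2 for c in ALPHABET},
    **{(2, c): 2 for c in ALPHABET},
})

# .+z : at least one symbol, then a final 'z' (three states, as described above)
DOT_PLUS_Z = make_dfa(0, {2}, {
    **{(0, c): 1 for c in ALPHABET},
    **{(s, c): (2 if c == "z" else 1) for s in (1, 2) for c in ALPHABET},
})

def intersect(a, b):
    """Product construction: run both DFAs in lockstep on pairs of states."""
    delta = {}
    for (sa, ca), ta in a["delta"].items():
        for (sb, cb), tb in b["delta"].items():
            if ca == cb:
                delta[((sa, sb), ca)] = (ta, tb)
    accepting = set(product(a["accepting"], b["accepting"]))
    return make_dfa((a["start"], b["start"]), accepting, delta)

def accepts(dfa, text):
    state = dfa["start"]
    for c in text:
        state = dfa["delta"].get((state, c))
        if state is None:  # a missing transition stands for the dead state
            return False
    return state in dfa["accepting"]

BOTH = intersect(Y_DOT_PLUS, DOT_PLUS_Z)

def same_as_y_dot_star_z(max_len):
    """Compare the product DFA with y.*z on every string up to max_len."""
    for n in range(max_len + 1):
        for letters in product(ALPHABET, repeat=n):
            text = "".join(letters)
            if accepts(BOTH, text) != bool(re.fullmatch(r"y.*z", text)):
                return False
    return True
```

The intersection itself is cheap and mechanical; turning the resulting automaton back into a compact regex is the separate, harder step.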

rsstack2y ago

I used this concept once to write the validation logic for an "IP RegEx filter" setting. The goal was to let users configure an IP filter using RegEx (no, marketing people don't get CIDRs, and they knew RegEx's from Google Analytics). How could I define a valid RegEx for this? The intersection with the RegEx of "all IPv4 addresses" is not empty, and not equal to the RegEx of "all IPv4 addresses". Prevented many complaints about the filter not doing anything, but of course didn't prevent wrong filters from being entered.

Etheryte2y ago

Wouldn't a simpler solution work here? Instead of trying to validate the filter regex, show some sample IP addresses or let the user insert a set of addresses, and then show which ones the filter matches and which ones it doesn't. Also helps address the problem of incorrect filters.

rsstack2y ago

The odds of the sample addresses matching are essentially zero, and adding work to the user is counterproductive.


pimlottc2y ago

Suggestion: turn off auto suggest in the regex input fields to make it more usable on mobile.

https://stackoverflow.com/questions/35513968/disable-autocor...

x-complexity2y ago

I used 2 similar divide-by-3 regexes to test the page (after removing the ^ and $ from their ends), and it froze up:

Regex 1: ([0369]|([258]|[147][0369]*[147])([0369]|([147][0369]*[258]|[258][0369]*[147]))*([147]|[258][0369]*[258])|([147]|[258][0369]*[258])([0369]|([147][0369]*[258]|[258][0369]*[147]))*([258]|[147][0369]*[147]))*

Regex 2: ([0369]|[258][0369]*[147]|(([147]|[258][0369]*[258])([0369]|[147][0369]*[258])*([258]|[147][0369]*[147])))*

Everything up until the last '*' is parsable. The moment I put in the *, the entire page freezes up.

Without the *, it produced a valid verifier for parsing chunks of digits whose sum mod 3 = 0.

emmanueloga_2y ago

One possible application: If an input to a function parameter must match a certain regex, and the output of a function produces results matching another regex, we can know if the functions are compatible: if the intersection of regular expressions is empty, then you cannot connect one function to the other.

Combined with the fact the regular expressions can be used not only on strings but more generally (e.g. for JSON schema validation [1]), this could be a possible implementation of static checks, similar to "design by contract".

1: https://www.balisage.net/Proceedings/vol23/html/Holstege01/B...

baggy_trough2y ago

I love how it looks like a CS textbook.

perihelions2y ago

The graphics look identical to those in Hopcroft & Ullman's "Introduction to Automata Theory, Languages, and Computation" (like the convention that they use a double-circle to denote accepting states). I imagine they're GraphViz-based: it's very easy [0] to draw these in GraphViz. I don't know what Hopcroft & Ullman used though, because that one was published in 1979, and GraphViz didn't exist before 1991. Suddenly I'm curious what the state of the art for vector diagrams was in 1979...?

[0] e.g. https://graphviz.org/Gallery/directed/fsm.html

therealcamino2y ago

Maybe something related to 'pic'? This doc on it is a revised version of a 1984 edition, so maybe it's a little too late, but there are references to other systems back to 1977 or so.

https://pikchr.org/home/uv/pic.pdf

cobbal2y ago

It has the look of graphviz about it, which is an excellent tool. Often helpful in debugging anything related to graphs.

https://graphviz.org/

simlevesque2y ago

Kinda related, but I'm looking for something that could give me the number of possible matching strings for a simple regex. Does such a tool exist?

contravariant2y ago

I feel like it shouldn't be too hard to calculate from the finite automaton that encodes the regular expression, but surely in most cases it will simply be infinite?

tetha2y ago

This is reaching back a long time. But the algorithm - if I recall right - is a simple DFS on the deterministic automaton for the regular expression and it can output the full set of matching strings if you're allowed to use *s in the output.

Basically, you need an accumulator of "stuff up to here". If you move from a node to a second node, you add the character annotating that edge to the accumulator. And whenever you end up with an edge to a visited node, you add a '*' and output that, and for leaf nodes, you output the accumulator.

And then you add a silly jumble of parentheses on entry and output to make it right. This was kinda simple to figure out with stuff like (a(ab)*b)* and such.

This is in O(states) for R and O(2^states) for NR if I recall right.

kadoban2y ago

Maybe the number of possible matchings for a given length (or range of lengths) might be interesting?


0823498723498722y ago

see https://www.cs.dartmouth.edu/~doug/nfa.pdf

d662y ago

the page actually does give these. for α := [a-z]{2,4} the page gives |α| = 475228.

however, as others have pointed out any non-trivial use of the kleene star means the result will be ∞. in this case the page will list numbers that roughly correspond to "number of strings with N applications of kleene star" in addition to infinity.
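The finite count quoted above can be reproduced by counting accepted strings per length on a DFA. This is a sketch under the assumption of a deterministic automaton (so each string follows exactly one path); the function name and the DFA encoding are made up here and are not taken from the page:

```python
def count_by_length(start, accepting, delta, alphabet, max_len):
    """Return [number of accepted strings of length n] for n = 0..max_len."""
    # weights[s] = number of strings of the current length that lead to state s
    weights = {start: 1}
    counts = []
    for _ in range(max_len + 1):
        counts.append(sum(w for s, w in weights.items() if s in accepting))
        nxt = {}
        for s, w in weights.items():
            for c in alphabet:
                t = delta.get((s, c))
                if t is not None:  # missing transition = dead state
                    nxt[t] = nxt.get(t, 0) + w
        weights = nxt
    return counts

# DFA for [a-z]{2,4}: the state is simply the number of letters read so far.
LETTERS = "abcdefghijklmnopqrstuvwxyz"
DELTA = {(i, c): i + 1 for i in range(4) for c in LETTERS}
COUNTS = count_by_length(0, {2, 3, 4}, DELTA, LETTERS, 6)
```

`COUNTS` holds 26^2, 26^3 and 26^4 at lengths 2 to 4 and zero elsewhere, and `sum(COUNTS)` is 475228, matching |α| above. For a language with a Kleene star the per-length counts stay finite even though the total does not.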

rntz2y ago

Here's a simple Haskell program to do it:

(EDIT: this code is completely wrongheaded and does not work; it assumes that when sequencing regexes, you can take the product of their sizes to find the overall size. This is just not true. See reply, below, for an example.)

    -- https://gist.github.com/rntz/03604e36888a8c6f08bb5e8c665ba9d0

    import qualified Data.List as List

    data Regex = Class [Char]   -- character class
               | Seq [Regex]    -- sequence, ABC
               | Choice [Regex] -- choice, A|B|C
               | Star Regex     -- zero or more, A*
                 deriving (Show)

    data Size = Finite Int | Infinite deriving (Show, Eq)

    instance Num Size where
      abs = undefined; signum = undefined; negate = undefined -- unnecessary
      fromInteger = Finite . fromInteger
      Finite x + Finite y = Finite (x + y)
      _ + _ = Infinite
      Finite x * Finite y = Finite (x * y)
      x * y = if x == 0 || y == 0 then 0 else Infinite

    -- computes size & language (list of matching strings, if regex is finite)
    eval :: Regex -> (Size, [String])
    eval (Class chars) = (Finite (length cset), [[c] | c <- cset])
      where cset = List.nub chars
    eval (Seq regexes) = (product sizes, concat <$> sequence langs)
      where (sizes, langs) = unzip $ map eval regexes
    eval (Choice regexes) = (size, lang)
      where (sizes, langs) = unzip $ map eval regexes
            lang = concat langs
            size = if elem Infinite sizes then Infinite
                   -- finite, so just count 'em. inefficient but works.
                   else Finite (length (List.nub lang))
    eval (Star r) = (size, lang)
      where (rsize, rlang) = eval r
            size | rsize == 0 = 1
                 | rsize == 1 && List.nub rlang == [""] = 1
                 | otherwise = Infinite
            lang = [""] ++ ((++) <$> [x | x <- rlang, x /= ""] <*> lang)

    size :: Regex -> Size
    size = fst . eval

NB. Besides the utter wrong-headedness of the `product` call, the generated string-sets may not be exhaustive for infinite languages, and the original version (I have since edited it) was wrong in several cases for Star (if the argument was nullable or empty).

sebzim45002y ago

Surely that fails for e.g. a?a?a?. I'd imagine you could do some sort of simplification first though to avoid this redundancy.
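The counterexample can be checked directly. A small sketch (variable names are illustrative) comparing the product of the factor sizes with the number of distinct strings `a?a?a?` actually matches:

```python
from itertools import product

# a? matches either the empty string or "a".
OPTIONAL_A = ["", "a"]

# Every way of choosing one alternative per factor of a?a?a?.
choices = list(product(OPTIONAL_A, repeat=3))
naive_size = len(choices)  # product of the factor sizes: 2 * 2 * 2

# Distinct strings produced: "", "a", "aa", "aaa".
language = {"".join(parts) for parts in choices}
true_size = len(language)
```

`naive_size` is 8 but `true_size` is 4, because different choices concatenate to the same string.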


mikhailfranco2y ago

Another interesting question is: how many possible successful matches are there for a given input string. For example:

How many ways can (a?){m}(a*){m} match the string a{m}

i.e. input m repetitions of the letter 'a'.

https://github.com/mike-french/myrex#ambiguous-example

The answer is a dot product of two vectors sliced from Pascal's Triangle.

For m=9, there are 864,146 successful matches.
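One way to arrive at that figure, as a sketch under an assumed counting model: a match is an assignment in which each `a?` group consumes zero or one letter and each `a*` group consumes any number, and two matches differ if any group consumes a different amount. The function name is hypothetical and the linked project may define its count differently, but this model reproduces the number quoted:

```python
from math import comb

def ambiguous_match_count(m):
    """Ways for (a?){m}(a*){m} to match m copies of 'a', under the model above."""
    total = 0
    for k in range(m + 1):
        optional_choices = comb(m, k)             # which k of the a? groups consume a letter
        star_splits = comb(2 * m - k - 1, m - 1)  # split the other m-k letters over m stars
        total += optional_choices * star_splits
    return total
```

Both factors are binomial coefficients, so the sum is a dot product of a row and a diagonal of Pascal's Triangle; `ambiguous_match_count(9)` gives 864146.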

someguy1010102y ago

might be something like this https://jvns.ca/blog/2016/04/24/how-regular-expressions-go-f... which refs https://en.wikipedia.org/wiki/Brzozowski_derivative

Drup2y ago

https://regex-generate.github.io/regenerate/ (I'm one of the authors) enumerates all the matching (and non-matching) strings, which incidentally answers the question, but doesn't terminate in the infinite case.

clord2y ago

I feel like it might be possible with dataflow analysis. Stepping through the regex maintaining a liveness set or something like that. Sort of like computing exemplar inputs, but with repetition as permitted exemplars. Honestly probably end up re-encoding the regex in some other format, perhaps with 'optimizations applied.'

stvltvs2y ago

The answer is usually an infinite number, except for very, very simple cases. Anything involving * for example means infinity is your answer.

skulk2y ago

I wonder if it makes sense to compute an "order type" for a regexp. For example, a* is omega, a*b* is 2 omega.

https://en.m.wikipedia.org/wiki/Order_type

https://en.wikipedia.org/wiki/Ordinal_number


pimlottc2y ago

What’s your use case?

simlevesque2y ago

Calculate how long it takes to bruteforce something matching a regexp.

_a_a_a_2y ago

Is there any definition of what 'difference and intersection of regexes' might actually mean?

I guess for regexes r1 and r2 this means the diff and intersect of their extensional sets, expressed intensionally as a regex. I guess. But nothing seems defined, including what ^ is, or > or whatever. It's not helpful.

d662y ago

  negation (~α): strings not matched by α
  difference (α - β): strings matched by α but not β
  intersection (α & β): strings matched by α and β
  exclusive-or (α ^ β): strings matched by α or β but not both
  inclusion (α > β): does α match all strings β matches?
  equality (α = β): do α and β match exactly the same strings?
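In set notation, writing $L(\alpha)$ for the set of strings matched by $\alpha$ and $\Sigma^{*}$ for all strings over the alphabet (both symbols are introduced here, and inclusion is read as non-strict), the list above amounts to:

```latex
\begin{align*}
L({\sim}\alpha)        &= \Sigma^{*} \setminus L(\alpha) \\
L(\alpha - \beta)      &= L(\alpha) \setminus L(\beta) = L(\alpha) \cap L({\sim}\beta) \\
L(\alpha \mathbin{\&} \beta) &= L(\alpha) \cap L(\beta) \\
L(\alpha \mathbin{\hat{}} \beta) &= \bigl(L(\alpha) \setminus L(\beta)\bigr) \cup \bigl(L(\beta) \setminus L(\alpha)\bigr) \\
\alpha > \beta         &\iff L(\beta) \subseteq L(\alpha) \\
\alpha = \beta         &\iff L(\alpha) = L(\beta)
\end{align*}
```

Every operation except the last two yields another regular language, which is why each result can again be written as a regex.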

less_less2y ago

Interesting. I think this problem is actually EXPSPACE-complete in general? But still has a straightforward algorithm.

https://en.wikipedia.org/wiki/EXPSPACE

DannyBee2y ago

It depends on your operators. For these, no.

Equivalence of DFA or NFA is PSPACE-complete by Savitch's theorem, regardless of time bound. As such, most types of regex equivalence are PSPACE-complete.

https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.89....

Has a detailed breakdown of operators vs complexity.

In particular, the paper cited in the expspace page is talking about allowing a squaring operator.

It is EXPSPACE complete if you allow squaring, but not if you use repetition.

IE it is expspace complete if you allow e^2, but not if you only allow ee.

DannyBee2y ago

Since this may be confusing at first (why does squaring buy you anything here) - the reason squaring makes it expspace complete is, basically, squaring allows you to express an exponentially large regex in less than exponential input size.

This in turn means polynomial space in the size of the input is no longer enough to deal with the regex.

If you only allow repetition, then an exponentially large regex requires exponential input size, and thus polynomial space in the size of the input still suffices to do equivalence.

This is generally true - operators that allow you to reduce the size of the input necessary to express a regex by a complexity class will usually increase the space complexity class necessary to determine equivalence by a corresponding amount.
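The size gap is easy to see by expanding nested squares into plain concatenation. A small sketch; the `(...)^2` notation and both function names are illustrative only:

```python
def expand_squares(base, k):
    """Rewrite base squared k times, (...((base^2)^2)...)^2, using only concatenation."""
    expr = base
    for _ in range(k):
        expr = expr + expr  # e^2 becomes ee
    return expr

def squared_notation(base, k):
    """The same regex written with a squaring operator."""
    return "(" * k + base + ")^2" * k
```

For `k = 10`, `squared_notation("a", 10)` is 41 characters long, while `expand_squares("a", 10)` is 1024 characters: the input with squaring grows linearly in `k`, the equivalent input without it grows as 2^k.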


blibble2y ago

it always bugged me as a student that had to sit through all those discrete maths lectures that standard regex libraries don't allow you to union/intersect two "compiled" regular expression objects together

(having to try them one at a time is pretty sad)
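With a backtracking library such as Python's `re`, a partial workaround is to combine the pattern sources rather than the compiled objects. A hedged sketch: `union` and `intersection` are hypothetical helpers, the intersection relies on a lookahead (an engine feature, not a regular-language operation), and patterns containing numbered groups, backreferences, or inline flags may interact badly:

```python
import re

def union(*patterns):
    """One compiled regex that accepts a string if any pattern fully matches it."""
    return re.compile("|".join(f"(?:{p})" for p in patterns))

def intersection(first, second):
    """Full-match intersection via a lookahead (a backtracking-engine trick)."""
    # The lookahead requires `first` to match up to the end of the input;
    # the rest of the pattern then requires `second` to match as well.
    return re.compile(rf"(?=(?:{first})\Z)(?:{second})")
```

Both results are meant to be used with `.fullmatch(...)`; for already compiled regexes, their `.pattern` attribute gives the source string to pass in.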

snoble2y ago

Oh neat, this is scala via scalajs.

hoten2y ago

On mobile: are the rectangle glyphs as suffixes on the states on purpose or am I missing a font?

progbits2y ago

The states are numbered, $\alpha_0, ..., \alpha_N$ and $\beta_0, ...$. You might be missing the font for the digits.

themusicgod12y ago

ugh STOP USING GITHUB

haltist2y ago

Can LLMs do this?

vore2y ago

I wouldn't use an LLM for anything that can be done 100% precisely, like this.

haltist2y ago

OK, just curious how LLMs are stacking up in logical tasks like this. I kept hearing we were close to AGI so just wondering how far there is to go.


j / k navigate · click thread line to collapse

117 comments

JoelJacobson2y ago

I created a similar regex web demo that shows how a regex is parsed -> NFA -> DFA -> minimal DFA, and finally outputs LLVMIR/Javascript/WebAssembly for from the minimal DFA:

http://compiler.org/reason-re-nfa/src/index.html

eru2y ago

Though going from NFA to explicit DFA isn't always a good idea.

Btw, you might also like looking into the Brzozowski derivative https://en.wikipedia.org/wiki/Brzozowski_derivative which can be used as an alternative way to match regular expressions.

alphablended2y ago

I think it is also worth mentioning that the site linked at the top uses the antimirov extension to brzozovzki work on regex deivatives.

1 more reply

mikhailfranco2y ago

You could implement the NFA directly with concurrent exploration of all paths:

https://github.com/mike-french/myrex

oever2y ago

This library can be used to create string class hierarchies. That, in turn, can help to use typed strings more.

For example, e-mails and urls are a special syntax. Their value space is a subset of all non-empty string which is a subset of all strings.

This library can be used to check the definitions and hierarchy of such string types. The implementation of the hierarchy differs per programming language (subclassing, trait boundaries, etc).

1-more2y ago

In languages with tagged union types you do this a lot! Some Haskell pseudocode for ya

    module Email (Address, fromText, toText) where -- note we do not export the constructor of Address, just the type

    data Address = Address Text

    fromString :: Text -> Maybe Address
    fromString =
        -- you'd do your validation in here and return Nothing if it's a bad address.
        -- Signal validity out of band, not in band with the data.

    toText :: Address -> Text
    toText (Address addr) = addr -- for when you need to output it somewhere

bradrn2y ago

Pedantic note: ‘Address’ should really be a ‘newtype’…

1 more reply

alexeldeib2y ago

> Signal validity out of band, not in band with the data.

Could you expand on this?

1 more reply

croes2y ago

Don't use regex for email address validation

https://news.ycombinator.com/item?id=31092912

usrusr2y ago

1 more reply

_a_a_a_2y ago

> Their value space...

wossis mean? TIA

Edit: instread of downvoting try answering. I'd like to know. TIA{2}

umanwizard2y ago

People are downvoting you because quirky/jokey super-colloquial language like “wossis mean? TIA” is hard to understand, and also just doesn’t really mesh with the vibe of the site.

1 more reply

oever2y ago

Value space is the set of values a type can have. A boolean has only two values in its value space. An unsigned byte has 256 possible values, so does a signed byte.

If the value space is defined by a regular expression, as is often the case, the mentioned library could be used to check, at compile-time, which type are subsets of others.

1 more reply

brianpan2y ago

If I hadn't seen your edit, I might have downvoted the comment for not being intelligible.

klysm2y ago

Regular expressions are a great example of bundling up some really neat and complex mathematical theory into a valuable interface. Linear algebra feels similar to me.

dhosek2y ago

pishpash2y ago

1 more reply

pishpash2y ago

That usually means the representation is getting close to the truth. Good interfaces have intrinsic value, which many result-focused people do not appreciate.

abecedarius2y ago

iirc connections with linear algebra come up in Conway's https://store.doverpublications.com/0486485838.html (which I only skimmed).

Jaxan2y ago

There is a whole field of “weighted automata” which combine linear algebra and automata theory.

poscoOP2y ago

The amazing page computes binary relations between pairs of regular expressions and shows a graphical representation of the DFA.

It’s a really incredible demonstration of some highly non-trivial operations on regular expressions.

vintermann2y ago

rntz2y ago

^ and $ are a problem, although one with a workaround.

The standard workaround is to augment your alphabet with special beginning/end-of-line characters (or beginning/end-of-document), and say that "^" matches the beginning-of-line character.

teraflop2y ago

This page implements regex matching, not searching. So in effect, every pattern has an implicit ^ at the beginning and $ at the end.

o11c2y ago

More interesting is word boundaries:

`\b` is just `\<|\>` though that should be bubbled up and usually only one side will actually produce a matchable regex.

`A\<B` is just `(A&\W)(\w&B)`, and similar for `\>`.

1 more reply

abareplace2y ago

The double quote (") is also broken. If you use it in the regex, then no DFA is displayed.

Sharlin2y ago

As ^ and $ are implicit, you can opt out of them simply by affixing `.*`.

1 more reply

est2y ago

Ha, trying to paste "regex filter numbers divisible by 3" and the page froze to death https://stackoverflow.com/q/10992279/41948

    ^(?:[0369]+|[147](?:[0369]*[147][0369]*[258])*(?:[0369]*[258]|[0369]*[147][0369]*[147])|[258](?:[0369]*[258][0369]*[147])*(?:[0369]*[147]|[0369]*[258][0369]*[258]))+$

    ^([0369]|[147][0369]*[258]|(([258]|[147][0369]*[147])([0369]|[258][0369]*[147])*([147]|[258][0369]\*[258])))+$

I wonder if there's a shortest one.

abareplace2y ago

The web page hangs on the regular expressions that produce a DFA with a lot of states. For example, these ones:

(ab+c+)+

(abc){100}

a.*quick brown fox jumps over the lazy dog

zamadatix2y ago

The page says it doesn't support anchors anyway.

layer82y ago

I wanted to see the intersection between syntactically valid URLs and email addresses, but just entering the URL regex (cf. below) already takes too long to process for the page.

[\-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([\-a-zA-Z0-9()@:%_+.~#?&//=]*)

(source: https://stackoverflow.com/a/3809435/623763)

d662y ago

expressions like (...){1,256} are very heavyweight and the scala JS code ends up timing out or crashing the browser.

if you replace that with (...)+ then it seems to work (at least for me). smaller expressions like (...){1,6} should be fine.

noduerme2y ago

2 more replies

jepler2y ago

This is neat!

    yz([^z][^z]*z|z)*|y[^z](zz*[^z]|[^z])*zz*

instead. I think there are reasons it gives the answer it does, and giving a minimal (by RE length in characters or whatever) regular expression is probably a lot harder.

ufo2y ago

I think one of the reasons is the ".+z" gets bigger and uglier after you convert it to a deterministic automaton.

daveFNbuck2y ago

They show the DFA for it on the site, it's 3 states. There's a starting state for the first . and then two states that transition back and forth between whether z was the last character or not.

rsstack2y ago

Etheryte2y ago

rsstack2y ago

The odds of the sample addresses matching is essentially zero, and adding work to the user is counterproductive.

1 more reply

pimlottc2y ago

Suggestion: turn off auto suggest in the regex input fields to make it more usable on mobile.

https://stackoverflow.com/questions/35513968/disable-autocor...

x-complexity2y ago

I used 2 similar divide-by-3 regexes to test the page (after removing the ^ and $ to their ends), and it froze up:

Regex 2: ([0369]|[258][0369]*[147]|(([147]|[258][0369]*[258])([0369]|[147][0369]*[258])*([258]|[147][0369]*[147])))*

Everything up until the last '*' is parsable. The moment I put in the *, the entire page freezes up.

Without the *, it produced a valid verifier for parsing chunks of digits whose sum mod 3 = 0.

emmanueloga_2y ago

1: https://www.balisage.net/Proceedings/vol23/html/Holstege01/B...

baggy_trough2y ago

I love how it looks like a CS textbook.

perihelions2y ago

[0] e.g. https://graphviz.org/Gallery/directed/fsm.html

therealcamino2y ago

Maybe something related to 'pic'? This doc on it is a revised version of a 1984 edition, so maybe it's a little too late, but there are references to other systems back to 1977 or so.

https://pikchr.org/home/uv/pic.pdf

cobbal2y ago

It has the look of graphviz about it, which is an excellent tool. Often helpful in debugging anything related to graphs.

https://graphviz.org/

simlevesque2y ago

Kinda related but I'm looking for something that could give me the number of possible matching strings for a simple regex. Does such a tool exist ?

contravariant2y ago

I feel like it shouldn't be too hard to calculate from the finite automaton that encodes the regular expression, but surely in most cases it will simply be infinite?

tetha2y ago

And then you add a silly jumble of parenthesis on entry and output to make it right. This was kinda simple to figure out with stuff like (a(ab)*b)* and such.

This is in O(states) for R and O(2^states) for NR if I recall right.

kadoban2y ago

Maybe the number of possible matchings for a given length (or range of lengths) might be interesting?

2 more replies

0823498723498722y ago

see https://www.cs.dartmouth.edu/~doug/nfa.pdf

d662y ago

the page actually does give these. for α := [a-z]{2,4} the page gives |α| = 475228.

rntz2y ago

Here's a simple Haskell program to do it:

    -- https://gist.github.com/rntz/03604e36888a8c6f08bb5e8c665ba9d0

    import qualified Data.List as List

    data Regex = Class [Char]   -- character class
               | Seq [Regex]    -- sequence, ABC
               | Choice [Regex] -- choice, A|B|C
               | Star Regex     -- zero or more, A*
                 deriving (Show)

    data Size = Finite Int | Infinite deriving (Show, Eq)

    instance Num Size where
      abs = undefined; signum = undefined; negate = undefined -- unnecessary
      fromInteger = Finite . fromInteger
      Finite x + Finite y = Finite (x + y)
      _ + _ = Infinite
      Finite x * Finite y = Finite (x * y)
      x * y = if x == 0 || y == 0 then 0 else Infinite

    -- computes size & language (list of matching strings, if regex is finite)
    eval :: Regex -> (Size, [String])
    eval (Class chars) = (Finite (length cset), [[c] | c <- cset])
      where cset = List.nub chars
    eval (Seq regexes) = (product sizes, concat <$> sequence langs)
      where (sizes, langs) = unzip $ map eval regexes
    eval (Choice regexes) = (size, lang)
      where (sizes, langs) = unzip $ map eval regexes
            lang = concat langs
            size = if elem Infinite sizes then Infinite
                   -- finite, so just count 'em. inefficient but works.
                   else Finite (length (List.nub lang))
    eval (Star r) = (size, lang)
      where (rsize, rlang) = eval r
            size | rsize == 0 = 1
                 | rsize == 1 && List.nub rlang == [""] = 1
                 | otherwise = Infinite
            lang = [""] ++ ((++) <$> [x | x <- rlang, x /= ""] <*> lang)

    size :: Regex -> Size
    size = fst . eval

sebzim45002y ago

Surely that fails for e.g. a?a?a?. I'd imagine you could do some sort of simplification first though to avoid this redundancy.
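
The overcount is easy to see by enumerating the choices. A small Python illustration, assuming each a? is treated as an independent choice between the empty string and "a", which is roughly what multiplying the sizes of a sequence's parts amounts to:

```python
from itertools import product

# Each a? contributes either "" or "a".
OPTIONS = ["", "a"]

# Every combination of choices for a?a?a?, concatenated into one string.
combos = ["".join(parts) for parts in product(OPTIONS, repeat=3)]

naive_count = len(combos)         # counts choices, not strings
distinct_count = len(set(combos)) # counts the strings actually matched
```

The naive product sees 8 combinations, but they collapse to 4 distinct strings: "", "a", "aa", and "aaa".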

mikhailfranco2y ago

Another interesting question is: how many possible successful matches are there for a given input string? For example:

How many ways can (a?){m}(a*){m} match the string a{m}

i.e. input m repetitions of the letter 'a'.

https://github.com/mike-french/myrex#ambiguous-example

The answer is a dot product of two vectors sliced from Pascal's Triangle.

For m=9, there are 864,146 successful matches.
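
One way to arrive at that number, as a hedged Python sketch. It assumes a "match" means a distinct assignment of the input characters to the 2m groups, in order; the function name `ambiguous_matches` is hypothetical, and the linked project may enumerate matches differently:

```python
from math import comb  # Python 3.8+

def ambiguous_matches(m):
    """Ways for (a?){m}(a*){m} to match a string of m 'a' characters, m >= 1."""
    total = 0
    for k in range(m + 1):
        # Choose which k of the m optional groups consume one 'a' each.
        optional_choices = comb(m, k)
        # Split the remaining m - k characters across the m star groups
        # (stars and bars: m - k items into m ordered bins).
        star_splits = comb(2 * m - k - 1, m - 1)
        total += optional_choices * star_splits
    return total

result = ambiguous_matches(9)
```

Under that assumption the sum for m=9 is 864146, agreeing with the figure above. Each term pairs one entry from row m of Pascal's Triangle with one entry from a diagonal of it, which is where the dot product comes from.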

someguy1010102y ago

might be something like this https://jvns.ca/blog/2016/04/24/how-regular-expressions-go-f... which refs https://en.wikipedia.org/wiki/Brzozowski_derivative

Drup2y ago

clord2y ago

stvltvs2y ago

The answer is usually an infinite number, except for very, very simple cases. Anything involving * for example means infinity is your answer.

skulk2y ago

I wonder if it makes sense to compute an "order type" for a regexp. For example, a* is omega, a*b* is 2 omega.

https://en.m.wikipedia.org/wiki/Order_type

https://en.wikipedia.org/wiki/Ordinal_number

pimlottc2y ago

What’s your use case?

simlevesque2y ago

Calculate how long it takes to bruteforce something matching a regexp.

_a_a_a_2y ago

Any definition of what 'difference and intersection of regexes' might actually mean?

d662y ago

  negation (~α): strings not matched by α
  difference (α - β): strings matched by α but not β
  intersection (α & β): strings matched by α and β
  exclusive-or (α ^ β): strings matched by α or β but not both
  inclusion (α > β): does α match all strings β matches?
  equality (α = β): do α and β match exactly the same strings?
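
On automata these operations all reduce to the classic product construction: run both DFAs in lockstep and choose which state pairs accept. A minimal Python sketch, assuming both inputs are complete DFAs over the same alphabet; the dict layout and the names `combine`, `run`, and `is_empty` are invented here, and this is not necessarily how the linked page computes its results:

```python
from itertools import product

def run(dfa, text):
    """Return True if the DFA accepts `text`."""
    state = dfa["start"]
    for ch in text:
        state = dfa["delta"][state][ch]
    return state in dfa["accept"]

def combine(d1, d2, keep):
    """Product DFA; keep(in1, in2) decides which state pairs accept."""
    pairs = list(product(d1["delta"], d2["delta"]))
    delta = {
        (p, q): {ch: (d1["delta"][p][ch], d2["delta"][q][ch])
                 for ch in d1["alphabet"]}
        for p, q in pairs
    }
    accept = {(p, q) for p, q in pairs
              if keep(p in d1["accept"], q in d2["accept"])}
    return {"alphabet": d1["alphabet"], "start": (d1["start"], d2["start"]),
            "delta": delta, "accept": accept}

def is_empty(dfa):
    """True if no accepting state is reachable from the start state."""
    seen, stack = {dfa["start"]}, [dfa["start"]]
    while stack:
        state = stack.pop()
        if state in dfa["accept"]:
            return False
        for target in dfa["delta"][state].values():
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return True

# Two hypothetical example languages over {a, b}.
EVEN_A = {"alphabet": "ab", "start": "even", "accept": {"even"},
          "delta": {"even": {"a": "odd", "b": "even"},
                    "odd": {"a": "even", "b": "odd"}}}
ENDS_B = {"alphabet": "ab", "start": "no", "accept": {"yes"},
          "delta": {"no": {"a": "no", "b": "yes"},
                    "yes": {"a": "no", "b": "yes"}}}

intersection = combine(EVEN_A, ENDS_B, lambda x, y: x and y)
difference = combine(EVEN_A, ENDS_B, lambda x, y: x and not y)
exclusive_or = combine(EVEN_A, ENDS_B, lambda x, y: x != y)
```

Negation is the same idea with a single DFA (swap accepting and non-accepting states). Inclusion and equality become emptiness checks: α > β holds when β - α is empty, and α = β holds when α ^ β is empty.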

less_less2y ago

Interesting. I think this problem is actually EXPSPACE-complete in general? But still has a straightforward algorithm.

https://en.wikipedia.org/wiki/EXPSPACE

DannyBee2y ago

It depends on your operators. For these, no.

Equivalence of NFAs is PSPACE-complete, with the upper bound following from Savitch's theorem, regardless of time bound. As such, most types of regex equivalence are PSPACE-complete.

https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.89....

Has a detailed breakdown of operators vs complexity.

In particular, the paper cited in the expspace page is talking about allowing a squaring operator.

It is EXPSPACE-complete if you allow squaring, but not if you use repetition.

I.e., it is EXPSPACE-complete if you allow e^2, but not if you only allow ee.

DannyBee2y ago

This in turn means polynomial space in the size of the input is no longer enough to deal with the regex.

If you only allow repetition, then an exponentially large regex requires exponential input size, and thus polynomial space in the size of the input still suffices to do equivalence.
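
The size gap between the two notations can be made concrete. A small Python illustration, assuming squaring is written as (e)^2 in the style of the e^2 example above; both helper names are hypothetical:

```python
def nested_square(base, depth):
    """Write `base` squared `depth` times using an explicit ^2 operator."""
    expr = base
    for _ in range(depth):
        expr = f"({expr})^2"
    return expr

def expanded(base, depth):
    """The same language written with plain concatenation only."""
    return base * (2 ** depth)

compact = nested_square("a", 10)  # 10 nested squarings
plain = expanded("a", 10)         # 2**10 copies of "a"
```

The compact form grows by a constant number of characters per squaring, while the expanded form doubles each time: 41 characters versus 1024 at depth 10. With squaring, a short input can describe an exponentially long expression; without it, the input itself has to be that long.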

blibble2y ago

(having to try them one at a time is pretty sad)

snoble2y ago

Oh neat, this is Scala via Scala.js.

hoten2y ago

On mobile: are the rectangle glyphs as suffixes on the states on purpose or am I missing a font?

progbits2y ago

The states are numbered $\alpha_0, \ldots, \alpha_N$ and $\beta_0, \ldots$. You might be missing the font for the digits.

themusicgod12y ago

ugh STOP USING GITHUB

haltist2y ago