RegExr 2.0

(regexr.com)

106 points · _kushagra · 12y ago · 43 comments

bane 12y ago

Regex testing is cool, but there are dozens of these kinds of tools and I'd really love to see some other kinds of regex tools:

- A list generator. Enter a regex, set repetition operator constraints (e.g. *->{0,3}, +->{1,3}, .->[A-Z0-9 ], etc.) and have it exhaustively generate a list of matching strings. This is helpful when you have a regex that matches your test strings, and it also lets you know what else it'll match. The constraints are to keep it from generating infinite lists. Even if it jams out tens or hundreds of thousands of produced strings, it's still useful. I've found that most people just build up the first regex that will "match" their input text, and move on without thinking about all the edge cases they've just introduced.

- A regex assembler optimizer. Give it a few regexes, have it assemble them into one large regex and optimize it. It's got to do better than just | or'ing all the regexes together. I've seen some work done on using trie variants to do this, but have no idea how far along the work is on this.

- A regex list generator. Give it a list of strings you want to match and have it generate a regex. A sliding "fuzziness" control could tell it to take alternates in the same character position and substitute one of the following:

1. Just the characters in the given list - a, t and q in the same position generates a|t|q

2. A representative narrow character range - if I give it a|t|q it knows to use [A-Z] while a|t|q|4 might generate [A-Z0-9]

3. A larger character range, a|t|q might just go ahead and produce [A-Z0-9]

4. An even larger character range, whatever it is, just use .

And maybe another slider for repetitions, so if I end up with [A-Z][A-Z][A-Z], should it just produce [A-Z]{3} or can I go ahead and have it [A-Z]+

Jam the result through an optimizer (see previous idea above) to clean up the regex and maybe even run it through the list generator to check if it produces only what you want.
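The list-generator idea above can be approximated by brute force. This is a minimal Python sketch, not an existing tool: instead of rewriting each repetition operator, it keeps the output finite with an explicit alphabet and a maximum length (the names `matching_strings`, `alphabet` and `max_len` are assumptions made for this illustration) and keeps every candidate string the regex fully matches.

```python
import itertools
import re


def matching_strings(pattern, alphabet, max_len):
    """Return every string over `alphabet`, up to `max_len` chars, that the pattern fully matches."""
    regex = re.compile(pattern)
    found = []
    for length in range(max_len + 1):
        # Enumerate all candidate strings of this length, shortest first.
        for chars in itertools.product(alphabet, repeat=length):
            candidate = "".join(chars)
            if regex.fullmatch(candidate):
                found.append(candidate)
    return found


print(matching_strings(r"a+b?", "ab", 3))
```

The cost grows exponentially with `max_len`, so this only works for small alphabets and short strings, but it is enough to see what else a pattern accepts.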

aaronblohowiak 12y ago

>- A regex assembler optimizer. Give it a few regexes, have it assemble them into one large regex and optimize it. It's got to do better than just | or'ing all the regexes together. I've seen some work done on using trie variants to do this, but have no idea how far along the work is on this.

That should be unnecessary if your regex engine does the DFA transformation. Basically, it converts the regexp into a state machine and then combines all of the branches in the state machine to generate synthetic states that can represent the "superposition" of matching multiple branches. This means your regex (once compiled) will run in bounded memory and in time at most proportional to the input (iirc).
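To make the "superposition" states concrete, here is a small Python sketch of the subset construction, restricted to a union of literal strings (an assumption to keep it short; a real engine handles full regex syntax). Each DFA state is a set of (word index, position) pairs, one per branch that is still alive. The names `build_dfa` and `dfa_match` are hypothetical.

```python
def build_dfa(words):
    """Subset construction for the union of literal words."""
    start = frozenset((i, 0) for i in range(len(words)))
    transitions = {}   # (state, char) -> next state
    accepting = set()  # states in which at least one word is complete
    todo, seen = [start], {start}
    while todo:
        state = todo.pop()
        if any(pos == len(words[i]) for i, pos in state):
            accepting.add(state)
        # Group the live branches by the next character they expect.
        by_char = {}
        for i, pos in state:
            if pos < len(words[i]):
                by_char.setdefault(words[i][pos], set()).add((i, pos + 1))
        for ch, targets in by_char.items():
            nxt = frozenset(targets)
            transitions[(state, ch)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return start, transitions, accepting


def dfa_match(dfa, text):
    """Run the DFA over text: one table lookup per input character."""
    state, transitions, accepting = dfa
    for ch in text:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in accepting


dfa = build_dfa(["foo", "foobar", "fob"])
print(dfa_match(dfa, "foobar"), dfa_match(dfa, "fooba"))
```

Matching never backtracks: however many alternatives were merged, each input character costs a single transition.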

lespea 12y ago

I actually do the combining idea all the time. As long as the language is roughly PCRE-compatible, you can use this to spit out your regex and (if necessary for your alternate language) tweak it a bit so it fits.

I've generated some very massive regexes that are quite speedy.

Merger

  https://metacpan.org/pod/Regexp::Assemble

These are also super handy

  https://metacpan.org/pod/Number::Range::Regex
  https://metacpan.org/pod/Regexp::Common
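For readers who have not seen the trie approach mentioned upthread, here is a rough Python sketch of the general idea: put the literal strings in a trie, then emit one regex that shares common prefixes. It is only an illustration and makes no claim about how Regexp::Assemble works internally; the function names `build_trie` and `trie_to_regex` are made up.

```python
import re


def build_trie(words):
    """Build a nested-dict trie; the empty-string key marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie


def trie_to_regex(node):
    """Turn a trie node into a regex fragment that shares common prefixes."""
    ends_here = "" in node
    branches = [re.escape(ch) + trie_to_regex(node[ch])
                for ch in sorted(node) if ch != ""]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]
    body = "(?:" + "|".join(branches) + ")"
    # If a word may stop at this node, everything after it is optional.
    return body + "?" if ends_here else body


words = ["foo", "foobar", "fob"]
pattern = trie_to_regex(build_trie(words))
print(pattern)
```

For these three words the result is `fo(?:b|o(?:bar)?)` instead of the plain alternation `foo|foobar|fob`.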

bane 12y ago

Yeah, Regexp::Assemble was what I had in mind. There are a few that try to generate a list of matching strings from the expression, but I've never been satisfied with their output. Either they're slow, or they don't let you constrain the regex, and none of them generate comprehensive lists for some reason.

ExpiredLink 12y ago

> I'd really love to see some other kinds of regex tools

I'd really love to see a better regex syntax. The current one is obviously deficient beyond repair. The tools cannot address the root of the problem.

jacobolus 12y ago

http://www.perl6.org/archive/doc/design/apo/A05.html

jonny_eh 12y ago

Why don't you take a crack at it?

lelf 12y ago

People just cannot do Unicode even remotely properly. Just cannot.

𝄞 is one char, not two. привет is matched by \w+.

PS there's some advanced stuff, but where are the basic [[:posix:]] char classes?
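The "one char, not two" complaint comes down to code points versus UTF-16 code units. JavaScript strings are sequences of UTF-16 code units, so a character outside the Basic Multilingual Plane takes two of them. A small Python sketch of the difference (Python is used here only as a neutral way to count both; the variable names are illustrative):

```python
import re

clef = "\U0001D11E"  # the musical G clef, outside the Basic Multilingual Plane

code_points = len(clef)                           # Python strings count code points
utf16_units = len(clef.encode("utf-16-le")) // 2  # a UTF-16 string counts code units

# A code-point-aware engine lets "." consume the whole character.
dot_matches_whole_char = re.fullmatch(r".", clef) is not None

print(code_points, utf16_units, dot_matches_whole_char)
```

An engine that works on code units sees the clef as a surrogate pair, which is why it can show up as two matches for `.` instead of one.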

eurg 12y ago

Just to make it clear: It does not even support the basic Latin-1 charset correctly. Matching my family name requires manual intervention. This is sad.

It seems a very nice regex page otherwise.

gskinner 12y ago

Creator here - can you elaborate? What is your family name? The example in this thread ("Grüneis") matches and displays correctly in all the browsers I've tested.

Are you perhaps trying to use a RegEx feature that is not supported by JS? Currently, RegExr only supports the JS flavour of RegEx.

eurg 12y ago

Forget it, I was not used to JavaScript RegEx. I just looked it up on MDN, and it really defines `\w` to be very limited. Doesn't really make it any better, but whatever.
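The limited `\w` is easy to reproduce. In JavaScript, `\w` is equivalent to `[A-Za-z0-9_]`. The Python sketch below approximates that with the `re.ASCII` flag and compares it with Python's default Unicode-aware `\w`; it is a stand-in for the JS behaviour, not a JS test, and the constant names are illustrative.

```python
import re

JS_LIKE_WORD = re.compile(r"\w+", re.ASCII)  # approximates JS: \w is [A-Za-z0-9_]
UNICODE_WORD = re.compile(r"\w+")            # Python default: \w is Unicode-aware

ascii_pieces = JS_LIKE_WORD.findall("Grüneis")    # the name is split around "ü"
unicode_pieces = UNICODE_WORD.findall("Grüneis")  # the name is matched as one word
cyrillic_ascii = JS_LIKE_WORD.findall("привет")   # no ASCII word characters at all

print(ascii_pieces, unicode_pieces, cyrillic_ascii)
```

With the ASCII-only class the family name comes back as two fragments, which is the "manual intervention" case described above.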

dfc 12y ago

Family name = Grüneis

Svip 12y ago

It doesn't support \p{} either for matching Unicode classes. E.g. \p{Lu} matches uppercase letters (so Æ and Ö count too).
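Where an engine has no `\p{...}` support, the same general-category test can be done outside the regex. A minimal Python sketch follows (Python's standard `re` module also lacks `\p{...}`, so this uses `unicodedata`; the helper name `uppercase_letters` is made up for illustration).

```python
import unicodedata


def uppercase_letters(text):
    """Return the characters whose Unicode general category is Lu (uppercase letter)."""
    return [ch for ch in text if unicodedata.category(ch) == "Lu"]


print(uppercase_letters("Æble Öl abc"))
```

This is what `\p{Lu}` would select: `Æ` and `Ö` count as uppercase letters even though they are outside `[A-Z]`.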

core1024 12y ago

I couldn't find a way to add the /u or /s flag. Only /i, /g and /m are allowed :(

gskinner 12y ago

Creator here - we are currently relying on the JS RegExp API, and thus only support features of that engine, which are somewhat limited. In the future, we may support other flavours. We may also add specific errors for more common features that are not supported, as I've already done for lookbehinds.

shock 12y ago

> Uh-oh, it looks like your browser is not supported.

> RegExr only supports modern desktop browsers.

I'm using Firefox 30 on Ubuntu. I think it's plenty modern :)

LukeB_UK 12y ago

I get the same message with Chrome 34 on Android 4.4.2

rguldener 12y ago

Pretty sure Android is not commonly considered a desktop system ;) Though mobile (or at least tablet) support would be cool

eik3_de 12y ago

same here :(

isomorphic 12y ago

I got the same message (FF28 Mac). Then I turned session cookies on and it worked.

Obviously they need to fix the error message...

jvehent 12y ago

No problem here with Fedora 20 and Firefox 31.0a1 (2014-04-04)

markbnj 12y ago

Very nicely done. As someone else pointed out there are quite a few of these tools, but I think you've done a really nice job with this one. One suggestion: make the reference easier to scan at a top level as opposed to drilling down.

nlh 12y ago

I'm guessing the following is either near-impossible or pure-impossible, but:

Is there a tool that allows you to highlight portions of a string and generate a corresponding regex? (i.e. the inverse of RegExr)

gamegoblin 12y ago

Here is the problem with that:

Consider the string abcdefgh

Guess what!? I have the perfect regex to match your string.

  "abcdefgh"

So given a string literal, there is always a regex to match that literal. Namely, the literal itself.

Really, what you want is a tool that, given several examples, will generate a regex that matches all of them.

So you'd give it:

  aaaaabaa
  aabaaa
  aba
  abaaaaa

And it'd generate "a+ba+"

The problem with that is, given a corpus with a set of tokens { T0, T1, T2 ... }, I can give you a regex that will match the corpus!

  "[T0 T1 T2 ... ]*"

or even

  ".*"

So it will match everything in your corpus! But unfortunately, it will match a whole lot you don't want, too.

So ideally you want a regex that matches everything in your corpus, but nothing outside the language you are trying to describe. This requires both positive and negative learning examples. The problem is that for most applications, you'd need a lot of negative examples.

Source: Working on this exact problem for graduate research

nmrm 12y ago

T0 | T1 | T2 | ... would match exactly the correct thing with all positive examples, and (T0 | T1 | T2) & !(CE1 | CE2 | CE3) would match exactly the correct thing with positive and negative examples.

But that's pretty stupid, because you don't generalize beyond your examples.

What's your approach?

edit: removed random conjecture

gamegoblin 12y ago

You have to have some sort of heuristic that determines what a "good" regex is, since there are undoubtedly multiple regexes that describe a corpus.

A simple heuristic is the smallest regex.

So in your example, given the training examples:

  aba
  abaa
  aaaaba

and the counter examples:

  abba
  ba
  ab

It's clear to a human I probably want to match "a+ba+". That's clearly much smaller than ("aba" | "abaa" | "aaaaba") & !("abba" | "ba" | "ab"), so it would be a "better" regex.
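The "smallest regex" heuristic can be tried by brute force on this toy example. The Python sketch below enumerates candidate patterns in order of increasing length and returns the first one that fully matches every training example and no counterexample. The symbol set `ab+*?` and the length cap are assumptions chosen to keep the search tiny; this is not the approach from the research mentioned above, and `shortest_regex` is a hypothetical name.

```python
import itertools
import re


def shortest_regex(positives, negatives, symbols="ab+*?", max_len=5):
    """Return the first (shortest) pattern consistent with the examples, or None."""
    for length in range(1, max_len + 1):
        for candidate in itertools.product(symbols, repeat=length):
            pattern = "".join(candidate)
            try:
                regex = re.compile(pattern)
            except re.error:
                continue  # skip syntactically invalid candidates such as "+a"
            accepts_all = all(regex.fullmatch(s) for s in positives)
            rejects_all = not any(regex.fullmatch(s) for s in negatives)
            if accepts_all and rejects_all:
                return pattern
    return None


print(shortest_regex(["aba", "abaa", "aaaaba"], ["abba", "ba", "ab"]))
```

On these six strings the search comes back with `a+ba+`. The search space grows exponentially with pattern length, so for anything realistic the enumeration has to be replaced by something smarter.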

weavie 12y ago

Sounds like you want to be able to specify some kind of pattern to define accepted and rejected matches. A regex would be ideal for this. oh wait....

Dewie 12y ago

Since you're a researcher I must be missing something. But since regexps are closed under union, what is the problem with taking the union of all of them? I'm imagining that it would be conceptually simple to hook up all of the non deterministic state machines such that you get a non deterministic state machine which is the union of all of them. Then convert it to a deterministic state machine. You might get state explosion, but at least you would have found some machine to recognize the language. Is state minimization simple (complexity wise)? Is it even possible to find a decently small DSM in the general case (not necessarily the most minimal machine)?

gamegoblin 12y ago

My reply to nmrm might answer your question.

Finding some regular expression that matches all of the positive examples and does not match all of the negative examples is trivial. Finding a good regular expression that does that is not.

State minimization does not mitigate this problem. As an aside, state minimization is a polynomial algorithm.

Given the positive examples:

  aba
  abaa
  aaaaba

and the negative examples:

  abba
  ba
  ab

we could make a regex that does something like ("aba" | "abaa" | "aaaaba") & !("abba" | "ba" | "ab"), but unfortunately, running a state minimization algorithm on this regex does not give you "a+ba+" because the two regexes are not equivalent (they do not accept the same language).

So you can find plenty of regexes that will match your examples and not match your counterexamples, but you cannot easily minimize them to what you do want.
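The non-equivalence is easy to check. In the Python sketch below, both patterns are consistent with the six examples, but only one generalizes to a string outside the training set. The unseen string `aabaa` and the helper name `consistent` are illustrative assumptions, and the memorized pattern is the plain union of the positives, since none of the negatives is in that union anyway.

```python
import re

positives = ["aba", "abaa", "aaaaba"]
negatives = ["abba", "ba", "ab"]

memorized = re.compile("|".join(map(re.escape, positives)))  # aba|abaa|aaaaba
generalized = re.compile(r"a+ba+")


def consistent(regex):
    """True if the regex accepts every positive and rejects every negative example."""
    return (all(regex.fullmatch(s) for s in positives)
            and not any(regex.fullmatch(s) for s in negatives))


unseen = "aabaa"
print(consistent(memorized), consistent(generalized))
print(bool(memorized.fullmatch(unseen)), bool(generalized.fullmatch(unseen)))
```

Minimizing the automaton for the memorized pattern keeps its language exactly the same, so it can never turn into `a+ba+`.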

staticshock 12y ago

"aaa" is a valid regex that matches the string "aaa". If you have special characters in your source string, many libraries have a function for escaping them. So, generating a regex to match your exact string is trivial. Even matching a group of strings is trivial via (aaa|bbb|etc), though it gets long.

Given that, what I think you're really asking is, "how do I automatically generate a regex of optimal conciseness given a set of inputs I'd like to match, and maybe a bunch of other inputs I want to avoid matching?"

This looks like it iteratively does what you want: http://regex.inginf.units.it/ (Note that when I went there, it said "6 slots available", presumably because everything runs server-side. If a bunch of people pile in there, you probably won't actually be able to test it due to limited resources on their part.)

mc_hammer 12y ago

this is very cool ty for the post.

fancy_betta 12y ago

Probably not what you're looking for, but check out http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313....

chomp 12y ago

Regexp::Assemble kind of gets you there, you can feed it strings and it'll spit out a regex.

ChrisGaudreau 12y ago

Reminds me of http://rubular.com/, except it isn't Ruby-focused and is more community-based. Seems pretty cool.

dehrmann 12y ago

Or http://www.regexplanet.com/, but regex planet supports a lot more flavors.

RussianCow 12y ago

Or http://re-try.appspot.com/ for a Python equivalent.

mck- 12y ago

There is one for JavaScript that I use pretty often: http://scriptular.com/ (based on the Ruby one: http://www.rubular.com/)

cygni 12y ago

https://www.debuggex.com/ is a nice alternative that is a little different from other regex sites I've seen.

strictfp 12y ago

Why can I not match against "\w*" for instance? It just says "infinite" and does not seem to attempt to match.

gskinner 12y ago

Creator here - this is because \w* can match 0 characters, and thus matches infinitely. You can roll over the "infinite" error for details, or look in the help.

Try \w+ instead.
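The usual way for a find-all loop to survive zero-length matches is to step forward by one character whenever a match is empty. The Python sketch below shows that general technique; it is not RegExr's code, and `find_all_spans` is a made-up name. (Python's own `re.finditer` already handles this; the manual loop is written out only to show the guard.)

```python
import re


def find_all_spans(pattern, text):
    """Collect match spans left to right, stepping past zero-length matches."""
    regex = re.compile(pattern)
    spans = []
    pos = 0
    while pos <= len(text):
        match = regex.search(text, pos)
        if match is None:
            break
        spans.append(match.span())
        if match.end() > match.start():
            pos = match.end()      # normal case: resume after the match
        else:
            pos = match.end() + 1  # empty match: force progress to avoid looping forever
    return spans


print(find_all_spans(r"\w*", "ab !a"))
```

Without the `+ 1` branch, the loop would find the same empty match at the same position on every iteration, which is the "infinite" situation described above.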

strictfp 12y ago

But \w* matches "" and "abc" but not "!a". How can I test this with your tool if \w* always says "infinite"?

hardwaresofton 12y ago

one of the best regular expression testers online just got better. Great site, love it

tuananh 12y ago

a side note: i found Patterns app on OS X very useful for regex.
