- A list generator. Enter a regex, set repetition operator constraints (e.g. * -> {0,3}, + -> {1,3}, . -> [A-Z0-9 ], etc.) and have it exhaustively generate a list of matching strings. This is helpful when you have a regex that matches your test strings but you also want to know what else it'll match. The constraints are there to keep it from generating infinite lists. Even if it jams out tens or hundreds of thousands of strings, it's still useful. I've found that most people just build up the first regex that will "match" their input text, and move on without thinking about all the edge cases they've just introduced.
- A regex assembler optimizer. Give it a few regexes, have it assemble them into one large regex and optimize it. It's got to do better than just | or'ing all the regexes together. I've seen some work done on using trie variants to do this, but have no idea how far along the work is on this.
- A regex list generator. Give it a list of strings you want to match and have it generate a regex. A sliding "fuzziness" control could tell it to take the alternates in the same character position and substitute one of:
1. Just the characters in the given list - a, t and q in the same position generates a|t|q
2. A representative narrow character range - if I give it a|t|q it knows to use [a-z], while a|t|q|4 might generate [a-z0-9]
3. A larger character range - a|t|q might just go ahead and produce [a-z0-9]
4. An even larger character range, whatever it is, just use .
And maybe another slider for repetitions: if I end up with [A-Z][A-Z][A-Z], should it just produce [A-Z]{3}, or can it go ahead and produce [A-Z]+?
Jam the result through an optimizer (see previous idea above) to clean up the regex and maybe even run it through the list generator to check if it produces only what you want.
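For what it's worth, here is a minimal Python sketch of the third idea, plus a brute-force version of the list generator to check the result. Everything in it is an assumption made for illustration: the names `char_class`, `generalize` and `enumerate_matches` are made up, inputs must be equal-length strings of lowercase ASCII letters and digits, and fuzziness level 1 emits a character class such as [aqt] in place of the equivalent a|t|q. The enumerator simply tries every string over a small alphabet up to a length cap, which is a cruder stand-in for rewriting the repetition operators.

```python
import re
import string
from itertools import product

ALNUM = set(string.ascii_lowercase + string.digits)  # assumed input alphabet


def char_class(chars, fuzz):
    """Pattern for one character position at fuzziness level 1 (tight) to 4 (loose)."""
    chars = sorted(set(chars))
    if not all(c in ALNUM for c in chars):
        raise ValueError("this sketch only handles lowercase letters and digits")
    if fuzz >= 4:
        return "."                      # level 4: anything goes
    if fuzz == 3:
        return "[a-z0-9]"               # level 3: one large range
    if fuzz == 2:                       # level 2: narrowest named ranges seen
        parts = ""
        if any(c.isalpha() for c in chars):
            parts += "a-z"
        if any(c.isdigit() for c in chars):
            parts += "0-9"
        return "[" + parts + "]"
    # level 1: exactly the characters given
    return chars[0] if len(chars) == 1 else "[" + "".join(chars) + "]"


def generalize(strings, fuzz):
    """Generate one regex from equal-length strings, position by position."""
    if len({len(s) for s in strings}) != 1:
        raise ValueError("this sketch only handles equal-length strings")
    return "".join(char_class(column, fuzz) for column in zip(*strings))


def enumerate_matches(pattern, alphabet, max_len):
    """Brute-force list generator: all strings up to max_len that fully match."""
    compiled = re.compile(pattern)
    found = []
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            candidate = "".join(letters)
            if compiled.fullmatch(candidate):
                found.append(candidate)
    return found


samples = ["a1", "t2", "q1"]
tight = generalize(samples, 1)                     # "[aqt][12]"
produced = enumerate_matches(tight, "aqt12", 2)    # six strings, not three
extras = [s for s in produced if s not in samples]
```

Even at the tightest setting the generated regex accepts "a2", "q2" and "t1", which were never in the input list. That is exactly the kind of surprise the list generator is meant to surface.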
That should be unnecessary if your regex engine does the DFA transformation. Basically, it converts the regex into a state machine and then combines all of the branches in the state machine to generate synthetic states that can represent the "superposition" of matching multiple branches. This means your regex (once compiled) will run in bounded memory and in max time proportional to the input (iirc).
I've generated some very massive regexes that are quite speedy.
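The "superposition" idea can be sketched in a few lines of Python. This is the textbook subset construction applied to a hand-written NFA for (a|b)*abb, not a description of how any particular engine does it; the NFA table and the names `step`, `build_dfa` and `dfa_match` are made up for illustration.

```python
# Hand-written NFA for (a|b)*abb: (state, char) -> set of next states.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPT = 0, {3}


def step(states, ch):
    """Advance every live NFA state at once; the result is one synthetic state."""
    return frozenset(t for s in states for t in NFA.get((s, ch), ()))


def build_dfa(alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start = frozenset({START})
    table, seen, todo = {}, {start}, [start]
    while todo:
        state = todo.pop()
        for ch in alphabet:
            nxt = step(state, ch)
            table[(state, ch)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return start, table, seen


def dfa_match(text, start, table):
    """One table lookup per input character, no backtracking."""
    state = start
    for ch in text:
        # Characters outside the alphabet fall into the empty (dead) state.
        state = table.get((state, ch), frozenset())
    return bool(state & ACCEPT)


start, table, dfa_states = build_dfa("ab")
```

Matching does a single lookup per character, so run time is proportional to the input length. The catch is the table itself: in the worst case the number of synthetic states grows exponentially with the size of the NFA, so memory is bounded but not necessarily small.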
Merger
https://metacpan.org/pod/Regexp::Assemble
These are also super handy https://metacpan.org/pod/Number::Range::Regex
https://metacpan.org/pod/Regexp::Common
I'd really love to see a better regex syntax. The current one is obviously deficient beyond repair. The tools cannot address the root of the problem.
𝄞 is one char, not two. привет is matched by \w+.
PS there's some advanced stuff, but where are the basic [[:posix:]] char classes?
It seems a very nice regex page otherwise.
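For comparison, here is how those two cases behave in Python 3, whose `re` module works on code points and treats \w as Unicode-aware by default. This only illustrates the behaviour the comment above seems to be asking for; reading the complaint as "the tool counts UTF-16 code units and has an ASCII-only \w" is an assumption on my part.

```python
import re

clef = "\U0001D11E"  # the musical G clef symbol from the comment above

# One code point, but two 16-bit units when encoded as UTF-16.
code_points = len(clef)
utf16_units = len(clef.encode("utf-16-le")) // 2

# A single "." consumes the whole symbol.
dot_matches_clef = re.fullmatch(r".", clef) is not None

word = "привет"
# Unicode-aware \w (the default for str patterns) matches Cyrillic letters.
unicode_w = re.fullmatch(r"\w+", word) is not None
# With re.ASCII, \w shrinks to [a-zA-Z0-9_] and the match fails.
ascii_w = re.fullmatch(r"\w+", word, re.ASCII) is not None
```

Here `code_points` is 1 while `utf16_units` is 2, and `unicode_w` is True while `ascii_w` is False.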
Are you perhaps trying to use a RegEx feature that is not supported by JS? Currently, RegExr only supports the JS flavour of RegEx.
> RegExr only supports modern desktop browsers.
I'm using Firefox 30 on Ubuntu. I think it's plenty modern :)
Obviously they need to fix the error message...
Is there a tool that allows you to highlight portions of a string and generate a corresponding regex? (i.e. the inverse of RegExr)
Consider the string abcdefgh
Guess what!? I have the perfect regex to match your string.
"abcdefgh"
So given a string literal, there is always a regex to match that literal. Namely, the literal itself.
Really, what you want is a tool that, given several examples, will generate a regex that matches all of them.
So you'd give it:
aaaaabaa
aabaaa
aba
abaaaaa
And it'd generate "a+ba+"
The problem with that is, given a corpus with a set of tokens { T0, T1, T2 ... }, I can give you a regex that will match the corpus!
"[T0 T1 T2 ... ]*"
or even ".*"
So it will match everything in your corpus! But unfortunately, it will match a whole lot you don't want, too.
So ideally you want a regex that matches everything in your corpus, but nothing outside the language you are trying to describe. This requires both positive and negative learning examples. The problem is that for most applications, you'd need a lot of negative examples.
Source: Working on this exact problem for graduate research
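A small Python sketch of why the negative examples matter. The positive strings are the ones from the example above; the negative strings, the candidate list and the function name `consistent` are made up for illustration, and this only filters hand-written candidates. It does not learn a regex.

```python
import re


def consistent(pattern, positives, negatives):
    """True if the pattern fully matches every positive and no negative."""
    rx = re.compile(pattern)
    return (all(rx.fullmatch(s) for s in positives)
            and not any(rx.fullmatch(s) for s in negatives))


positives = ["aaaaabaa", "aabaaa", "aba", "abaaaaa"]
negatives = ["", "aaa", "abab", "ba", "ab", "xyz"]  # hypothetical counterexamples
candidates = ["a+ba+", "[ab]*", ".*"]

# With positives only, every candidate looks fine.
without_negatives = [p for p in candidates if consistent(p, positives, [])]

# Negative examples are what rule out the over-general patterns.
with_negatives = [p for p in candidates if consistent(p, positives, negatives)]
```

Without negatives all three candidates survive. With them, only "a+ba+" is left.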
But that's pretty stupid, because you don't generalize beyond your examples.
What's your approach?
edit: removed random conjecture
Given that, what I think you're really asking is, "how do I automatically generate a regex of optimal conciseness given a set of inputs I'd like to match, and maybe a bunch of other inputs I want to avoid matching?"
This looks like it iteratively does what you want: http://regex.inginf.units.it/ (Note that when I went there, it said "6 slots available", presumably because everything runs server-side. If a bunch of people pile in there, you probably won't actually be able to test it due to limited resources on their part.)
Try \w+ instead.