Btw, you might also like looking into the Brzozowski derivative https://en.wikipedia.org/wiki/Brzozowski_derivative which can be used as an alternative way to match regular expressions.
For example, e-mails and urls are a special syntax. Their value space is a subset of all non-empty string which is a subset of all strings.
An e-mail address could be passed into a function that requires a non-empty string as input. When the type-system knows that an e-mail string is a subclass of non-empty string, it knows that an email address is valid.
This library can be used to check the definitions and hierarchy of such string types. The implementation of the hierarchy differs per programming language (subclassing, trait boundaries, etc).
module Email (Address, fromText, toText) where -- note we do not export the constructor of Address, just the type
data Address = Address Text
fromString :: Text -> Maybe Address
fromString =
-- you'd do your validation in here and return Nothing if it's a bad address.
-- Signal validity out of band, not in band with the data.
toText :: Address -> Text
toText (Address addr) = addr -- for when you need to output it somewhereCould you expand on this?
Don't use regex for email address validation
wossis mean? TIA
Edit: instread of downvoting try answering. I'd like to know. TIA{2}
A string enumeration has a limited number of values. E.g. type A ("Yes" | "No" | "Maybe") has three values and is a superset of type B ("Yes" | "No"). A function that accepts type A can also accept type B as valid input.
If the value space is defined by a regular expression, as is often the case, the mentioned library could be used to check, at compile-time, which type are subsets of others.
It’s a really incredible demonstration of some highly non-trivial operations on regular expressions.
The standard theory of regular expressions focuses entirely on regex matching, rather than searching. For matching, ^ and $ don't really mean anything. In particular, regexp theory is defined in terms of the "language of" a regexp: the set of strings which match it. What's the set of strings that "^" matches? Well, it's the empty string, but only if it comes at the beginning of a line (or sometimes the beginning of the document). This beginning-of-line constraint doesn't fit nicely into the "a regexp is defined by its language/set of strings" theory, much the same way lookahead/lookbehind assertions don't quite fit the theory of regular expressions.
The standard workaround is to augment your alphabet with special beginning/end-of-line characters (or beginning/end-of-document), and say that "^" matches the beginning-of-line character.
More interesting is word boundaries:
`\b` is just `\<|\>` though that should be bubbled up and usually only one side will actually produce a matchable regex.
`A\<B` is just `(A&\W)(\w&B)`, and similar for `\>`.
^(?:[0369]+|[147](?:[0369]*[147][0369]*[258])*(?:[0369]*[258]|[0369]*[147][0369]*[147])|[258](?:[0369]*[258][0369]*[147])*(?:[0369]*[147]|[0369]*[258][0369]*[258]))+$
^([0369]|[147][0369]*[258]|(([258]|[147][0369]*[147])([0369]|[258][0369]*[147])*([147]|[258][0369]\*[258])))+$
I wonder if there's a shortest one.(ab+c+)+
(abc){100}
a.*quick brown fox jumps over the lazy dog
[\-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([\-a-zA-Z0-9()@:%_+.~#?&//=]*)
if you replace that with (...)+ then it seems to work (at least for me). smaller expressions like (...){1,6} should be fine.
I was surprised then not surprised that the union & intersection REs it comes up with are not particularly concise. For example the two expressions "y.+" and ".+z" have a very simple intersection: "y.*z" (equality verified by the page, assuming I haven't typo'd anything). But the tool gives
yz([^z][^z]*z|z)*|y[^z](zz*[^z]|[^z])*zz*
instead. I think there are reasons it gives the answer it does, and giving a minimal (by RE length in characters or whatever) regular expression is probably a lot harder.I think what's actually happening here is that they're doing the intersection on the DFAs and then producing a regex from the resulting DFA. The construction of a regex from a DFA is where things get ugly and weird.
https://stackoverflow.com/questions/35513968/disable-autocor...
Regex 1: ([0369]|([258]|[147][0369]*[147])([0369]|([147][0369]*[258]|[258][0369]*[147]))*([147]|[258][0369]*[258])|([147]|[258][0369]*[258])([0369]|([147][0369]*[258]|[258][0369]*[147]))*([258]|[147][0369]*[147]))*
Regex 2: ([0369]|[258][0369]*[147]|(([147]|[258][0369]*[258])([0369]|[147][0369]*[258])*([258]|[147][0369]*[147])))*
Everything up until the last '*' is parsable. The moment I put in the *, the entire page freezes up.
Without the *, it produced a valid verifier for parsing chunks of digits whose sum mod 3 = 0.
Combined with the fact the regular expressions can be used not only on strings but more generally (e.g. for JSON schema validation [1]), this could be a possible implementation of static checks, similar to "design by contract".
--
1: https://www.balisage.net/Proceedings/vol23/html/Holstege01/B...
Basically, you need an accumulator of "stuff up to here". If you move from a node to a second node, you add the character annotating that edge to the accumulator. And whenever you end up with an edge to a visited node, you add a '*' and output that, and for leaf nodes, you output the accumulator.
And then you add a silly jumble of parenthesis on entry and output to make it right. This was kinda simple to figure out with stuff like (a(ab)*b)* and such.
This is in O(states) for R and O(2^states) for NR if I recall right.
however, as others have pointed out any non-trivial use of the kleene star means the result will be ∞. in this case the page will list numbers that roughly correspond to "number of strings with N applications of kleene star" in addition to infinity.
(EDIT: this code is completely wrongheaded and does not work; it assumes that when sequencing regexes, you can take the product of their sizes to find the overall size. This is just not true. See reply, below, for an example.)
-- https://gist.github.com/rntz/03604e36888a8c6f08bb5e8c665ba9d0
import qualified Data.List as List
data Regex = Class [Char] -- character class
| Seq [Regex] -- sequence, ABC
| Choice [Regex] -- choice, A|B|C
| Star Regex -- zero or more, A*
deriving (Show)
data Size = Finite Int | Infinite deriving (Show, Eq)
instance Num Size where
abs = undefined; signum = undefined; negate = undefined -- unnecessary
fromInteger = Finite . fromInteger
Finite x + Finite y = Finite (x + y)
_ + _ = Infinite
Finite x * Finite y = Finite (x * y)
x * y = if x == 0 || y == 0 then 0 else Infinite
-- computes size & language (list of matching strings, if regex is finite)
eval :: Regex -> (Size, [String])
eval (Class chars) = (Finite (length cset), [[c] | c <- cset])
where cset = List.nub chars
eval (Seq regexes) = (product sizes, concat <$> sequence langs)
where (sizes, langs) = unzip $ map eval regexes
eval (Choice regexes) = (size, lang)
where (sizes, langs) = unzip $ map eval regexes
lang = concat langs
size = if elem Infinite sizes then Infinite
-- finite, so just count 'em. inefficient but works.
else Finite (length (List.nub lang))
eval (Star r) = (size, lang)
where (rsize, rlang) = eval r
size | rsize == 0 = 1
| rsize == 1 && List.nub rlang == [""] = 1
| otherwise = Infinite
lang = [""] ++ ((++) <$> [x | x <- rlang, x /= ""] <*> lang)
size :: Regex -> Size
size = fst . eval
NB. Besides the utter wrong-headedness of the `product` call, the generated string-sets may not be exhaustive for infinite languages, and the original version (I have since edited it) was wrong in several cases for Star (if the argument was nullable or empty).How many ways can (a?){m}(a*){m} match the string a{m}
i.e. input m repetitions of the letter 'a'.
https://github.com/mike-french/myrex#ambiguous-example
The answer is a dot product of two vectors sliced from Pascal's Triangle.
For m=9, there are 864,146 successful matches.
I guess for regexes r1 and r2 this means the diff and intersect of their extensional sets, expressed intensionally as a regex. I guess. But nothing seems defined, including what ^ is, or > or whatever. It's not helpful
negation (~α): strings not matched by α
difference (α - β): strings matched by α but not β
intersection (α & β): strings matched by α and β
exclusive-or (α ^ β): strings matched by α or β but not both
inclusion (α > β): does α matches all strings β matches?
equality (α = β): do α and β match exactly the same strings?Equivalence of DFA or NFA is PSPACE complete by savitch's theorem, regardless of time bound. As such, most types of regex equivalence is pspace-complete.
https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.89....
Has a detailed breakdown of operators vs complexity.
In particular, the paper cited in the expspace page is talking about allowing a squaring operator.
It is EXPSPACE complete if you allow squaring, but not if you use repetition.
IE it is expspace complete if you allow e^2, but not if you only allow ee.
This in turn means polynomial space in the size of the input is no longer enough to deal with the regex.
If you only allow repetition, than an exponentially large regex requires exponential input size, and thus polynomial space in the size of the input still suffices to do equivalence.
This is generally true - operators that allow you to reduce the size of the input necessary to express a regex by a complexity class will usually increase the size complexity class necessary to determine equivalence by a corresponding amount.
(having to try them one an a time is pretty sad)