For example, <http://example.com./>, <http:///example.com/>, and <https://en.wikipedia.org/wiki/Space (punctuation)> are classified as invalid URLs in the blog post, but browsers accept them.
As the creator of cURL puts it, there is no URL standard[3].
[1]: https://www.ietf.org/rfc/rfc3986.txt
[2]: https://www.ietf.org/rfc/rfc3987.txt
[3]: https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/
Yeah, that's it: that's what they're trying to validate.
^(?:http(?:(?:://(?:(?:(?:code\.google\.com/events/#&product=browser|\-\.~_!\$&'\(\)\*\+,;=:%40:80%2f::::::@ex\.com|foo\.(?:bar/\?q=Test%20URL\-encoded%20stuff|com/(?:\(something\)\?after=parens|unicode_\(\)_in_parens|b_(?:\(wiki\)(?:_blah)?#cite\-1|b(?:_\(wiki\)_\(again\)|/))))|uid(?::password@ex\.com(?::8080)?/|@ex\.com(?::8080)?/)|www\.ex\.com/wpstyle/\?p=364|223\.255\.255\.254|उदाहरण\.परीक्षा|1(?:42\.42\.1\.1/|337\.net)|مثال\.إختبار|df\.ws/123|a\.b\-c\.de|\.ws/䨹|⌘\.ws/|例子\.测试|j\.mp)|142\.42\.1\.1:8080/)|\.damowmow\.com/)|s://(?:www\.ex\.com/foo/\?bar=baz&inga=42&quux|foo_bar\.ex\.com/))|://(?:uid(?::password@ex\.com(?::8080)?|@ex\.com(?::8080)?)|foo\.com/b_b(?:_\(wiki\))?|⌘\.ws))|ftp://foo\.bar/baz)$
I had to replace some words with shorter ones to squeeze under the 1000-character limit, and there's no way to provide negative examples right now. Something to fix!

I guess this is the regex equivalent of overfitting :)
This kind of feels like a magic spell :)
In search of the perfect URL validation regex - https://news.ycombinator.com/item?id=10019795 - Aug 2015 (77 comments)
In search of the perfect URL validation regex - https://news.ycombinator.com/item?id=7928968 - June 2014 (81 comments)
At Transcend, we need to allow site owners to regulate arbitrary network traffic, so our data flow input UI¹ was designed to detect all valid hosts (including local hosts, IDNs, IPv6 literal addresses, etc.) and URLs (host-relative, protocol-relative, and absolute). If the site owner inputs content that is not a valid host or URL, we treat their input as a regex.
I came up with these simple utilities built on top of the URL interface standard² to detect all valid hosts & URLs:
• isValidHost: https://gist.github.com/eligrey/6549ad0a635fa07749238911b429...
Example valid inputs:
host.example
はじめよう.みんな (IDN domain; xn--p8j9a0d9c9a.xn--q9jyb4c)
[::1] (IPv6 address)
0xdeadbeef (IPv4 address; 222.173.190.239)
123.456 (IPv4 address; 123.0.1.200)
123456789 (IPv4 address; 7.91.205.21)
localhost
• isValidURL (and isValidAbsoluteURL): https://gist.github.com/eligrey/443d51fab55864005ffb3873204b...

Example valid inputs to isValidURL:
https://absolute-url.example
//relative-protocol.example
/relative-path-example
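Those surprising IPv4 spellings fall straight out of WHATWG URL host parsing, which you can check with the standard URL class itself (a quick sketch, not the gists above; `hostOf` is a made-up helper):

```javascript
// The WHATWG URL parser canonicalizes IPv4 hosts, including hex
// (0x...) and bare-integer forms, into dotted-quad notation.
const hostOf = (s) => new URL(`http://${s}`).hostname;
```

For instance, `hostOf("0xdeadbeef")` and `hostOf("123456789")` come back as the dotted-quad forms annotated in the list above.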
1. https://docs.transcend.io/docs/configuring-data-flows

I've found some reasoning[0] as to why it's not supported with browsers in mind, though.
Well, that should make things a lot easier. What does he mean here? The rest of the text doesn't make it clear to me, unless it's meant to be "every possibly valid HTTP, HTTPS, or FTP URL" which isn't exactly "the opposite".
> Assume that this regex will be used for a public URL shortener written in PHP, so URLs like http://localhost/, //foo.bar/, ://foo.bar/, data:text/plain;charset=utf-8,OHAI and tel:+1234567890 shouldn’t pass (even though they’re technically valid). Also, in this case I only want to allow the HTTP, HTTPS and FTP protocols.
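For what it's worth, those constraints are easy to express on top of a real parser instead of a regex; a sketch using the WHATWG URL class (`passesChallenge` is a made-up name, and the allow-list comes straight from the quote):

```javascript
// Accept only absolute http/https/ftp URLs, and reject localhost,
// per the quoted challenge. Relative forms like //foo.bar/ make the
// URL constructor throw, so they fall out naturally.
const ALLOWED = new Set(["http:", "https:", "ftp:"]);

const passesChallenge = (s) => {
  let u;
  try {
    u = new URL(s);
  } catch {
    return false; // not an absolute URL at all
  }
  return ALLOWED.has(u.protocol) && u.hostname !== "localhost";
};
```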
So, does this apply to URLs? The fact that these regexes are... so huge... makes me think that something is fundamentally wrong. Are URLs describable in a Chomsky Type 3 grammar? Are they sufficiently regular that using a regex is sensible? What do the actual browsers do?
[1] https://stackoverflow.com/questions/1732348/regex-match-open...
URLs are not recursive structures, so I’d say the single hardest feature of html is not present.
There was a Perl program that would take something like a BNF and barf out a gigantic regex (maybe with some maximum depth).
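That trick is simple enough to sketch: expand each nonterminal into a non-capturing group, and emit a never-matching branch once the depth budget runs out (a toy reconstruction in JS, not the actual Perl tool):

```javascript
// grammar: rule name -> array of space-separated alternatives;
// tokens that name a rule recurse, everything else is a literal.
const toRegex = (grammar, rule, depth) => {
  if (depth === 0) return "(?!)"; // (?!) never matches: the depth cut-off
  const alts = grammar[rule].map((alt) =>
    alt
      .split(" ")
      .map((tok) =>
        tok in grammar
          ? toRegex(grammar, tok, depth - 1)
          : tok.replace(/[.*+?^${}()|[\]\\]/g, "\\$&") // escape literals
      )
      .join("")
  );
  return "(?:" + alts.join("|") + ")";
};
```

With `{ S: ["a S b", "c"] }` and depth 4 this yields a regex matching aⁿcbⁿ up to n = 3, which is exactly the "maximum depth" caveat: the output is only regular because the recursion is truncated.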
Yes: if your regex is longer than {.../50/100/...} characters, then write a parser.
I struggle to understand why people write those crazy regexes for emails, URLs, and HTML when nearly every popular technology has battle-tested parsers for those things.
As an example, http://localhost/ is a technically valid URL, which he wants to block. Should this error say "misformatted URL" like all the others?
Using regex to cover all such cases is really the wrong tool for the job.
Most URL validation rules/regexes/libraries/etc. reject "example.com". However, if you head over to Stripe (for example), in the account settings, when asked for your company's URL, Stripe will accept "example.com" and assume the "http://" prefix (which, yes, can have its own problems).
What's a good solution? I both want to validate URLs, but also let users enter "example.com". But if I simply do
if(validateURL(url)) {
return true;
} else if(validateURL("http://" + url)) {
return true;
} else {
return false;
}
i.e. validate the given URL and, as a fallback, try to validate "http://" + the given URL, that opens the door to weird non-URL strings being incorrectly validated... Help :-)
if (validateURL(url)) {
return url;
} else if (validateURL("http://" + url)) {
return "http://" + url;
} else {
return null;
}

Or even:

if (!validateURL(url)) {
url = "http://" + url;
if (!validateURL(url)) {
url = null;
}
}
return url;
to snip a small probability of a bug?

I'd probably do the same for poorly formatted URLs. When the user hits Submit, a prompt appears saying, "Did you mean `https://example.com`?"
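If the helper is built on the URL parser rather than a regex, you can narrow that door a bit: retry with an "http://" prefix, but insist on a dotted hostname so bare words don't validate (a sketch; `normalizeUrl` and the dot heuristic are my own assumptions, not a standard rule):

```javascript
// Try the input as-is, then with an http:// prefix; accept only
// http(s) results whose hostname contains a dot, so strings like
// "hello" don't slip through as http://hello/.
function normalizeUrl(input) {
  for (const candidate of [input, "http://" + input]) {
    let u;
    try {
      u = new URL(candidate);
    } catch {
      continue;
    }
    if (
      (u.protocol === "http:" || u.protocol === "https:") &&
      u.hostname.includes(".")
    ) {
      return u.href; // canonicalized URL
    }
  }
  return null;
}
```

It still can't tell you whether example.com was what the user meant, only that it parses; the confirmation prompt handles the rest.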
If it's really important, you could try making a request to the URL and seeing if it loads, but that still doesn't validate that it's the URL they intended to input.
Might be cool to load the URL with Puppeteer and capture a screenshot of the page. If they can't recognize their own website, it's on them.
@^(https?|ftp)://[^\s/$.?#].[^\s]*$@iS
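Roughly the same thing in JavaScript (the @...@ pair is just the PCRE delimiter and S is a PCRE study modifier, neither of which exists in JS):

```javascript
// Port of the PHP pattern above: scheme, then a first host character
// that isn't whitespace or one of /$.?#, then at least one more
// non-whitespace-containing tail.
const simpleUrl = /^(https?|ftp):\/\/[^\s/$.?#].[^\s]*$/i;
```

It happily accepts plenty of junk after the first couple of characters, which is the trade-off being made here.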
The simpler the better, if you're going to use something that is not ideal.

I just discovered this yesterday, and I'm glad I didn't have to come up with this:
https://semver.org/#is-there-a-suggested-regular-expression-...
My use case for it: https://github.com/typesense/typesense-website/blob/25562d02...
After I did this, the interviewer stopped me and told me in a negative way that he expected me to use a regex, which kinda shows he had no idea how a web browser works.
Unless it was a specific URL with specific params?
Like, for just the params part (yes, broken and simplistic):
#!/usr/bin/perl
$_="a=b&c=d&e=f&whatever=some thing";
while (s/^([^&]*)=([^&]*)(&|$)//) {
print "[$1] [$2]\n";
}

My imagined remedies are no 1:1 interviews and recording these sessions for "possible quality assurance and training purposes".
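For the params part specifically, the battle-tested built-in (in browsers and Node) is URLSearchParams, which does the same job as the Perl loop earlier:

```javascript
// Parses a query string into decoded key/value pairs.
const params = new URLSearchParams("a=b&c=d&e=f&whatever=some thing");
for (const [key, value] of params) {
  console.log(`[${key}] [${value}]`);
}
```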
TLDR: it’s a traditional parser—a big state machine that steps through the URL character by character and tokenizes it into the relevant pieces.
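In miniature, that kind of character-by-character machine looks something like this (a toy with a handful of states, nowhere near the real parser's coverage of IPv6 literals, userinfo, relative URLs, and so on):

```javascript
// Toy URL tokenizer: one pass over the input, switching states as
// delimiter characters (:, /, ?, #) appear.
function tokenizeUrl(input) {
  const out = { scheme: "", host: "", path: "", query: "", fragment: "" };
  let state = "scheme";
  for (const c of input) {
    switch (state) {
      case "scheme":
        if (c === ":") state = "slashes";
        else out.scheme += c;
        break;
      case "slashes": // skip the // after the scheme
        if (c !== "/") {
          state = "host";
          out.host += c;
        }
        break;
      case "host":
        if (c === "/") { state = "path"; out.path = "/"; }
        else if (c === "?") state = "query";
        else if (c === "#") state = "fragment";
        else out.host += c;
        break;
      case "path":
        if (c === "?") state = "query";
        else if (c === "#") state = "fragment";
        else out.path += c;
        break;
      case "query":
        if (c === "#") state = "fragment";
        else out.query += c;
        break;
      case "fragment":
        out.fragment += c;
        break;
    }
  }
  return out;
}
```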
And if they still insist even after the source of Chromium and FF has been consulted. Well then it’s time to leave. Don’t want to work with anyone like that.
u.checkURL = function (string) {
if ($.type(string) === "string") {
if (/^(https?|ftp):(\/\/(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*@)?((\[(|(v[\da-f]{1,}\.(([a-z]|\d|-|\.|_|~)|[!\$&'\(\)\*\+,;=]|:)+))\])|((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=])*)(:\d*)?)(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)*)*|(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)*)*)?)|((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)*)*)|((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)){0})(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)|\/|\?)\*)?$/i.test(string)) {
return true;
} else {
return false;
}
} else {
return false;
}
}

if (predicate) {
return true;
} else {
return false;
}
Can just be written as: return predicate;
So your code above can be: return $.type(string) === "string" && /regex/i.test(string);
Regexes like this are truly hideous, though; they may as well be written in Brainfuck for all their lack of maintainability and readability.

I will never understand why regular expressions are considered the best tool for the job when it comes to parsing; they are far too terse and do not declare intent in any way. Software development is not just about communicating with the computer; it's about communicating with other engineers so we can work collaboratively. Regular expressions are the antithesis of that way of thinking.
const isValidUrl = urlString => {
try {
new URL(urlString);
return true;
} catch {
return false;
}
};

I know this is just a cutesy slogan, but how could you possibly know whether a living creature is a finite state machine? What would it even mean? I know I don't respond identically to identical stimuli presented on different occasions….