For example, <http://example.com./>, <http:///example.com/>, and <https://en.wikipedia.org/wiki/Space (punctuation)> are classified as invalid URLs in the blog post, but browsers accept them.
As the creator of cURL puts it, there is no URL standard[3].
[1]: https://www.ietf.org/rfc/rfc3986.txt
[2]: https://www.ietf.org/rfc/rfc3987.txt
[3]: https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/
Yeah, that's it: that's what they're trying to validate.
^(?:http(?:(?:://(?:(?:(?:code\.google\.com/events/#&product=browser|\-\.~_!\$&'\(\)\*\+,;=:%40:80%2f::::::@ex\.com|foo\.(?:bar/\?q=Test%20URL\-encoded%20stuff|com/(?:\(something\)\?after=parens|unicode_\(\)_in_parens|b_(?:\(wiki\)(?:_blah)?#cite\-1|b(?:_\(wiki\)_\(again\)|/))))|uid(?::password@ex\.com(?::8080)?/|@ex\.com(?::8080)?/)|www\.ex\.com/wpstyle/\?p=364|223\.255\.255\.254|उदाहरण\.परीक्षा|1(?:42\.42\.1\.1/|337\.net)|مثال\.إختبار|df\.ws/123|a\.b\-c\.de|\.ws/䨹|⌘\.ws/|例子\.测试|j\.mp)|142\.42\.1\.1:8080/)|\.damowmow\.com/)|s://(?:www\.ex\.com/foo/\?bar=baz&inga=42&quux|foo_bar\.ex\.com/))|://(?:uid(?::password@ex\.com(?::8080)?|@ex\.com(?::8080)?)|foo\.com/b_b(?:_\(wiki\))?|⌘\.ws))|ftp://foo\.bar/baz)$
I had to replace some words with shorter ones to squeeze under the 1000-character limit, and there's no way to provide negative examples right now. Something to fix!

I guess this is the regex equivalent of overfitting :)
This kind of feels like a magic spell :)
In search of the perfect URL validation regex - https://news.ycombinator.com/item?id=10019795 - Aug 2015 (77 comments)
In search of the perfect URL validation regex - https://news.ycombinator.com/item?id=7928968 - June 2014 (81 comments)
At Transcend, we need to allow site owners to regulate arbitrary network traffic, so our data flow input UI¹ was designed to detect all valid hosts (including local hosts, IDNs, IPv6 literal addresses, etc.) and URLs (host-relative, protocol-relative, and absolute). If the site owner inputs content that is not a valid host or URL, we treat their input as a regex.
I came up with these simple utilities built on top of the URL interface standard² to detect all valid hosts & URLs:
• isValidHost: https://gist.github.com/eligrey/6549ad0a635fa07749238911b429...
Example valid inputs:
host.example
はじめよう.みんな (IDN domain; xn--p8j9a0d9c9a.xn--q9jyb4c)
[::1] (IPv6 address)
0xdeadbeef (IPv4 address; 222.173.190.239)
123.456 (IPv4 address; 123.0.1.200)
123456789 (IPv4 address; 7.91.205.21)
localhost
• isValidURL (and isValidAbsoluteURL): https://gist.github.com/eligrey/443d51fab55864005ffb3873204b...

Example valid inputs to isValidURL:
https://absolute-url.example
//relative-protocol.example
/relative-path-example
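Those surprising IPv4 spellings fall straight out of WHATWG URL host parsing, which you can check with the standard URL class itself (a quick sketch, not the gists above; `hostOf` is a made-up helper):

```javascript
// The WHATWG URL parser canonicalizes IPv4 hosts, including hex
// (0x...) and bare-integer forms, into dotted-quad notation.
const hostOf = (s) => new URL(`http://${s}`).hostname;
```

For instance, `hostOf("0xdeadbeef")` and `hostOf("123456789")` come back as the dotted-quad forms annotated in the list above.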
1. https://docs.transcend.io/docs/configuring-data-flows

I've found some reasoning[0] as to why it's not supported with browsers in mind, though.
Well, that should make things a lot easier. What does he mean here? The rest of the text doesn't make it clear to me, unless it's meant to be "every possibly valid HTTP, HTTPS, or FTP URL" which isn't exactly "the opposite".
> Assume that this regex will be used for a public URL shortener written in PHP, so URLs like http://localhost/, //foo.bar/, ://foo.bar/, data:text/plain;charset=utf-8,OHAI and tel:+1234567890 shouldn’t pass (even though they’re technically valid). Also, in this case I only want to allow the HTTP, HTTPS and FTP protocols.
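For what it's worth, those constraints are easy to express on top of a real parser instead of a regex; a sketch using the WHATWG URL class (`passesChallenge` is a made-up name, and the allow-list comes straight from the quote):

```javascript
// Accept only absolute http/https/ftp URLs, and reject localhost,
// per the quoted challenge. Relative forms like //foo.bar/ make the
// URL constructor throw, so they fall out naturally.
const ALLOWED = new Set(["http:", "https:", "ftp:"]);

const passesChallenge = (s) => {
  let u;
  try {
    u = new URL(s);
  } catch {
    return false; // not an absolute URL at all
  }
  return ALLOWED.has(u.protocol) && u.hostname !== "localhost";
};
```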
So, does this apply to URLs? The fact that these regexes are... so huge... makes me think that something is fundamentally wrong. Are URLs describable in a Chomsky Type 3 grammar? Are they sufficiently regular that using a regex is sensible? What do the actual browsers do?
[1] https://stackoverflow.com/questions/1732348/regex-match-open...
URLs are not recursive structures, so I’d say the single hardest feature of html is not present.
There was a Perl program that would take something like a BNF and barf out a gigantic regex (maybe with some maximum depth).
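That trick is simple enough to sketch: expand each nonterminal into a non-capturing group, and emit a never-matching branch once the depth budget runs out (a toy reconstruction in JS, not the actual Perl tool):

```javascript
// grammar: rule name -> array of space-separated alternatives;
// tokens that name a rule recurse, everything else is a literal.
const toRegex = (grammar, rule, depth) => {
  if (depth === 0) return "(?!)"; // (?!) never matches: the depth cut-off
  const alts = grammar[rule].map((alt) =>
    alt
      .split(" ")
      .map((tok) =>
        tok in grammar
          ? toRegex(grammar, tok, depth - 1)
          : tok.replace(/[.*+?^${}()|[\]\\]/g, "\\$&") // escape literals
      )
      .join("")
  );
  return "(?:" + alts.join("|") + ")";
};
```

With `{ S: ["a S b", "c"] }` and depth 4 this yields a regex matching aⁿcbⁿ up to n = 3, which is exactly the "maximum depth" caveat: the output is only regular because the recursion is truncated.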
Yes: if your regex is longer than {.../50/100/...} characters, then write a parser.
I struggle to understand why people write those crazy regexes for emails, URLs, and HTML when nearly every popular technology has battle-tested parsers for those things.
As an example, http://localhost/ is a technically valid URL, which he wants to block. Should this error say "misformatted URL" like all the others?
Using regex to cover all such cases is really the wrong tool for the job.
Most URL validation rules/regexes/libraries/etc. reject "example.com". However, if you head over to Stripe (for example), in the account settings, when asked for your company's URL, Stripe will accept "example.com" and assume the "http://" prefix (which, yes, can have its own problems).
What's a good solution? I both want to validate URLs, but also let users enter "example.com". But if I simply do
if(validateURL(url)) {
return true;
} else if(validateURL("http://" + url)) {
return true;
} else {
return false;
}
i.e. validate the given URL and, as a fallback, try to validate "http://" + the given URL, that opens the door to weird non-URL strings being incorrectly validated... Help :-)
if (validateURL(url)) {
return url;
} else if (validateURL("http://" + url)) {
return "http://" + url;
} else {
return null;
}

Or even:

if (!validateURL(url)) {
url = "http://" + url;
if (!validateURL(url)) {
url = null;
}
}
return url;
to snip a small probability of a bug?

I'd probably do the same for poorly formatted URLs. When the user hits Submit, a prompt appears saying, "Did you mean `https://example.com`?"
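If the helper is built on the URL parser rather than a regex, you can narrow that door a bit: retry with an "http://" prefix, but insist on a dotted hostname so bare words don't validate (a sketch; `normalizeUrl` and the dot heuristic are my own assumptions, not a standard rule):

```javascript
// Try the input as-is, then with an http:// prefix; accept only
// http(s) results whose hostname contains a dot, so strings like
// "hello" don't slip through as http://hello/.
function normalizeUrl(input) {
  for (const candidate of [input, "http://" + input]) {
    let u;
    try {
      u = new URL(candidate);
    } catch {
      continue;
    }
    if (
      (u.protocol === "http:" || u.protocol === "https:") &&
      u.hostname.includes(".")
    ) {
      return u.href; // canonicalized URL
    }
  }
  return null;
}
```

It still can't tell you whether example.com was what the user meant, only that it parses; the confirmation prompt handles the rest.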
If it's really important, you could try making a request to the URL and seeing if it loads, but that still doesn't validate that it's the URL they intended to input.
Might be cool to load the URL with Puppeteer and capture a screenshot of the page. If they can't recognize their own website, it's on them.
@^(https?|ftp)://[^\s/$.?#].[^\s]*$@iS
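Roughly the same thing in JavaScript (the @...@ pair is just the PCRE delimiter and S is a PCRE study modifier, neither of which exists in JS):

```javascript
// Port of the PHP pattern above: scheme, then a first host character
// that isn't whitespace or one of /$.?#, then at least one more
// non-whitespace-containing tail.
const simpleUrl = /^(https?|ftp):\/\/[^\s/$.?#].[^\s]*$/i;
```

It happily accepts plenty of junk after the first couple of characters, which is the trade-off being made here.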
The simpler the better, if you're going to use something that is not ideal.

I just discovered this yesterday, and I'm glad I didn't have to come up with this:
https://semver.org/#is-there-a-suggested-regular-expression-...
My use case for it: https://github.com/typesense/typesense-website/blob/25562d02...
After I did this, the interviewer stopped me and told me in a negative way that he expected me to use a regex, which kinda shows he had no idea how a web browser works.
Unless it was a specific URL with specific params?
Like, for just the params part (yes, broken and simplistic):
#!/usr/bin/perl
$_="a=b&c=d&e=f&whatever=some thing";
while (s/^([^&]*)=([^&]*)(&|$)//) {
print "[$1] [$2]\n";
}

My imagined remedies are no 1:1 interviews and recording these sessions for "possible quality assurance and training purposes".
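For the params part specifically, the battle-tested built-in (in browsers and Node) is URLSearchParams, which does the same job as the Perl loop earlier:

```javascript
// Parses a query string into decoded key/value pairs.
const params = new URLSearchParams("a=b&c=d&e=f&whatever=some thing");
for (const [key, value] of params) {
  console.log(`[${key}] [${value}]`);
}
```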
TLDR: it’s a traditional parser—a big state machine that steps through the URL character by character and tokenizes it into the relevant pieces.
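In miniature, that kind of character-by-character machine looks something like this (a toy with a handful of states, nowhere near the real parser's coverage of IPv6 literals, userinfo, relative URLs, and so on):

```javascript
// Toy URL tokenizer: one pass over the input, switching states as
// delimiter characters (:, /, ?, #) appear.
function tokenizeUrl(input) {
  const out = { scheme: "", host: "", path: "", query: "", fragment: "" };
  let state = "scheme";
  for (const c of input) {
    switch (state) {
      case "scheme":
        if (c === ":") state = "slashes";
        else out.scheme += c;
        break;
      case "slashes": // skip the // after the scheme
        if (c !== "/") {
          state = "host";
          out.host += c;
        }
        break;
      case "host":
        if (c === "/") { state = "path"; out.path = "/"; }
        else if (c === "?") state = "query";
        else if (c === "#") state = "fragment";
        else out.host += c;
        break;
      case "path":
        if (c === "?") state = "query";
        else if (c === "#") state = "fragment";
        else out.path += c;
        break;
      case "query":
        if (c === "#") state = "fragment";
        else out.query += c;
        break;
      case "fragment":
        out.fragment += c;
        break;
    }
  }
  return out;
}
```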
And if they still insist even after the source of Chromium and FF has been consulted. Well then it’s time to leave. Don’t want to work with anyone like that.
u.checkURL = function (string) {
if ($.type(string) === "string") {
if (/^(https?|ftp):(\/\/(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*@)?((\[(|(v[\da-f]{1,}\.(([a-z]|\d|-|\.|_|~)|[!\$&'\(\)\*\+,;=]|:)+))\])|((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=])*)(:\d*)?)(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)*)*|(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)*)*)?)|((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)*)*)|((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)){0})(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)|\/|\?)\*)?$/i.test(string)) {
return true;
} else {
return false;
}
} else {
return false;
}
}

if (predicate) {
return true;
} else {
return false;
}
Can just be written as: return predicate;
So your code above can be: return $.type(string) === "string" && /regex/i.test(string);
Regexes like this are truly hideous, though; they may as well be written in Brainfuck for all their lack of maintainability and readability.

I will never understand why regular expressions are considered the best tool for the job when it comes to parsing; they are far too terse and do not declare intent in any way. Software development is not just about communicating with the computer; it's about communicating with other engineers so we can work collaboratively. Regular expressions are the antithesis of that way of thinking.
const isValidUrl = urlString => {
try {
new URL(urlString);
return true;
} catch {
return false;
}
};

I know this is just a cutesy slogan, but how could you possibly know whether a living creature is a finite state machine? What would it even mean? I know I don't respond identically to identical stimuli presented on different occasions….