lambda c: 3*len(matches(c, uncovered)) - len(c)
Here's a trivial way to explore it: generalize the heuristic to H(a, b):

    H(a, b) = lambda c: a*len(matches(c, uncovered)) - b*len(c)

The original heuristic is H(3, 1) under this definition. Then we can play around with a and b to see if we'd get smaller results:

    def findregex_lambda(winners, losers, a, b):
        "Find a regex that matches all winners but no losers (sets of strings)."
        # Make a pool of candidate components, then pick from them to cover winners.
        # On each iteration, add the best component to 'cover'; finally disjoin them together.
        pool = candidate_components(winners, losers)
        cover = []
        while winners:
            best = max(pool, key=lambda c: a*len(matches(c, winners)) - b*len(c))
            cover.append(best)
            pool.remove(best)
            winners = winners - matches(best, winners)
        return '|'.join(cover)
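For context, findregex_lambda leans on two helpers from Norvig's notebook, matches and candidate_components. Here is a simplified, self-contained sketch of what they do — the real candidate_components also generates wildcard variants like 'E.P', whereas this stand-in uses only literal substrings:

```python
import re

def matches(regex, strings):
    "Subset of strings that the regex matches (anywhere in the string)."
    return {s for s in strings if re.search(regex, s)}

def candidate_components(winners, losers, maxlen=4):
    "Simplified pool: substrings of winners (up to maxlen chars) found in no loser."
    parts = {w[i:j]
             for w in winners
             for i in range(len(w))
             for j in range(i + 1, min(i + maxlen, len(w)) + 1)}
    # keep only components that reject every loser; escape regex metacharacters
    return {re.escape(p) for p in parts if not any(p in loser for loser in losers)}
```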
>>> findregex_lambda(starwars, startrek, 3, 1)
' T|E.P| N'
>>> findregex_lambda(starwars, startrek, 3, 2)
' T|B| N| M'
Or, to automate this:

    def best_H_heuristic(winners, losers):
        d = {(a, b): len(findregex_lambda(winners, losers, a, b))
             for a in range(0, 4) for b in range(0, 4)}
        return min(d, key=d.get)
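The min(d, key=d.get) idiom at the end picks the dictionary key whose stored value (here, the regex length) is smallest. A toy illustration with made-up lengths:

```python
# hypothetical regex lengths keyed by (a, b) weighting
d = {(3, 1): 14, (3, 2): 12, (2, 1): 14}

# min over the keys, compared by their associated values
best = min(d, key=d.get)
```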
>>> best_H_heuristic(starwars, startrek)
(3,1)
Looks like H(3,1) is pretty good for this case. What about the NFL teams?

>>> best_H_heuristic(nfl_in, nfl_out)
(3, 2)
>>> findregex_lambda(nfl_in, nfl_out, 3, 1)
'pa|g..s|4|fs|sa|se|lt|os'
>>> findregex_lambda(nfl_in, nfl_out, 3, 2)
'pa|ch|4|e.g|sa|se|lt|os'
Not the best heuristic there. H(3,1) wins or ties for the boys/girls set, the left/right set, and the drugs/cities set, which just goes to show that picking a heuristic off a gut guess isn't such a bad approach.

You could also explore heuristics of different forms:

    M(a, b, d, e) = lambda c: a*len(matches(c, uncovered))**b - d*len(c)**e

Or try a completely different shape:

    L(a, b) = lambda c: a*log(len(matches(c, uncovered))) - b*len(c)

1. The greedy algorithm has an O(log n) approximation ratio, meaning it produces a regex guaranteed to use a number of terms within a multiplicative O(log n) factor of the optimal regex.
2. Unless P = NP, set cover cannot be approximated better than the greedy algorithm does. In other words, the only general solutions you'll find (unless you're using some special insight about how regular expressions cover sets of strings) will be no better than a constant-factor improvement in produced regex size over the greedy algorithm.
That being said, regexes (especially disjunctions of small regexes) are not arbitrary sets. So this problem is a special case of set cover, and it certainly may have efficient exact solutions.
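The heuristic families H, M, and L above can be sketched as parameterized scoring functions. To keep the sketch self-contained, the signatures are simplified to take the match count and component length directly rather than closing over uncovered:

```python
import math

def H(a, b):
    # linear score: reward matches, penalize length
    return lambda matched, length: a * matched - b * length

def M(a, b, d, e):
    # polynomial variant; note that exponentiation in Python is **, not ^
    return lambda matched, length: a * matched ** b - d * length ** e

def L(a, b):
    # logarithmic variant: diminishing returns on additional matches
    return lambda matched, length: a * math.log(matched) - b * length
```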
    winners, losers = (winners - losers), (losers - winners)

http://codegolf.stackexchange.com/questions/17718/meta-regex...

That link includes a Perl 10-liner to do the same.
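The set-difference line above matters because a string appearing in both sets makes the task impossible: no regex can match it and also reject it. A quick sketch:

```python
winners = {'ab', 'cd', 'xy'}
losers = {'cd', 'ef'}

# 'cd' appears in both sets, so it is dropped from each side.
# The right-hand tuple is built from the old values before either
# name is rebound, so the two differences don't interfere.
winners, losers = (winners - losers), (losers - winners)
```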
1. Take an existing solution to finding the shortest regex given a list of inclusions and exclusions.
2. Take another unrelated arbitrary program.
3. Combine the two: when the arbitrary program terminates, use the existing solution to solve the problem.
4. Such a regex finder would have to solve the halting problem to know whether the arbitrary program terminates and the existing solution solves the problem.
5. Since the halting problem cannot be solved, no such finder can exist.
EDIT: possibly down-voted because someone thought it was sarcastic???
I was actually thinking of this problem before the XKCD comic, for efficiently detecting hashes on hard drives...
(note the inspiration was a nefarious regex scanner for finding bitcoin hashes. I have no intention of building such a thing, but the idea of a random detector regex intrigued me. Is it possible?)
My idea was more like: calculate the statistics of character n-grams in English (all of which exist in random noise too), then count the most unlikely occurrences until you hit some probabilistic threshold that marks a well-thought-out decision boundary.
If the former, you're possibly better off just looking for high-entropy chunks of data of the right size. Of course, that'll match all kinds of encrypted data too, but there may not be much you can do there.
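A minimal sketch of that high-entropy approach: score fixed-size chunks by Shannon entropy in bits per byte. Random or encrypted bytes approach 8.0, hex-encoded hashes top out near 4.0 (only 16 symbols), and English prose sits lower still. The actual threshold would need tuning, so treat this purely as an illustration:

```python
import math
from collections import Counter

def shannon_entropy(chunk: bytes) -> float:
    "Entropy of the chunk's byte distribution, in bits per byte."
    counts = Counter(chunk)
    n = len(chunk)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# a repeated byte carries no information; two equally likely bytes carry one bit
low = shannon_entropy(b'aaaaaaaa')
mid = shannon_entropy(b'abababab')
```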
    key=lambda c: 3*len(matches(c, uncovered)) - len(c)

Why the "3"? Good question. It would be interesting to see what happens with other relative weights for the number of matches versus the length.
The clue is how he explains all the simple parts of the script with comments, but leaves this part unexplained, making some of us feel left out of the cool club. Implying: if you don't grasp it immediately you must not be up to it.
Overall it's a great post, but as to the cryptic parts, it's disappointing to see this kind of thing from someone we all look up to. It would be more generous of spirit if he had written this in a readable, self-explanatory way, like the rest of the code... as Python should be written.
Sigh. Norvig has a posse, but I thought it had to be said.
>>> import this
...
"Readability matters"
...
"Sparse is better than dense"
...
"If the implementation is hard to explain, it's a bad idea"

Second, seriously?
A line of code that isn't as clear as it could be in an ipython notebook he probably dashed off in an hour is somehow a reflection on his generosity of spirit? I think you need to recheck your expectations.
This would fail against a 'must fail' target of abcabc, but then you fix that by extending the maximum allowable regex fragment length from 4 to 5, and it'll find ^abc$. More generally, extend the maximum allowable fragment length to the length of the longest 'must match' string plus 2, and it'll always succeed, even if it has to create a regex consisting of ^word1$|^word2$|^word3$...
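That fallback construction, with each winner anchored on both ends, can be sketched in a few lines (re.escape guards against metacharacters in the words):

```python
import re

def exact_regex(winners):
    "Always-correct fallback: one fully anchored alternative per winner."
    return '|'.join('^' + re.escape(w) + '$' for w in sorted(winners))

# ^abc$ matches 'abc' but not 'abcabc', since both anchors must hold
pattern = exact_regex({'abc'})
```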
practical tradeoffs, but yes, his preferred language is Python.
Though it cares about the size of the AST rather than the concrete syntax. I can't try running it now; I'm on an iPad.
http://cstheory.stackexchange.com/questions/16860/minimizing...
Apparently this problem is still open.
I'm still not happy with my 214 on Alphabetical including one false match (I was 202 or something with everything correctly matched).
If so, yes, that's what a saved ipython notebook is. See: http://ipython.org/notebook.html for an overview of ipython notebooks.
You can export it to html, latex, etc using "ipython nbconvert --to <format_you_want>". See: http://ipython.org/ipython-doc/dev/interactive/notebook.html...
Basically, the website you're seeing (nbviewer) hosts ipython notebooks (json files) and converts them to static html for viewing.
Is there any nice bundle of everything you need for ipython, for the Mac, that's both easy to install and uninstall?
/M | [TN]|B/
is suboptimal, but could be / [TMN]|B/
But that (and the article) leaves out the subtitle for Star Trek 1: "The Motion Picture". For that, Randall's original expression works.

Judging by the amount of fawning here, this guy must be an HN celebrity. Interesting post nevertheless!
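That edge case is easy to check directly. The uppercase subtitle sets below are reconstructed from memory of the strip, so treat them as assumptions:

```python
import re

starwars = {'THE PHANTOM MENACE', 'ATTACK OF THE CLONES', 'REVENGE OF THE SITH',
            'A NEW HOPE', 'THE EMPIRE STRIKES BACK', 'RETURN OF THE JEDI'}
motion_picture = 'THE MOTION PICTURE'

xkcd = 'M | [TN]|B'      # Randall's original
shorter = ' [TMN]|B'     # the merged character class

# both cover every Star Wars subtitle...
assert all(re.search(xkcd, t) for t in starwars)
assert all(re.search(shorter, t) for t in starwars)

# ...but only the original also rejects "The Motion Picture"
assert not re.search(xkcd, motion_picture)
assert re.search(shorter, motion_picture)
```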
I can only hope, one day, I'd be writing and publishing joyful little hacks like this, to such general applause, instead of eking out a living. I have to say I'm a bit envious here!
Well done to the dude. An inspirational post, in many ways.