Yeah, code search is a niche market numerically speaking, but intellectually and economically (considering the economic impact of software) it is vital. Google was doing so much better a job of it than anybody else that they completely owned the space. How could that not be worth what it cost them to run it? And now we are bereft of anything good.
I used to use Google Code Search like this: encounter an unfamiliar language or API construct, go to code search, get page after page of real-world usages. Or like this: wonder how to do X, imagine what code that does X might say, search for snippets of that until hitting a working example of X. It was a powerful learning device that I am sad to lose. I sure hope a startup comes along to get search right for the world's source code. GitHub ought to, but their search sucks.
In any case, congratulations Russ Cox et al. on a brilliant piece of work.
Now, as it happens, I am in fact working on a very cool project at Google. [...] a project that aims to turn source code -- ALL source code -- from plain text into Wikipedia.
For a company that succeeded partly by leveraging the economic value of hackers in a way that hadn't been done before, this decision is disturbingly out of character. It feels like something that must have happened for inward-facing political reasons - in other words, a sign of rot.
I get that Steve told Larry to focus, but "code" and "search" almost define focus in their case.
Edit: It seems the service is still available under a different URL. Weird, but I'll happily take it! http://news.ycombinator.com/item?id=3487950
It's still a long way from matching Google Code Search in terms of code indexed (amending that as I write this), but I hope to get things up to par as soon as I possibly can.
Symbolhound http://symbolhound.com/ also has a code index that's worth a look.
Essentially I take a regex such as [cb]at and expand it out to get all of the terms, in this case cat and bat, and then do a standard full-text search based on those terms to get as close a match as possible. I then loop over the results to get back those which are exact matches.
It's actually more complex than that, with some other trickery (infix and prefix indexing), but it works reasonably well, although I am still ironing out some kinks.
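For anyone curious what the expand-then-filter idea looks like, here is a minimal sketch, not the actual implementation. It assumes only literal characters and simple bracket classes, and it fakes the full-text search with a substring check; the function names are made up for illustration.

```python
import itertools
import re

def expand_char_classes(pattern):
    """Expand simple bracket classes like [cb]at into literal terms.

    Assumption: only literal characters and [...] classes of literal
    characters are handled; ranges, negation, and quantifiers are not.
    """
    parts = []  # one list of alternatives per position in the pattern
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            end = pattern.index("]", i)
            parts.append(list(pattern[i + 1:end]))
            i = end + 1
        else:
            parts.append([pattern[i]])
            i += 1
    return ["".join(combo) for combo in itertools.product(*parts)]

def expand_and_filter_search(pattern, documents):
    terms = expand_char_classes(pattern)
    # Stand-in for the full-text search: keep documents containing any term.
    candidates = [d for d in documents if any(t in d for t in terms)]
    # Loop over the candidates and keep only the exact regex matches.
    regex = re.compile(pattern)
    return [d for d in candidates if regex.search(d)]
```

A real full-text engine would rank the candidates instead of doing a substring scan, but the second pass with the real regex would stay the same.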
What does the wildcard match "foo.*bar" expand to?
How do you handle those pathological cases like "f.*r"?
Edit: HN can't display the wildcard char. Reformatted it.
It is less likely to pick up things like "fooabar", but assuming there are still results like "foomanchu bar", they will be found. Also, assuming the default for your proximity search is OR logic, you should still pick up "foobar" eventually.
As for the other case you will naturally find all sorts of things that match. But as with the method in the article the more information you give it the closer a match you will find.
I don't know if the best/worst case is any better than the linked article's, but it does work reasonably well.
Could easily be added as a modifier (see `man perlre`), but should be implemented as two to enable explicit behavior and toggling the default. Randomly picking the letter N:
/(\w+) \1/n # Error: backreference is incompatible with linear runtime RE engine
/(\w+) \1/N # Works!
/(\w+) \1/ # Preferably works for backwards compat, maybe overridden by an ENV var
> Regular expression matches do not always line up nicely on word boundaries, so the inverted index cannot be based on words like in the previous example. Instead, we can use an old information retrieval trick and build an index of n-grams, substrings of length n. This sounds more general than it is. In practice, there are too few distinct 2-grams and too many distinct 4-grams, so 3-grams (trigrams) it is.
As he explains, in order to perform a search using regular expressions, the RE is first converted to an "ordinary" query string, which finds a set of candidate documents; the candidate documents are then loaded in memory, and the "real" regular expression run against them, in order to find only the matching documents.
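Here is a rough sketch of that two-phase idea, under some loud assumptions: the index is a plain in-memory dict, and the literals that every match must contain are passed in by hand (deriving them from the regex automatically is the hard part the article covers). The names are hypothetical, not taken from his code.

```python
import re
from collections import defaultdict

def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(docs):
    """Map each trigram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index

def trigram_search(regex, required_literals, docs, index):
    """Phase 1: intersect posting lists. Phase 2: run the real regex.

    required_literals is an assumption of this sketch: substrings that
    any match must contain, supplied by the caller.
    """
    candidates = set(docs)
    for literal in required_literals:
        for gram in trigrams(literal):
            candidates &= index.get(gram, set())
    compiled = re.compile(regex)
    return sorted(d for d in candidates if compiled.search(docs[d]))
```

Note that a literal shorter than three characters contributes no trigrams, so a query like "f.*r" narrows nothing and falls back to running the regex over every document.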
He used Google's retrieval engine in order to build the trigram index, but he doesn't say how he identified "code" amidst the ocean of ordinary web pages.
He does say this regarding an example implementation:
> The indexer (...) rejects files that are unlikely to be interesting, such as those that (...) have very long lines, or that have a very large number of distinct trigrams.
so maybe that's what Code Search did too.
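A filter along those lines could be as simple as the sketch below. The two thresholds are numbers I made up for illustration; the quote doesn't give the real limits, and the function name is hypothetical.

```python
def looks_indexable(text, max_line_len=2000, max_trigrams=20000):
    """Return False for files an indexer would probably want to skip.

    Both limits are assumed values, not the ones any real indexer uses.
    """
    # Very long lines suggest minified or generated content.
    if any(len(line) > max_line_len for line in text.splitlines()):
        return False
    # A huge number of distinct trigrams suggests binary or random data.
    distinct = {text[i:i + 3] for i in range(len(text) - 2)}
    return len(distinct) <= max_trigrams
```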
What I'm wondering is this: wouldn't it be interesting to have a full web index based on trigrams, that would let us search not only using RE but also using wildcards (at the end of words or at the beginning)?
Maybe it would be too complex to build such an index for the whole web, but for limited corpora (such as one's mail) it would be very useful.
However, many finite-state automata regex implementations have existed for years (e.g. Java http://cs.au.dk/~amoeller/automaton) without the backtracking feature, of course. Also of interest is the benchmark data at: http://tusker.org/regex/regex_benchmark.html
If you read his write-up on RegEx matching, you'll see notes that Thompson wrote an implementation in the mid-60s, so he definitely doesn't claim they're new. What he does claim is that most regex libraries don't use them, even when the regex they're matching to doesn't require backtracking.
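To make the "no backtracking" point concrete, here is a toy set-of-states matcher in the spirit of that approach. It is only a sketch: it assumes a tiny regex subset (literals, '.', and postfix '*' and '?') and is not anybody's production engine. It tracks every state the pattern could be in at once, so the work is bounded by pattern length times text length.

```python
def compile_pattern(pattern):
    """Compile a tiny regex subset into a list of (char, quantifier) steps.

    Supported: literal characters, '.', and postfix '*' and '?'.
    No groups, alternation, classes, or escapes.
    """
    steps = []
    i = 0
    while i < len(pattern):
        char = pattern[i]
        quant = ""
        if i + 1 < len(pattern) and pattern[i + 1] in "*?":
            quant = pattern[i + 1]
            i += 1
        steps.append((char, quant))
        i += 1
    return steps

def _closure(steps, states):
    """Add every state reachable by skipping optional ('*' or '?') steps."""
    stack = list(states)
    seen = set(states)
    while stack:
        s = stack.pop()
        if s < len(steps) and steps[s][1] in ("*", "?") and s + 1 not in seen:
            seen.add(s + 1)
            stack.append(s + 1)
    return seen

def full_match(pattern, text):
    """Return True if the whole text matches the pattern."""
    steps = compile_pattern(pattern)
    states = _closure(steps, {0})
    for ch in text:
        nxt = set()
        for s in states:
            if s == len(steps):
                continue  # already past the last step; extra input kills it
            char, quant = steps[s]
            if char == "." or char == ch:
                # A starred step can repeat; anything else advances.
                nxt.add(s if quant == "*" else s + 1)
        states = _closure(steps, nxt)
        if not states:
            return False
    return len(steps) in states
```

The classic stress test is a pattern like "a?" repeated n times followed by "a" repeated n times, matched against n copies of "a": the state set never grows beyond the number of steps plus one, so there is nothing to blow up.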
The original basic RE and extended RE (when backreferencing is not used) are significantly faster than implementations that most programmers traditionally rave about, e.g., Perl RE.
Tell me something I didn't know.
He thus used such 30-year-old code as a model and easily topped the speeds of the built-in RE capabilities of today's popular scripting languages.
Common sense is underrated.
It sounds like you are replying to regexp1.html, not regexp4.html.
My guess is that regular expression search is not as useful as the full-text search that general Google Search does.
Yes, general Google Search is missing some neat features, but overall these features are not as important as the convenience of familiar general search queries, search speed, and the size of the general Google Search index.
BTW, do you have your own explanation of why Google Code Search was cancelled?