How Google Code Search Worked

(swtch.com)

287 points basugasubaku 14y ago 45 comments

gruseom14y ago

Google made a mistake in killing code search. Indexing the world's source code and making it searchable is so obviously part of their core mission that I wonder how this decision even got made.

Yeah, code search is a niche market numerically speaking, but intellectually and economically (considering the economic impact of software) it is vital. Google was doing so much better a job of it than anybody else that they completely owned the space. How could that not be worth what it cost them to run it? And now we are bereft of anything good.

I used to use Google Code Search like this: encounter an unfamiliar language or API construct, go to code search, get page after page of real-world usages. Or like this: wonder how to do X, imagine what code that does X might say, search for snippets of that until hitting a working example of X. It was a powerful learning device that I am sad to lose. I sure hope a startup comes along to get search right for the world's source code. GitHub ought to, but their search sucks.

In any case, congratulations Russ Cox et al. on a brilliant piece of work.

pshc14y ago

From Steve Yegge's most recent blog rant:

Now, as it happens, I am in fact working on a very cool project at Google. [...] a project that aims to turn source code -- ALL source code -- from plain text into Wikipedia.

gruseom14y ago

Ah. Well, the decision to kill Code Search would make sense if it were in favour of something better. But then why kill it now and leave nothing for any length of time? Also, there's no guarantee the new thing will turn out to actually be better.

For a company that succeeded partly by leveraging the economic value of hackers in a way that hadn't been done before, this decision is disturbingly out of character. It feels like something that must have happened for inward-facing political reasons - in other words, a sign of rot.

I get that Steve told Larry to focus, but "code" and "search" almost define focus in their case.

Edit: It seems the service is still available under a different URL. Weird, but I'll happily take it! http://news.ycombinator.com/item?id=3487950

boyter14y ago

You mentioned hoping someone can come along to get this right. I am certainly trying with http://searchco.de/

It's still a long way from matching Google Code Search in terms of code indexed (I'm amending that as I write this), but I hope to get things up to par as soon as I possibly can.

SymbolHound http://symbolhound.com/ also has a code index that's worth a look.

jauco14y ago

That's a seriously cool website. You're obviously still developing it, but way to go!

dncrane14y ago

SymbolHound developer here, thanks for the mention boyter!

inerte14y ago

Did it stop working internally? Do Google employees still get to use it?

17814y ago

http://googlesystem.blogspot.com/2012/01/google-code-search-... → http://code.google.com/codesearch

rsc14y ago

In the long term, this will be search only over code.google.com projects, as mentioned in the 'discussion' link on codesearch.google.com.

singular14y ago

I sometimes wonder whether Russ Cox is actually human or is in fact a collective pen name for a group of very talented hackers :)

cdibona14y ago

You're not alone. I like to think of him as 'Future Japan Prize winner Russ Cox'

boyter14y ago

Interesting solution. I did something a little different for searchco.de when I was implementing regex search.

Essentially I take a regex such as [cb]at and expand it out to get all of the terms, in this case cat and bat, and then do a standard full-text search based on those terms to get as close a match as possible. I then loop over the results to get back those which are exact matches.

It's actually more complex than that, with some other trickery (infix and prefix indexing), but it works reasonably well, although I am still ironing out some kinks.
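
A minimal sketch of the expand-then-filter idea described above, in Python. It only handles literals and simple `[abc]` classes, and the "full-text search" is a plain substring scan standing in for a real index; the function names `expand_classes` and `regex_search` are made up for illustration and are not searchco.de's code.

```python
import itertools
import re

def expand_classes(pattern):
    """Expand literals and simple [abc] classes into every literal term."""
    parts = []
    # Each token is either a bracketed class (group 1) or a single literal char (group 2).
    for cls, lit in re.findall(r"\[([^\]]+)\]|(.)", pattern):
        parts.append(list(cls) if cls else [lit])
    return ["".join(combo) for combo in itertools.product(*parts)]

def regex_search(pattern, docs):
    terms = expand_classes(pattern)
    # Stand-in for the full-text index lookup: docs containing any expanded term.
    candidates = [d for d in docs if any(t in d for t in terms)]
    # Final pass: keep only the candidates the real regex matches.
    rx = re.compile(pattern)
    return [d for d in candidates if rx.search(d)]

docs = ["the cat sat", "a bat flew", "a rat ran"]
print(expand_classes("[cb]at"))
print(regex_search("[cb]at", docs))
```

The expansion only stays small for patterns with a handful of alternatives; anything with `.*` or large classes needs a different strategy, which is what the follow-up question in the thread is about.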

ww52014y ago

Interesting.

  What does the wildcard match "foo.*bar" expand to?
  How do you handle those pathological cases like "f.*r"?

Edit: HN can't display the wildcard char. Reformatted it.

boyter14y ago

Much of the same. Most indexers these days allow a proximity search. So you can expand out "foo.*bar" into the search "foo << *bar".

It is less likely to pick up things like "fooabar" but assuming there are still results like "foomanchu bar" they will be found. Also assuming the default for your proximity search is OR logic you should still pick up "foobar" eventually.

As for the other case you will naturally find all sorts of things that match. But as with the method in the article the more information you give it the closer a match you will find.

I don't know if the best/worst case is any better than the linked article's, but it does work reasonably well.

bri3d14y ago

This is the most awesome kind of solution: built off of few mostly-off-the-shelf moving parts, simple and easy to understand, and entirely perfect for the problem at hand. This write-up would be awesome teaching material for anyone moving from "I know how to write a program" to "how do I build a clean, elegant system to solve a specific problem?"

arnsholt14y ago

To be fair, the reason Perl, Python and PCRE (which all use pretty much the same regex syntax) don't use the linear-time algorithms is that they can't. Features like look-around and matching what you've already matched (/(\w+) \1/ to find repeated words, for example) give you more expressivity than regular languages, but they also take away the linear-time algorithms.
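
For anyone who hasn't used one, here is the repeated-word backreference from the comment above run through Python's `re` module, which is a backtracking engine and supports `\1`. The `\b` anchors and the sample sentence are additions for the example.

```python
import re

# \1 must match exactly the text captured by group 1, which is what
# takes this pattern outside the regular languages.
repeated = re.compile(r"\b(\w+) \1\b")

m = repeated.search("it was was fine")
print(m.group(0))  # was was
print(m.group(1))  # was
```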

fasdg14y ago

As Russ points out in his earlier re2-related blog posts, these regex engines still perform non-linearly on inputs which don't involve look-around, look-behind, etc. There's plenty of room for improvement even if they want to keep these features.

snprbob8614y ago

Seems like the default should be linear runtime and you should have to explicitly ask for the richer feature set (and opt into the assertion that you trust the input not to DoS your process).

Could easily be added as a modifier (see `man perlre`), but should be implemented as two to enable explicit behavior and toggling the default. Randomly picking the letter N:

    /(\w+) \1/n    # Error: backreference is incompatible with linear runtime RE engine
    /(\w+) \1/N    # Works!
    /(\w+) \1/     # Preferably works for backwards compat, maybe overridden by an ENV var

mdwrigh214y ago

I don't think you need to even go as far as adding a modifier. A smart enough regex engine would know when it could use the linear runtime algorithm, and when it needs to fall back.
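
A rough sketch of what such a dispatch could look like, in Python. This is a textual heuristic, not how any real engine decides: it scans the pattern source for backreferences and look-around, and `choose_engine` is a hypothetical name. A real implementation would inspect the parsed pattern instead; an escaped backslash followed by a digit would fool this one, though only toward the slower, safe choice.

```python
import re

# Constructs that rule out a pure automaton matcher in this sketch:
# backreferences and look-around assertions.
_NEEDS_BACKTRACKING = re.compile(
    r"""
    \\[1-9]          # numbered backreference
  | \(\?P=           # named backreference
  | \(\?<?[=!]       # look-ahead or look-behind, positive or negative
    """,
    re.VERBOSE,
)

def choose_engine(pattern):
    """Return which matcher a hybrid engine could pick for this pattern."""
    return "backtracking" if _NEEDS_BACKTRACKING.search(pattern) else "linear"

print(choose_engine(r"(\w+) \1"))   # backtracking
print(choose_engine(r"foo.*bar"))   # linear
```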

bambax14y ago

This post is very interesting and informative, esp. the part about indexing trigrams instead of words:

> Regular expression matches do not always line up nicely on word boundaries, so the inverted index cannot be based on words like in the previous example. Instead, we can use an old information retrieval trick and build an index of n-grams, substrings of length n. This sounds more general than it is. In practice, there are too few distinct 2-grams and too many distinct 4-grams, so 3-grams (trigrams) it is.

As he explains, in order to perform a search using regular expressions, the RE is first converted to an "ordinary" query string, which finds a set of candidate documents; the candidate documents are then loaded in memory, and the "real" regular expression run against them, in order to find only the matching documents.

He used Google's retrieval engine in order to build the trigram index, but he doesn't say how he identified "code" amidst the ocean of ordinary web pages.

He does say this regarding an example implementation:

> The indexer (...) rejects files that are unlikely to be interesting, such as those that (...) have very long lines, or that have a very large number of distinct trigrams.

so maybe that's what Code Search did too.

What I'm wondering is this: wouldn't it be interesting to have a full web index based on trigrams, that would let us search not only using RE but also using wildcards (at the end of words or at the beginning)?

Maybe it would be too complex to build such an index for the whole web, but for limited corpora (such as one's mail) it would be very useful.
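
For a small corpus, the two-phase scheme summarized above (trigram index to pick candidates, then the real regex on each candidate) fits in a few lines of Python. This is a sketch under a big simplifying assumption: the article derives the trigram query from the regex automatically, whereas here the caller passes in a literal that every match must contain (`required_literal`). All names are illustrative.

```python
import re
from collections import defaultdict

def trigrams(s):
    """All distinct substrings of length 3."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(docs):
    """Inverted index: trigram -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for t in trigrams(text):
            index[t].add(doc_id)
    return index

def search(pattern, required_literal, docs, index):
    # Phase 1: intersect posting lists for every trigram of the literal.
    candidates = set(docs)
    for t in trigrams(required_literal):
        candidates &= index.get(t, set())
    # Phase 2: run the real regex over the candidates only.
    rx = re.compile(pattern)
    return sorted(d for d in candidates if rx.search(docs[d]))

docs = {
    "a.c": "int hello_world(void);",
    "b.py": "print('goodbye')",
    "c.go": "func helloWorld() {}",
}
index = build_index(docs)
print(search(r"hello_?[wW]orld", "hello", docs, index))  # ['a.c', 'c.go']
```

The same index answers leading and trailing wildcard queries, since a trigram lookup doesn't care where in a word the substring falls.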

toddc14y ago

Russ's articles are an excellent write-up and explanation.

However, many finite-state automata regex implementations have existed for years (e.g. Java http://cs.au.dk/~amoeller/automaton) without the backtracking feature, of course. Also of interest is the benchmark data at: http://tusker.org/regex/regex_benchmark.html

mdwrigh214y ago

> However, many finite-state automata regex implementations have existed for years

If you read his write-up on RegEx matching, you'll see notes that Thompson wrote an implementation in the mid-60s, so he definitely doesn't claim they're new. What he does claim is that most regex libraries don't use them, even when the regex they're matching to doesn't require backtracking.

staunch14y ago

I'd love some idea of how large the index was for code search, how many machines it required, and how much total code it was searching.

ch14y ago

And here I had always thought Google Code Search was based on some kind of fancy federated radix-tree. Very nice design Russ.

petdog14y ago

This is a really great tool. If it could take the first .csearchindex going up the tree as the current index (somewhat like git does with .git dirs), it could easily top rgrep/ack for searching within projects (just add line numbers and some match coloring).
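
The lookup being asked for is easy to sketch. A hypothetical helper in Python, not part of the csearch tool, that walks from a starting directory toward the filesystem root and returns the first `.csearchindex` it finds:

```python
from pathlib import Path

def find_index(start, name=".csearchindex"):
    """Return the nearest `name` file in `start` or any parent directory, else None."""
    current = Path(start).resolve()
    # Check the starting directory first, then each ancestor in turn.
    for directory in (current, *current.parents):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None

print(find_index("."))
```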

ximeng14y ago

Not only does Russ Cox write code and English really well, but he's also on HN. Thanks for the article rsc!

http://news.ycombinator.com/user?id=rsc

dr_rezzy14y ago

I like. Word splitting is very interesting. Today, in 2012, I would be hard pressed to provide a reason to use this technique. Splitting an index is a classic complexity/resource trade-off (your index has a very predictable, compact footprint). Again, today, memory is cheap, wide, uniform, and predictable. Indexes are now cheap and highly specialized. Complexity can be reduced for simplicity. Index specialization now becomes natural. My point here is that this solves a class of very expensive searches with ease: leading wildcard searches et al. Also, I couldn't really tell from your code (you may be doing this), but reverse your trigrams in your generated query. If ordered properly, your search will be a lot more efficient.

dmoy14y ago

What were the criteria for determining that there were too few distinct 2-grams and too many distinct 4-grams? Did it just come down to too much memory for the latter, and barely enough memory for 3-grams?

rsc14y ago

Compare 256^2, 256^3, 256^4.
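
Spelling the comparison out, assuming each position in an n-gram is one byte with 256 possible values (which is what the powers of 256 suggest):

```python
# Number of possible distinct n-grams over single bytes.
counts = {n: 256 ** n for n in (2, 3, 4)}

for n, total in counts.items():
    print(f"{n}-grams: 256^{n} = {total:,}")

# 2-grams: 256^2 = 65,536
# 3-grams: 256^3 = 16,777,216
# 4-grams: 256^4 = 4,294,967,296
```

About 65 thousand possible keys means nearly every 2-gram shows up in nearly every file, so the index filters almost nothing; about 4 billion possible keys makes for a very large index. Roughly 16 million sits in between.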

brown9-214y ago

Wow, is there anything at Google that Jeff Dean didn't have a hand in?

_investigator14y ago

tl;dr

The original basic RE and extended RE (when backreferencing is not used) are significantly faster than implementations that most programmers traditionally rave about, e.g., Perl RE.

Tell me something I didn't know.

He thus used such 30-year-old code as a model and easily topped the speeds of the built-in RE capabilities of today's popular scripting languages.

Common sense is underrated.

rsc14y ago

Wow, everyone's a cynic. Did you miss the part about the trigram index?

It sounds like you are replying to regexp1.html, not regexp4.html.

_investigator14y ago

No. You misread. I love Russ Cox's way of thinking and his continual generosity in sharing his work. I too use the Plan 9 base sed and grep. I'm cynical about the people who rave on about perl/javascript/python/ruby/whatever regexp, who derive some perverse joy in ridiculously complex RE and who often diss sed and all things old school as being slow or deficient in some way.

dennisgorelik14y ago

It's a nice design (in particular the trigram index), but overall the product still failed.

My guess is that regular expression search is not as useful as the full-text search that general Google Search does.

ajasmin14y ago

Finding code with the regular Google Search is nearly impossible though.

dennisgorelik14y ago

I'm finding code with regular Google Search all the time.

Yes, general Google Search is missing some neat features, but overall these features are not as important as the convenience of using familiar general search queries, the search speed, and the size of the general Google Search index.

BTW, do you have your own explanation of why Google Code Search was cancelled?
