As this article describes, doing this requires per-language integrations and also effectively being able to "run the build" for any given code (because e.g. the C++ header search path can vary on a per-source-file basis), which is untenable for a codebase as large and varied as GitHub's. However, if you can make it work, you get the benefit of the compiler's full understanding of the code's semantics, which matters most in complex languages like C++ or, say, Rust.
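To make the per-file point concrete: in Clang's JSON compilation database format (`compile_commands.json`), every translation unit carries its own compile command, so two files in the same repository can have entirely different `-I` search paths. A hypothetical snippet (paths and flags invented for illustration):

```json
[
  {
    "directory": "/src/project",
    "file": "net/socket.cc",
    "command": "clang++ -Inet/include -Ithird_party/ssl -c net/socket.cc"
  },
  {
    "directory": "/src/project",
    "file": "ui/widget.cc",
    "command": "clang++ -Iui/include -Ithird_party/skia -c ui/widget.cc"
  }
]
```

An indexer that wants compiler-accurate results has to replay each file's own command to even resolve its `#include`s correctly, which is effectively running the build.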
For example, this[1] method call refers to a symbol generated by a chain of macros, but the browser is still able to point you at its definition.
It's an interesting tradeoff to make: the GitHub approach likely doesn't handle corner cases like the above but it makes up for it in broad applicability and performance. I recall an IDE developer once telling me they made a similar tradeoff in code completion, in that it's better DX to pop up completions quickly even if they're "only" 99% correct.
(To be clear, I absolutely think the approach taken in the article was the right one for the domain they're working in, I was just contrasting it against my experience in a similar problem where we took a very different approach.)
[1] https://source.chromium.org/chromium/chromium/src/+/main:v8/...
The build-based approach that you describe is also used by the Language Server Protocol (LSP) ecosystem. You've summarized the tradeoffs quite well! I've described a bit more about why we decided against a build-based/LSP approach in [1] and [2]. One of the biggest deciding factors is that at our scale, incremental processing is an absolute necessity, not a nice-to-have.
[1] https://github.blog/2021-12-09-introducing-stack-graphs/
[2] https://dcreager.net/talks/2021-strange-loop/
I think they help, but ultimately I expect you need a compiler to solve the absolute madness of the totality of C++. For example, I think getting argument-dependent lookup right in the presence of 'auto' requires type information. And there are other categories of things (like header search paths) where I think you are forced to involve the build system too.
Sourcegraph decided early on to take the opposite approach, favoring precision and accuracy over supporting every public codebase. Part of the reason why is that we aren't a code host serving millions of open-source repositories, so we didn't feel the need to support all of those at once. Another big reason is that we heard from our users and customers that code navigation accuracy was critical for exploring their private code and staying in flow (inaccurate results break the train of thought because you have to actively think about how to navigate to the referenced symbol). We actually built out language-agnostic, search-based code navigation, but user feedback increasingly drove us toward a more precise model, based at first on our own protocol (https://srclib.org) and later on the LSIF protocol open-sourced by Microsoft, which now powers code navigation for many popular editor extensions.
This is not to say that GitHub's approach is wrong, but more to say that it's interesting how different goals and constraints have led to systems that are quite different despite tackling the same general problem: GitHub aims to provide some level of navigation for every repository on GitHub, while Sourcegraph aims to provide best-in-class navigation for private codebases and their dependencies.
(Btw, hats off to the GitHub team for open-sourcing tree-sitter, a great library which we've incorporated into parts of our stack. We actually hosted the creator of tree-sitter, Max Brunsfeld, on our podcast a while back, and it was a really fun and insightful conversation if people are interested in hearing some of the backstory of tree-sitter: https://about.sourcegraph.com/podcast/max-brunsfeld.)
this means you'd have to set up CI/CD to rebuild the index each time you make changes to the code, plus host it yourself

what's great about GitHub's approach is that they take that obligation away from customers
Are you talking about a self-hosted or private repo? In that case, I'm not familiar with it.
I'm also using that parser for a side project where developers can cross-link their source code and host it statically: https://github.com/josephmate/OdinCodeBrowser#readme
i’m doing a git-related project myself and use it to generate symbols for source code
if you’re into it too, i recommend also checking out LSIF: https://github.com/Microsoft/language-server-protocol/blob/m...
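for anyone curious what an LSIF dump actually looks like: it's newline-delimited JSON describing a graph of vertices (documents, ranges, result sets) and edges connecting them. a rough sketch of a single definition link (ids, positions, and the file path are made up; see the spec for the real details):

```json
{ "id": 1, "type": "vertex", "label": "document", "uri": "file:///src/main.ts", "languageId": "typescript" }
{ "id": 2, "type": "vertex", "label": "range", "start": { "line": 0, "character": 9 }, "end": { "line": 0, "character": 12 } }
{ "id": 3, "type": "vertex", "label": "resultSet" }
{ "id": 4, "type": "edge", "label": "next", "outV": 2, "inV": 3 }
{ "id": 5, "type": "vertex", "label": "definitionResult" }
{ "id": 6, "type": "edge", "label": "textDocument/definition", "outV": 3, "inV": 5 }
{ "id": 7, "type": "edge", "label": "item", "outV": 5, "inVs": [2], "document": 1 }
```

the nice property is that a consumer can answer "go to definition" by walking this pre-computed graph, without running any language tooling at query time.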