As this article describes, doing this requires per-language integrations and also effectively being able to "run the build" for any given code (because e.g. the C++ header search path can vary on a per-source-file basis), which is untenable for a codebase as large and varied as GitHub's. However, if you can make it work, you get the benefit of the compiler's full understanding of the code's semantics, which matters most in complex languages like C++ or, say, Rust.
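To make the per-file point concrete: in Clang's JSON compilation database format (`compile_commands.json`), every translation unit carries its own compile command, so two files in the same repository can have entirely different `-I` search paths. A hypothetical snippet (paths and flags invented for illustration):

```json
[
  {
    "directory": "/src/project",
    "file": "net/socket.cc",
    "command": "clang++ -Inet/include -Ithird_party/ssl -c net/socket.cc"
  },
  {
    "directory": "/src/project",
    "file": "ui/widget.cc",
    "command": "clang++ -Iui/include -Ithird_party/skia -c ui/widget.cc"
  }
]
```

An indexer that wants compiler-accurate results has to replay each file's own command to even resolve its `#include`s correctly, which is effectively running the build.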
For example, this[1] method call refers to a symbol generated by a chain of macros, but the browser is still able to point you at its definition.
It's an interesting tradeoff to make: the GitHub approach likely doesn't handle corner cases like the above but it makes up for it in broad applicability and performance. I recall an IDE developer once telling me they made a similar tradeoff in code completion, in that it's better DX to pop up completions quickly even if they're "only" 99% correct.
(To be clear, I absolutely think the approach taken in the article was the right one for the domain they're working in, I was just contrasting it against my experience in a similar problem where we took a very different approach.)
[1] https://source.chromium.org/chromium/chromium/src/+/main:v8/...
The build-based approach that you describe is also used by the Language Server Protocol (LSP) ecosystem. You've summarized the tradeoffs quite well! I've described a bit more about why we decided against a build-based/LSP approach in [1] and [2]. One of the biggest deciding factors is that at our scale, incremental processing is an absolute necessity, not a nice-to-have.
[1] https://github.blog/2021-12-09-introducing-stack-graphs/
[2] https://dcreager.net/talks/2021-strange-loop/
I think they help, but ultimately I expect you need a compiler to solve the absolute madness of the totality of C++. For example, I think getting argument-dependent lookup right in the presence of 'auto' requires type information. And there are other categories of things (like header search paths) where I think you are forced to involve the build system too.
Sourcegraph decided early on to take the opposite approach, favoring precision and accuracy over supporting every public codebase. Part of the reason why is that we aren't a code host serving millions of open-source repositories, so we didn't feel the need to support all of those at once. Another big reason is that we heard from our users and customers that code navigation accuracy was critical for exploring their private code and staying in flow (inaccurate results break the train of thought because you have to actively think about how to navigate to the referenced symbol). We actually built out language-agnostic, search-based code navigation, but user feedback increasingly drove us toward a more precise model, based at first on our own protocol (https://srclib.org) and later on the LSIF protocol open-sourced by Microsoft, which now powers code navigation for many popular editor extensions.
This is not to say that GitHub's approach is wrong, but more to say that it's interesting how different goals and constraints have led to systems that are quite different despite tackling the same general problem: GitHub aims to provide some level of navigation for every repository on GitHub, while Sourcegraph aims to provide best-in-class navigation for private codebases and their dependencies.
(Btw, hats off to the GitHub team for open-sourcing tree-sitter, a great library which we've incorporated into parts of our stack. We actually hosted the creator of tree-sitter, Max Brunsfeld, on our podcast a while back, and it was a really fun and insightful conversation if people are interested in hearing some of the backstory of tree-sitter: https://about.sourcegraph.com/podcast/max-brunsfeld.)
this means you'd have to set up CI/CD to rebuild the index each time you make changes to the code, plus host it yourself

what's great about GitHub's approach is that they take that obligation away from customers
Are you talking about a self-hosted or private repo? In that case, I'm not familiar with it.
I'm also using that parser for a side project where developers can cross-link their source code and host it statically: https://github.com/josephmate/OdinCodeBrowser#readme
i’m doing a git-related project myself and use it to generate symbols for source code
if you’re into it too, i recommend also checking out LSIF: https://github.com/Microsoft/language-server-protocol/blob/m...
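for anyone curious what an LSIF dump actually looks like: it's newline-delimited JSON describing a graph of vertices (documents, ranges, result sets) and edges connecting them. a rough sketch of a single definition link (ids, positions, and the file path are made up; see the spec for the real details):

```json
{ "id": 1, "type": "vertex", "label": "document", "uri": "file:///src/main.ts", "languageId": "typescript" }
{ "id": 2, "type": "vertex", "label": "range", "start": { "line": 0, "character": 9 }, "end": { "line": 0, "character": 12 } }
{ "id": 3, "type": "vertex", "label": "resultSet" }
{ "id": 4, "type": "edge", "label": "next", "outV": 2, "inV": 3 }
{ "id": 5, "type": "vertex", "label": "definitionResult" }
{ "id": 6, "type": "edge", "label": "textDocument/definition", "outV": 3, "inV": 5 }
{ "id": 7, "type": "edge", "label": "item", "outV": 5, "inVs": [2], "document": 1 }
```

the nice property is that a consumer can answer "go to definition" by walking this pre-computed graph, without running any language tooling at query time.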