Our focus has been on very-large-scale, multi-language code indexing with low-latency query times (e.g. hundreds of microseconds) to drive highly interactive developer workflows.
Specifically, things like "Go to definition," and tab completion have been in industry-leading IDEs for at least 20 years.
What's novel about Glean? It seems like a lot of hoops to jump through when Visual Studio (and Visual Studio Code) can index a very large codebase in a few seconds. (And don't require a server and database to do it.)
Perhaps a 20-second video (no sound) showing what Glean does that other IDEs don't will help get the message across?
I think you are not thinking large enough. An IDE absolutely cannot index a very large codebase and allow users to make complex queries on it. Think multiple millions of lines of code here. The use case is closer to "find me all the variables of this type or a type derived from it in all the projects at Facebook" than "go to this definition in the project I'm currently editing".
Facebook could spend a lot of money to get engineers beefy workstations, and then have each of these workstations clone the same repository and build the same index locally.
Or, they could leverage the custom built servers in their data centers (which are already more energy-efficient than the laptops), build a single index of the repo, and serve queries on-demand from IDEs throughout the company.
I could also see an analytics angle to this if it could incorporate history and track engineering trends over time. In my experience, decision making in engineering around codebase maintenance is usually rooted in “experience” or “educated guessing” rather than identifying areas of high churn in the codebase or the like.
I'd add that I didn't want to click "get started" because I didn't know if it was a thing I wanted, and then "get started" actually took me to documentation, which is not what I expect from a "get started" button. The documentation had the presumption that I wanted to use it, and thus the implication that I knew wtf "it" was.
I don't care about its efficiency, or declarative language, or any of that when I still don't know what we're talking about.
- find references / go to definition for web tools, like when reviewing pull requests
- multi-language refactoring, e.g. modifying C bindings
- building structural static analysis tools like coccinelle, or semgrep, but better
When doing large system refactoring searching by code patterns is the number one thing I'd like to have a tool for. For example being able to query for all for loops in a codebase that have a call to function X within their body.
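As a toy illustration of that kind of structural query (using Python's stdlib `ast` module on a small snippet, not Glean's actual query language), here is a sketch that finds every `for` loop whose body contains a call to a function named `X`:

```python
import ast

# Example code to search; the function name `X` is just the placeholder
# from the comment above.
source = """
for item in items:
    X(item)

for item in items:
    log(item)
"""

def for_loops_calling(tree, func_name):
    """Return line numbers of `for` loops that call `func_name` in their body."""
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.For):
            for sub in ast.walk(node):
                if (isinstance(sub, ast.Call)
                        and isinstance(sub.func, ast.Name)
                        and sub.func.id == func_name):
                    hits.append(node.lineno)
                    break
    return hits

print(for_loops_calling(ast.parse(source), "X"))  # → [2]
```

A fact-based system like Glean would answer the same question from a precomputed index rather than re-parsing the source on every query, which is what makes it feasible at multi-million-line scale.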
And what would be the disk and memory requirements for this? Could they be distributed across a handful of servers?
Imagine an IDE plugin that queries Glean over the network for symbol information about the current file, then shows that on hover. That sort of thing.
Seems like there are only indexers for Flow and Hack though.
Will there be more indexers built by Facebook, or will it rely on community contributions?
I would love to know what the use case for this tool is, aside from maybe being a source for presentations ("we have 5 million if statements").
How can this be used to improve code quality or any other aspect of the code lifecycle?
Or is it solving problems in a completely different problem area?
You would create entries like "this is a declaration of X", "this is a use of X". Then you can query things like "give me all uses of X" in sub-millisecond time. If you hook that up to an LSP server, you get almost zero-cost find-references, jump-to-definition, etc. The snappy queries also mean it becomes possible to perform whole-codebase (and cross-language) analysis. That is, answering questions like "what code is not referenced from this root?", "does this Haskell function use anything that calls malloc?" (analysis through the FFI barrier).
One can also attach all kinds of information from different sources to code entities, not only things derived from the source itself. You could add things like run-time costs, frequency of use, common errors, etc., and an LSP server could make all of it available right in your editor.
For very large or complex codebases, where it is just too expensive or too complicated to calculate this information locally, a system like this becomes very useful.
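The declaration/use idea above can be sketched as a tiny in-memory fact store (purely hypothetical shapes, nothing like Glean's real storage or Angle query language):

```python
from collections import defaultdict

class FactStore:
    """Toy fact store: record declarations and uses of symbols, then
    answer find-references / jump-to-definition style queries by lookup."""

    def __init__(self):
        self.decls = {}                # symbol -> (file, line)
        self.uses = defaultdict(list)  # symbol -> [(file, line), ...]

    def add_decl(self, symbol, file, line):
        self.decls[symbol] = (file, line)

    def add_use(self, symbol, file, line):
        self.uses[symbol].append((file, line))

    def find_references(self, symbol):
        return self.uses[symbol]

    def jump_to_definition(self, symbol):
        return self.decls.get(symbol)

# Hypothetical example data.
db = FactStore()
db.add_decl("malloc", "stdlib.h", 540)
db.add_use("malloc", "buffer.c", 12)
db.add_use("malloc", "arena.c", 88)

print(db.find_references("malloc"))     # → [('buffer.c', 12), ('arena.c', 88)]
print(db.jump_to_definition("malloc"))  # → ('stdlib.h', 540)
```

The point is that once the facts exist, queries are cheap lookups; the expensive part (building the index) is done once, centrally, rather than on every developer's machine.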
Thanks, I guess I get it now. But to enable this functionality, you'd need some form of frontend or integration into the existing build lifecycle?
Or IDE integration I guess.
One of the pain points using Kythe is wiring up the indexer to the build system. Would Glean indexers be easier to wire up for the common cases?
Another is the index post-processing, which is not very scalable in the open source version (due to go-beam having rough Flink support, for example).
Third, how does it link up references across compilation units? Is it heuristic, or does it rely on unique keys from indexers matching? And across languages?
For wiring up the indexer, there are various methods; it tends to depend very much on the language and the build system. For Flow, for example, Glean output is built into the typechecker: you run it with some flags to spit out the Glean data. For C++, you need to get the compiler flags from the build system to pass to the Clang frontend. For Java, the indexer is a compiler plugin; for Python, it's built on libCST. Some indexers send their data directly to a Glean server; others generate files of JSON that get sent using a separate command-line tool.
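To make the "files of JSON" path concrete, here is a hypothetical sketch of an indexer emitting a batch of facts as JSON for later upload. The predicate names and key shapes here are invented for illustration; the real format is defined by the Glean schema you target, so check the Glean docs before relying on any of this.

```python
import json

# Hypothetical fact batch: one declaration fact and one reference fact.
# "example.Declaration.1" / "example.Reference.1" are made-up predicate
# names, not part of any real Glean schema.
facts = [
    {
        "predicate": "example.Declaration.1",
        "facts": [
            {"key": {"name": "parseConfig", "file": "src/config.py", "line": 10}},
        ],
    },
    {
        "predicate": "example.Reference.1",
        "facts": [
            {"key": {"name": "parseConfig", "file": "src/main.py", "line": 42}},
        ],
    },
]

# An indexer would write this to a file and hand it to the upload tool.
payload = json.dumps(facts, indent=2)
print(payload)
```

The appeal of this shape is that an indexer only needs to know how to parse its own language and serialize JSON; it doesn't need to link against the database at all.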
References use different methods depending on the language. For Flow, for example, there is a fact for an import that matches up with a fact for the export in the other file. For C++, there are facts that connect declarations with definitions, and references with declarations.
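The import/export matching described above amounts to a join over facts. A minimal sketch (with made-up fact shapes, not Glean's real schema) of resolving cross-file references that way:

```python
# Hypothetical export and import facts gathered per file by an indexer.
exports = [
    {"file": "math.js", "name": "add"},
    {"file": "math.js", "name": "mul"},
]
imports = [
    {"file": "app.js", "name": "add"},
]

# Join on the exported name to connect each import site to the file
# that defines it.
by_name = {e["name"]: e["file"] for e in exports}
links = [(i["file"], i["name"], by_name[i["name"]])
         for i in imports if i["name"] in by_name]

print(links)  # → [('app.js', 'add', 'math.js')]
```

Because the join key is derived from the facts themselves, no single indexer run needs a global view of the codebase; the links fall out when the facts from different files land in the same database.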
One major limitation of Kythe is handling different versions. For example, Kythe can produce a well-connected index of Stackage, but an index of Hackage would have many holes (not all references would be found, since the unique reference name needs the library version). How does Glean handle different library versions?
EDIT: the language agnostic view is already mentioned.
Glean still seems to be a work in progress (e.g. no support for recursive queries yet), but I wonder where they're heading. I'll certainly keep an eye on the project, but I wonder how exactly Glean aims to improve upon the alternatives (or maybe it already does). From the talk linked in another comment, I guess the distinctive feature may be the planned integration with IDEs; correct me if I'm wrong. Other contenders provide great querying technology, but there is indeed no strong focus yet on making such tech really convenient and integrated.
How do I write a schema and indexer for my favorite programming language that isn't currently (and won't be) supported with official releases?
For schemas, [1] says to modify (or base new ones off) these: https://github.com/facebookincubator/Glean/tree/main/glean/s...
For indexers, it's a little less clear, but it looks like I need to write my own type checker?
And, continuing off that theme: in practical terms, how does it stand up against zoekt?
I'm curious because zoekt is kind of slow when it comes to ingesting large amounts of code, like all of the publicly available code on GitHub.
The few people using it commercially have basically had to spend a lot of time rewriting parts of it to make their goal of public code search for all attainable.
I and a few people I know are pretty convinced that there are better and easier ways / technologies to make that happen.
As long as billions of people keep using Facebook, they can maintain their own static analysis tooling for JavaScript for as long as they want.
That seems like a very tractable machine learning problem, yet all I could find was a single Python library, which looks nice but doesn't have much adoption, and requires installing the entirety of TensorFlow despite the fact that users just want a trained model and a predict() function.
Why doesn't a popular library like this exist?