Our focus has been on very-large-scale, multi-language code indexing with low-latency query times (e.g. hundreds of microseconds) to drive highly interactive developer workflows.
Specifically, things like "Go to definition," and tab completion have been in industry-leading IDEs for at least 20 years.
What's novel about Glean? It seems like a lot of hoops to jump through when Visual Studio (and Visual Studio Code) can index a very large codebase in a few seconds. (And don't require a server and database to do it.)
Perhaps a 20-second video (no sound) showing what Glean does that other IDEs don't will help get the message across?
I think you are not thinking large enough. An IDE absolutely cannot index a very large codebase and allow users to make complex queries on it. Think multiple millions of lines of code here. The use case is closer to "find me all the variables of this type or a type derived from it in all the projects at Facebook" than "go to this definition in the project I'm currently editing".
Facebook could spend a lot of money to get engineers beefy workstations, and then have each of these workstations clone the same repository and build the same index locally.
Or, they could leverage the custom built servers in their data centers (which are already more energy-efficient than the laptops), build a single index of the repo, and serve queries on-demand from IDEs throughout the company.
I could also see an analytics angle to this if it could incorporate history and track engineering trends over time. In my experience, decision making in engineering around codebase maintenance is usually rooted in “experience” or “educated guessing” rather than identifying areas of high churn in the codebase or the like.
I'd add that I didn't want to click "get started" because I didn't know if it was a thing I wanted, and then "get started" actually took me to documentation, which is not what I expect from a "get started" button. The documentation had the presumption that I wanted to use it, and thus the implication that I knew wtf "it" was.
I don't care about its efficiency, or declarative language, or any of that when I still don't know what we're talking about.
- find references / go to definition for web tools, like when reviewing pull requests
- multi-language refactoring, e.g. modifying C bindings
- building structural static analysis tools like coccinelle, or semgrep, but better
When doing large system refactoring searching by code patterns is the number one thing I'd like to have a tool for. For example being able to query for all for loops in a codebase that have a call to function X within their body.
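As a toy illustration of that kind of structural query (using Python's stdlib `ast` module on a small snippet, not Glean's actual query language), here is a sketch that finds every `for` loop whose body contains a call to a function named `X`:

```python
import ast

# Example code to search; the function name `X` is just the placeholder
# from the comment above.
source = """
for item in items:
    X(item)

for item in items:
    log(item)
"""

def for_loops_calling(tree, func_name):
    """Return line numbers of `for` loops that call `func_name` in their body."""
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.For):
            for sub in ast.walk(node):
                if (isinstance(sub, ast.Call)
                        and isinstance(sub.func, ast.Name)
                        and sub.func.id == func_name):
                    hits.append(node.lineno)
                    break
    return hits

print(for_loops_calling(ast.parse(source), "X"))  # → [2]
```

A fact-based system like Glean would answer the same question from a precomputed index rather than re-parsing the source on every query, which is what makes it feasible at multi-million-line scale.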
And what would be the disk and memory requirements for this? Could they be distributed across a handful of servers?
Imagine an IDE plugin that queries Glean over the network for symbol information about the current file, then shows that on hover. That sort of thing.
Seems like there are only indexers for Flow and Hack though.
Will there be more indexers built by Facebook, or will it rely on community contributions?
I would love to know what the use case for this tool is, aside from maybe being a source for presentations ("we have 5 million if statements").
How can this be used to improve code quality or any other aspect of the code lifecycle?
Or is it solving problems in a completely different problem area?
You would create entries like "this is a declaration of X", "this is a use of X". Then you can query things like "give me all uses of X" in sub-millisecond time. If you hook that up to an LSP server, you get almost zero-cost find-references, jump-to-definition, etc. The snappy queries also mean it becomes possible to perform whole-codebase (and cross-language) analysis. That is, answering questions like "what code is not referenced from this root?", "does this Haskell function use anything that calls malloc?" (analysis through the FFI barrier).
One can also attach all kinds of information from different sources to code entities, not only things derived from the source itself. You could add things like run-time costs, frequency of use, common errors, etc., and an LSP server could make all of it available right in your editor.
For very large or complex codebases, where it is just too expensive or too complicated to calculate this information locally, a system like this becomes very useful.
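The declaration/use idea above can be sketched as a tiny in-memory fact store (purely hypothetical shapes, nothing like Glean's real storage or Angle query language):

```python
from collections import defaultdict

class FactStore:
    """Toy fact store: record declarations and uses of symbols, then
    answer find-references / jump-to-definition style queries by lookup."""

    def __init__(self):
        self.decls = {}                # symbol -> (file, line)
        self.uses = defaultdict(list)  # symbol -> [(file, line), ...]

    def add_decl(self, symbol, file, line):
        self.decls[symbol] = (file, line)

    def add_use(self, symbol, file, line):
        self.uses[symbol].append((file, line))

    def find_references(self, symbol):
        return self.uses[symbol]

    def jump_to_definition(self, symbol):
        return self.decls.get(symbol)

# Hypothetical example data.
db = FactStore()
db.add_decl("malloc", "stdlib.h", 540)
db.add_use("malloc", "buffer.c", 12)
db.add_use("malloc", "arena.c", 88)

print(db.find_references("malloc"))     # → [('buffer.c', 12), ('arena.c', 88)]
print(db.jump_to_definition("malloc"))  # → ('stdlib.h', 540)
```

The point is that once the facts exist, queries are cheap lookups; the expensive part (building the index) is done once, centrally, rather than on every developer's machine.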
Thanks, I guess I get it now. But to enable this functionality, you'd need some form of frontend or integration into the existing build lifecycle?
Or IDE integration I guess.
One of the pain points using Kythe is wiring up the indexer to the build system. Would Glean indexers be easier to wire up for the common cases?
Another is the index post-processing, which is not very scalable in the open source version (due to go-beam having rough Flink support, for example).
Third, how does it link up references across compilation units? Is it heuristic, or does it rely on unique keys from indexers matching? And across languages?
For wiring up the indexer, there are various methods; it tends to depend very much on the language and the build system. For Flow, for example, Glean output is built into the typechecker: you run it with some flags to spit out the Glean data. For C++, you need to get the compiler flags from the build system to pass to the Clang frontend. For Java, the indexer is a compiler plugin; for Python, it's built on libCST. Some indexers send their data directly to a Glean server; others generate files of JSON that get sent using a separate command-line tool.
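To make the "files of JSON" path concrete, here is a hypothetical sketch of an indexer emitting a batch of facts as JSON for later upload. The predicate names and key shapes here are invented for illustration; the real format is defined by the Glean schema you target, so check the Glean docs before relying on any of this.

```python
import json

# Hypothetical fact batch: one declaration fact and one reference fact.
# "example.Declaration.1" / "example.Reference.1" are made-up predicate
# names, not part of any real Glean schema.
facts = [
    {
        "predicate": "example.Declaration.1",
        "facts": [
            {"key": {"name": "parseConfig", "file": "src/config.py", "line": 10}},
        ],
    },
    {
        "predicate": "example.Reference.1",
        "facts": [
            {"key": {"name": "parseConfig", "file": "src/main.py", "line": 42}},
        ],
    },
]

# An indexer would write this to a file and hand it to the upload tool.
payload = json.dumps(facts, indent=2)
print(payload)
```

The appeal of this shape is that an indexer only needs to know how to parse its own language and serialize JSON; it doesn't need to link against the database at all.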
References use different methods depending on the language. For Flow, for example, there is a fact for an import that matches up with a fact for the export in the other file. For C++, there are facts that connect declarations with definitions, and references with declarations.
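The import/export matching described above amounts to a join over facts. A minimal sketch (with made-up fact shapes, not Glean's real schema) of resolving cross-file references that way:

```python
# Hypothetical export and import facts gathered per file by an indexer.
exports = [
    {"file": "math.js", "name": "add"},
    {"file": "math.js", "name": "mul"},
]
imports = [
    {"file": "app.js", "name": "add"},
]

# Join on the exported name to connect each import site to the file
# that defines it.
by_name = {e["name"]: e["file"] for e in exports}
links = [(i["file"], i["name"], by_name[i["name"]])
         for i in imports if i["name"] in by_name]

print(links)  # → [('app.js', 'add', 'math.js')]
```

Because the join key is derived from the facts themselves, no single indexer run needs a global view of the codebase; the links fall out when the facts from different files land in the same database.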
One major limitation of Kythe is handling different versions. For example, Kythe can produce a well-connected index of Stackage, but an index of Hackage would have many holes (not all references would be found, since the unique reference name needs the library version). How does Glean handle different library versions?
EDIT: the language agnostic view is already mentioned.
Glean still seems to be a work in progress (e.g. no support for recursive queries yet), but I wonder where they're heading. I'll certainly keep an eye on the project, but I wonder how exactly Glean aims to improve upon the alternatives (or maybe it already does). From the talk linked in another comment, I guess the distinctive feature may be the planned integration with IDEs; correct me if I'm wrong. Other contenders provide great querying technology, but there is indeed no strong focus yet on making such tech really convenient and integrated.
How do I write a schema and indexer for my favorite programming language that isn't currently (and won't be) supported with official releases?
For schemas, [1] says to modify (or base new ones off) these: https://github.com/facebookincubator/Glean/tree/main/glean/s...
For indexers, it's a little less clear, but it looks like I need to write my own type checker?
And, continuing off that theme: in practical terms, how does it stand up against zoekt?
I'm curious because zoekt is kind of slow when it comes to ingesting large amounts of code, like all of the publicly available code on GitHub.
The few people using it commercially have basically had to spend a lot of time rewriting parts of it to make their goal of public code search for all attainable.
I and a few people I know are pretty convinced that there are better and easier ways / technologies to make that happen.
As long as billions of people keep using Facebook, they can maintain their own static analysis tooling for JavaScript for as long as they want.
That seems like a very tractable machine learning problem, yet all I could find was a single Python library, which looks nice but doesn't have much adoption, and requires installing the entirety of TensorFlow despite the fact that users just want a trained model and a predict() function.
Why doesn't a popular library like this exist?