(One language that does some of this is Erlang, but AFAIK the stdlib data structures like digraphs, sets-of-sets, etc. aren’t actually the ones the compiler uses, but are rather there for use by static verification tools like Dialyzer. Which means that the Erlang digraph doesn’t know how to topsort itself, even though there’s a module in the Erlang compiler application that does topsort on digraphs. Still feels like being a second-class citizen relative to the runtime’s favoured compiler.)
Rust does give you access to its internal data-structures on nightly. These change quite often, so you will need to update code that uses them pretty much every week.
Why don't many languages do this in some stable form? Because it sets the compiler's internal data-structures and APIs in stone forever, which makes it vastly harder to improve the compiler and implement new language features.
Then you have Eiffel, Smalltalk and Lisp variants.
Scheme and Lisp, for example, only expose their AST. If you only care about the AST, you can access it from stable Rust, and there are great libraries for working with it, doing AST folds, quasi-quoting, etc.
The OP wanted to work on the CFG graph. You could write a library to compute the CFG from the AST, but you don't have to because Rust exposes this as well. The CFG data-structures are much more tied to the intermediate representations of the Rust compiler, and these do change over time as new features are added to the language.
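To make that step concrete, here's a toy sketch of what "computing the CFG from the AST" means. The `Stmt`/`Block` types are made up for illustration; rustc's real HIR/MIR structures are far richer:

```rust
// Toy AST: straight-line statements plus an if/else. Not rustc's AST.
enum Stmt {
    Assign(&'static str),
    If(Vec<Stmt>, Vec<Stmt>),
}

// A basic block: straight-line statements plus successor block indices.
#[derive(Default)]
struct Block {
    stmts: Vec<&'static str>,
    succs: Vec<usize>,
}

fn new_block(cfg: &mut Vec<Block>) -> usize {
    cfg.push(Block::default());
    cfg.len() - 1
}

// Lower a statement list into `cfg`, starting in block `cur`;
// returns the block that control flow falls out of.
fn lower(stmts: &[Stmt], cfg: &mut Vec<Block>, mut cur: usize) -> usize {
    for s in stmts {
        match s {
            Stmt::Assign(name) => cfg[cur].stmts.push(*name),
            Stmt::If(then_b, else_b) => {
                // A branch splits the current block in two...
                let t = new_block(cfg);
                let e = new_block(cfg);
                cfg[cur].succs.push(t);
                cfg[cur].succs.push(e);
                let t_end = lower(then_b, cfg, t);
                let e_end = lower(else_b, cfg, e);
                // ...and both arms rejoin in a fresh join block.
                let join = new_block(cfg);
                cfg[t_end].succs.push(join);
                cfg[e_end].succs.push(join);
                cur = join;
            }
        }
    }
    cur
}

fn build_cfg(stmts: &[Stmt]) -> Vec<Block> {
    let mut cfg = vec![Block::default()];
    lower(stmts, &mut cfg, 0);
    cfg
}
```

The whole trick is just "split at branches, rejoin at merge points"; loops, early returns, and panics are what make the real thing hairy.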
Some tools do use the compiler-internal CFGs, though. For example, rust-clippy is a linter built on top of most of the compiler-internal data-structures: not only type-checking, but also CFGs. rust-semverver, a tool that detects semver-breaking changes between the last released version of a library and the current one, is built on top of the type-checking data-structures, and can deal with all kinds of generics (types, lifetimes, etc.).
These tools are typically tied to particular versions of the nightly compiler and do break often, but for example rust-clippy is distributed via rustup, so you always get a version that works with whatever nightly compiler you have, and well, you also get a version that "magically" works with a stable Rust compiler.
Other people have built all sorts of tools on top of this, from Rust interpreters to instrumentation passes that compute the maximum stack size requirement of embedded applications, e.g. by using the CFG to find the deepest possible call path of a program and the total stack-frame size in that case.
1. Implementation-specific thing that has no formal equivalent. Structs of structs of structs, with business logic intermingled with the structure. Plenty of examples of this. I’m not suggesting anyone put these in the stdlib; that’d be silly.
2. ADT with no formalism behind it, that implements a particular set of behaviours “for best performance”, with no guarantee about the implementation or the time/space complexity, and a set of exposed operations that only state what they do, not how they do it, such that you can’t guess what algorithm a given operation is going to use.
Example of #2: Objective-C’s NSDictionary. It only guarantees that it “acts like” a dictionary; it doesn’t guarantee that it’s O(1), because being O(1) with a high constant can actually be worse for performance in some cases. So instead, it makes no guarantees, and tries to optimize for performance by actually switching between different implementations as the data held reaches different size thresholds.
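To sketch the idea (Apple doesn't document NSDictionary's internals, so the threshold and the two strategies here are made up), an "adaptive" dictionary might look like this in Rust:

```rust
use std::collections::HashMap;

// Made-up threshold; a real implementation would tune this empirically.
const PROMOTE_AT: usize = 8;

// An ADT that guarantees only that it "acts like" a map: below the
// threshold it's a linear scan over a Vec (tiny constants, cache-friendly);
// above it, it promotes itself to a HashMap.
enum AdaptiveMap {
    Small(Vec<(String, i64)>),
    Large(HashMap<String, i64>),
}

impl AdaptiveMap {
    fn new() -> Self {
        AdaptiveMap::Small(Vec::new())
    }

    fn insert(&mut self, key: String, value: i64) {
        match self {
            AdaptiveMap::Small(pairs) => {
                if let Some(p) = pairs.iter_mut().find(|(k, _)| *k == key) {
                    p.1 = value;
                } else {
                    pairs.push((key, value));
                    if pairs.len() > PROMOTE_AT {
                        // Same observable behaviour, different algorithm.
                        let map: HashMap<String, i64> = pairs.drain(..).collect();
                        *self = AdaptiveMap::Large(map);
                    }
                }
            }
            AdaptiveMap::Large(map) => {
                map.insert(key, value);
            }
        }
    }

    fn get(&self, key: &str) -> Option<i64> {
        match self {
            AdaptiveMap::Small(pairs) => {
                pairs.iter().find(|(k, _)| k == key).map(|(_, v)| *v)
            }
            AdaptiveMap::Large(map) => map.get(key).copied(),
        }
    }
}
```

Because the API only promises map-like behaviour, the promotion is invisible to callers; that's exactly the freedom a #2-style ADT reserves for itself.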
I’m not suggesting anyone put these in the stdlib either, if they currently live in a specific application, because this widens the scope of their stability requirements. Right now only the compiler maintainers can optimize this ADT into doing something entirely different, and then ensure that the compiler (the only consumer) is happy with the result; if the ADT were available in the stdlib, they wouldn’t be able to test all consumers, so they’d have to be much more conservative with their changes, lest they break an edge-case. (Changing how sorting an array ADT works, for example, can break code that assumed that sorting either nearly-sorted or entirely-randomized sets was optimal.)
And, of course, sometimes in a compiler the Control Flow Graph ADT will be of this type. If it is, fine, don’t export it. But usually—because of how compiler maintainers interact heavily with compiler theorists—it’s instead the third kind:
3. Formal data structures, where the name comes from a paper which defines the expected behaviour and a base-level implementation; where either that implementation, or another paper’s pure improvement on said implementation (without changing any of the semantics) is what you can expect to find. As well, the names of all algorithms implemented against the ADT are also well-defined in papers, such that you can know what algorithm you’re using by the name of the function. (Usually, in these cases, you’ll have multiple algorithms that do the same thing differently living as sibling functions in the same module.)
Examples of #3: the geometric primitives and index search-tree types that PostGIS exposes to C-level code. Any implementation of vector clocks. Or, my favourite: regular expressions (i.e. deterministic and nondeterministic finite automata.)
Also, example to make the naming point: for a particular use-case, the use of a segment tree might be an optimization over use of an interval tree. But in code that uses these types of formal data structures, you’d never find an ADT named “IndexTree” that could happen to be either; instead, you’d expect separate SegmentTree and IntervalTree implementations, and then maybe a strategy-pattern ADT wrapping one or the other. But the concrete implementation of the formalism is very likely to exist, because people tend to just sit down and implement these things by translating pseudocode in papers into modules; and then tend to want things to stay that way, because keeping a 1:1 mapping from the module to the paper, and then linking the paper in module-doc, is one of the only good ways to get new maintainers to understand what the heck has been implemented here.
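For instance, here's a minimal segment tree, translated more or less line-for-line from the standard iterative bottom-up formulation (sums over half-open ranges; the names are mine, but the structure is the textbook one, which is the whole point about #3-style data structures):

```rust
// Segment tree over i64 sums. Leaves live at tree[n..2n]; each internal
// node i is the sum of its children 2i and 2i+1.
struct SegmentTree {
    n: usize,
    tree: Vec<i64>,
}

impl SegmentTree {
    fn new(values: &[i64]) -> Self {
        let n = values.len();
        let mut tree = vec![0; 2 * n];
        tree[n..].copy_from_slice(values);
        for i in (1..n).rev() {
            tree[i] = tree[2 * i] + tree[2 * i + 1];
        }
        SegmentTree { n, tree }
    }

    // Point update: set element i, then fix up the ancestors. O(log n).
    fn update(&mut self, mut i: usize, value: i64) {
        i += self.n;
        self.tree[i] = value;
        while i > 1 {
            i /= 2;
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1];
        }
    }

    // Sum over the half-open range [l, r). O(log n).
    fn range_sum(&self, l: usize, r: usize) -> i64 {
        let (mut sum, mut lo, mut hi) = (0, l + self.n, r + self.n);
        while lo < hi {
            if lo % 2 == 1 {
                sum += self.tree[lo];
                lo += 1;
            }
            if hi % 2 == 1 {
                hi -= 1;
                sum += self.tree[hi];
            }
            lo /= 2;
            hi /= 2;
        }
        sum
    }
}
```

An interval tree would be a genuinely different module with a different interface (stabbing queries over stored intervals), which is exactly why you'd never see both hiding behind one "IndexTree" name.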
It’s #3 that I would suggest is a good candidate for stdlib inclusion. These data structures don’t change in a way that breaks their guarantees, because their existence is predicated on a particular formalism, and nobody cares about improvements to the performance of a formalism that break the formalism. (Fun analogy: nobody would care about improvements to speed running a video game that required changing the code of the game. You’re trying to optimize that game, not some other one!)
You started arguing that the Rust compiler should expose its internal data-structures, which it does, and somehow ended arguing that the Rust standard library should expose spatial data-structures for geometry processing.
I have no idea how you got from one to the other, but writing a huge wall of text full of disorganized thoughts shows very little appreciation for the time of those participating in this discussion thread, so I won't interact with you anymore.
Doesn't mean I wouldn't like it myself, but it is a genuine drag on continued compiler development.
The nice thing about these formal data structures (or formal ADTs, I should say, though the papers themselves never tend to think of themselves as introducing an ADT) is that successive papers that find better algorithms, or make small changes to the data structure, pretty much always hold the set of operations required of the data structure constant (since that’s the only way to be able to compare these successive implementations on their performance/complexity.) So a “family” of papers about a given data structure is really a family of papers about an implicit ADT whose defined interface is the set of operations from the first paper that later papers compare themselves on.
Personally, I feel like all “formal” data structures of this type could benefit from stdlib inclusion (e.g. interval trees; ropes; CRDTs; etc.) but that’s just a personal preference.
But when a compiler is a built-in part of a runtime, its modules are in a sense already exposed as stdlib modules: they sit in a package namespace you don't have to retrieve, locked to the version of the compiler installed, such that people can actually reach in and use the compiler's ADT for formal data-structure Foo (and people do!). When that growing dependency doesn't make the language maintainers worry that they're creating an implicit stdlib module by exposing the compiler internals this way, while forgetting to track its usage for breaking changes... then I start to feel like it's a bit ridiculous.
To me, language projects should be run under the Microsoft philosophy: you don’t get to decide what your stable interfaces are. If your customers are all using and depending on feature X, then feature X should be treated as a stable interface. It’s up to you as a language maintainer to discover your stable interface requirements, and promote the things people are “treating as stable”, into being managed as stable in a language-evolutionary project-management sense.
The philosophy only works because private APIs are hidden enough to reduce the effort required to maintain backward compatibility. The fact that MS works hard (or rather, worked: it was much more vital in the Windows 95 transition) to maintain compatibility even on private dependencies doesn't mean that "private" doesn't mean anything.
I think a bigger (and certainly more visible) element of the MS philosophy is in maintaining compatibility for public, stable APIs. For example, look at how Windows 10 still has all these old Control Panel dialogs. That's not incidental compatibility with undocumented APIs; it's designed for extensibility, the maintenance of which holds the whole platform back to a certain degree.
I worked on the Delphi compiler and runtime library. Versioning was an exceedingly important concern. Patch releases couldn't break APIs because that could affect our partner market: if we affected the dependencies of third-party components, those third parties would need to re-release. The ecosystem only worked because there was a clear separation between public and private. You need private APIs because you need the flexibility to change things; if you can't change things, you must create everything perfectly the first time, and that's just not possible.
You might want to get access to the compiler internals, but if you build something nifty with that access, and that nifty thing gets widespread use, you will hold back the entire ecosystem. You will be the cause of warts like Windows 10's 20 year old Control Panel dialogs.
If you'd like to do topological sorting of a fixed set of tasks, you'd use one algorithm. Online topological sorting (that is, delta-based algorithms), as used in databases, would use a different algorithm and has different performance constraints.
I could see that in the long run we'll eventually figure out how to abstract all of the "complete/delta" stuff. I don't think we have figured it out yet.
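For the fixed-task case, the usual choice is Kahn's algorithm: repeatedly emit a node whose in-degree has dropped to zero. A minimal Rust sketch (edge-list input is my own convention here):

```rust
// Topological sort of nodes 0..n given directed edges (from, to).
// Returns None if the graph contains a cycle.
fn toposort(n: usize, edges: &[(usize, usize)]) -> Option<Vec<usize>> {
    let mut indeg = vec![0usize; n];
    let mut succs = vec![Vec::new(); n];
    for &(from, to) in edges {
        succs[from].push(to);
        indeg[to] += 1;
    }
    // Seed the worklist with every node that has no prerequisites.
    let mut ready: Vec<usize> = (0..n).filter(|&v| indeg[v] == 0).collect();
    let mut order = Vec::with_capacity(n);
    while let Some(v) = ready.pop() {
        order.push(v);
        for &w in &succs[v] {
            indeg[w] -= 1;
            if indeg[w] == 0 {
                ready.push(w);
            }
        }
    }
    // If some nodes were never emitted, they sit on a cycle.
    if order.len() == n { Some(order) } else { None }
}
```

The delta-based variants are a different beast: they maintain the order incrementally as edges arrive, and their cost model is per-insertion rather than per-sort.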
I think such a type would be less useful than you'd think, for precisely the same reason why a linked list in the stdlib is pretty much useless: almost always, you don't want the stdlib allocating nodes; you want an intrusive data structure instead. Really, the problem is that the compiler needs to store extra data with the nodes that the stdlib can't know about, but you don't want two separate types `stdlib::GraphNode` and `compiler::ControlFlowNode` because you usually need to be able to convert between these types in both directions. (One direction can be handled by embedding one of the types in the other, but the reverse direction will require overhead for an extra pointer, or horribly unsafe pointer arithmetic.)
Of course in Rust, there could still be a digraph trait and the stdlib could still provide generic algorithms.
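Something along these lines (a hypothetical trait; nothing like this is in the actual stdlib):

```rust
use std::collections::HashSet;

// A digraph abstraction: the stdlib dictates only the interface,
// never the node storage, so a compiler's CFG type could implement
// it over its own internal layout.
trait Digraph {
    fn nodes(&self) -> Vec<usize>;
    fn successors(&self, v: usize) -> Vec<usize>;
}

// A generic algorithm written purely against the trait:
// the set of nodes reachable from `start`, via iterative DFS.
fn reachable<G: Digraph>(g: &G, start: usize) -> HashSet<usize> {
    let mut seen = HashSet::new();
    let mut stack = vec![start];
    while let Some(v) = stack.pop() {
        if seen.insert(v) {
            stack.extend(g.successors(v));
        }
    }
    seen
}

// One possible implementor: plain adjacency lists.
struct AdjList {
    adj: Vec<Vec<usize>>,
}

impl Digraph for AdjList {
    fn nodes(&self) -> Vec<usize> {
        (0..self.adj.len()).collect()
    }
    fn successors(&self, v: usize) -> Vec<usize> {
        self.adj[v].clone()
    }
}
```

The catch is the one raised below: returning `Vec` and tracking visited nodes in a `HashSet` costs real overhead compared to the hand-written per-structure version, which is why compiler writers tend not to bother.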
Though it's also not so rare in compilers that nodes are members of multiple graphs simultaneously (with different edges in each, e.g. control flow nodes are typically not just part of the control flow graph, but also belong to a dominator tree). It's non-trivial to create a graph abstraction that can handle all these cases while remaining efficient (you don't want to put nodes in a HashSet just to check whether a graph algorithm already visited them), so it's not surprising that compiler developers don't bother and just write the algorithm directly for their particular data structures. In the end, most graph algorithms are only about a dozen lines, much simpler than the abstractions that would be required to re-use them across completely different graphs.
I guess that depends on the language; in dynamic, pure-functional languages (like Erlang) there’s nowhere you can go for further efficiency beyond “digraph expressed as labelled V and E sets held in dictionaries”, so the data structures themselves actually are quite generic, and you would never really need a separate specialization of them.
You’re probably right that in systems-level languages where you’re effectively programming a pointer / RAM word machine, you can get better specialized data structures per use-case, so writing the self-hosting compiler isn’t going to just “throw off” a handy reusable stdlib digraph library as a side-effect.
But in the cases where you’re not using a systems-level language—i.e. the cases where the graphwise algorithms you’re writing are being written against the same stdlib ADT anyway, even if you reimplemented it compiler-side—it doesn’t make sense to me to hide these algorithms away inside the compiler application package. They’re generic algorithms! (And since the ADT is a formal one from a paper, they’re probably very stable ones, too!)
As a restatement of my original premise: if your digraphs are generic, and topsort is a generic function on digraphs with only one obvious implementation given the digraph ADT you’re operating against; and given that you’ve implemented topsort for these stdlib digraphs; why aren’t you putting that algorithm in the stdlib, as a function of stdlib digraphs?
(And it’s not just a question of only putting the most common stuff into the stdlib. Erlang has lots of very arcane digraph operations only useful in proof-verification sitting around in the stdlib in the digraph_utils module. There’s no reason that operations only useful in compilers wouldn’t belong in there equally well. But instead, they’re in the compiler. Kinda weird, if you ask me.)
Don't C++'s std::list and Rust's std::collections::LinkedList effectively implement the moral equivalent (insofar as memory layout is concerned) of an intrusive list, though?
You can't call `std::list<Node>::erase()` with a `Node*`, and you can't write a function that converts from `Node*` to `std::list<Node>::iterator` in standard C++, even though it's really just a matter of subtracting a fixed number of bytes from the pointer value. So you instead need to store the `std::list<Node>::iterator` in the `Node`, wasting memory.
You can kind of cheat by always using `std::list<Node>::iterator` instead of `Node*`, but that means you can't use normal methods, because the `this` pointer will lack access to the iterator. And you always need to pass around a pair of `std::list&` and iterator, because the iterator alone isn't sufficient to delete the node or insert elements before/after it. Usually the code ends up being a lot simpler if class Node just reimplements a linked list.
And that's still the simple case where each node is only contained in a single data structure at a time, which (at least in compilers) is rarely sufficient.
By the same token, I'd like to see dynamic languages with a tracing JIT show what types are used in the actual runtime. There was an article posted to HN a few years back where some researchers noted that most dynamic-language servers seem to go through an initial startup period with some dynamic-type shenanigans, then settle down to a state where the types are basically static.
C# has https://docs.microsoft.com/en-us/dotnet/csharp/programming-g... (only partially what you want I think)