In a compiler I built, I had a simple ASTState type parameter on every node of the AST, indicating what information was available in the tree. But you could still traverse the entire AST generically, as long as you did not depend on any of the additional information.
You can basically lift the implementation from this paper: https://www.microsoft.com/en-us/research/uploads/prod/2016/1...
Obviously I understand that if this is the limitation, then there are ways to work around this, and it all depends on how annoying the workarounds are.
But I’m curious whether there is a trick that allows it. All the tricks I can think of involve boxing the ASTs you’re about to store in the list (which comes with memory/performance hits), and I’d love to be proven wrong.
That way if you have a type Expr a, you can have a list type [Some Expr].
You can even have HLists, but then you are also pretty far up the ivory tower. I don't think that's worth it.
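For reference, the boxing trick mentioned above can be sketched with an existential wrapper. This is a minimal sketch; `Some` and the toy `Expr` type here are my own illustrative names, not from the thread:

```haskell
{-# LANGUAGE GADTs #-}

-- An existential wrapper: 'Some' forgets the index 'a', so values of
-- Expr Int, Expr Bool, ... can all live in one list.
data Some f where
  Some :: f a -> Some f

-- A toy indexed expression type (illustrative only).
data Expr a where
  IntLit  :: Int  -> Expr Int
  BoolLit :: Bool -> Expr Bool

exprs :: [Some Expr]
exprs = [Some (IntLit 1), Some (BoolLit True)]
```

The cost is exactly the boxing the comment worries about: once a value is inside `Some`, the index can only be recovered by pattern matching on the payload.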
You just have a big dict of types, which are keyed by the AST node
It seemed a bit weird to me, but it certainly works. I guess I usually use dicts that are keyed by value rather than by the address of the instance.
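A sketch of that side-table approach, assuming each node carries an explicit id that the table is keyed by (all names here are mine, not from the comment):

```haskell
import qualified Data.Map.Strict as Map

type NodeId = Int

-- Each node carries its own id (illustrative AST).
data Exp = Num NodeId Int
         | Add NodeId Exp Exp

nodeId :: Exp -> NodeId
nodeId (Num i _)   = i
nodeId (Add i _ _) = i

-- A pass records its results in a side table instead of growing the AST.
data Ty = TyInt deriving (Eq, Show)

typeTable :: Map.Map NodeId Ty
typeTable = Map.fromList [(0, TyInt), (1, TyInt)]

typeOf :: Exp -> Maybe Ty
typeOf e = Map.lookup (nodeId e) typeTable
```

Keying by an id rather than by the node's address also survives tree rewrites, as long as ids are preserved.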
Being playful, I've generally implemented the tree grammars as literal constructions of the tree nodes themselves, so the grammars can self-check.
I've used this approach in Ruby and Java though, not in functional languages with pattern matching and destructuring, so I wasn't leaving those powerful tools on the table.
In particular I have given up trying to represent ASTs using classes. I don't think they deliver a lot of value. Behavior generally belongs outside the class, not inside: transformations should be collected together as an algorithm rather than distributed across multiple definitions. Double dispatch for type-safe tree navigation is unpleasant, and you lose control over the typing of context. You end up with explicit typecasts, you've lost what little type safety you had, and you still can't enforce constraints on how the tree changes from phase to phase.
https://rustc-dev-guide.rust-lang.org/part-3-intro.html
Additionally, many language implementations use a mid-level IR that sits above LLVM: Rust has MIR, Swift has SIL, Julia has an SSA-form IR, etc. The article mentions GHC's Core, which is also an example of an IR that comes after type checking but before LLVM.
In a toy C compiler I wrote, I had a pass that transformed structs and arrays, turning everything into pointer arithmetic -- that was the only thing the pass did. After that phase, the IR no longer needs to know about arrays or structs.
Though I found a downside with that method: if I encountered an error after parsing, since I had repeatedly transformed the program, it was impossible to give a meaningful error message. Which line in the source, or which source-level feature, corresponds to the broken AST node? That kind of information is just as tricky to carry around in the AST as type tagging.
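The kind of single-purpose lowering pass described above might look something like this sketch (my own toy IR, not the commenter's actual code):

```haskell
-- Before lowering, the IR has an Index node; after, only address arithmetic.
data IR = Const Int
        | Add IR IR
        | Mul IR IR
        | Load IR              -- read from a computed address
        | Index IR IR Int      -- base, index, element size (pre-lowering only)
        deriving (Eq, Show)

-- The one thing this pass does: turn Index into Load(base + index * size).
lowerIndex :: IR -> IR
lowerIndex (Index base idx sz) =
  Load (Add (lowerIndex base) (Mul (lowerIndex idx) (Const sz)))
lowerIndex (Add a b) = Add (lowerIndex a) (lowerIndex b)
lowerIndex (Mul a b) = Mul (lowerIndex a) (lowerIndex b)
lowerIndex (Load e)  = Load (lowerIndex e)
lowerIndex e         = e
```

Note that after this pass, nothing in the type system stops an `Index` node from sneaking back in; that is exactly the per-phase invariant problem the rest of the thread is about.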
Love seeing that even Rust compiler devs are forced to be verbose and don't bother breaking some of this code out into functions, because specifying the constraints is too much work.
Can someone who's more familiar with compilers ELI10 the problem? A layman like me is left wondering: why not add (say) a type field to each node?
It’s an example of a classic tradeoff between enforcing correctness and invariants “by construction” vs at runtime. Since the field is nullable, the AST is not necessarily correct by construction, as the programmer may have forgotten to include a type when required. The correctness of compilers using such representations involves either checked or unchecked assertions where the type field is used to ensure that it is set appropriately.
The article continues with various strategies to avoid needing to defer this correctness checking to the runtime by providing various statically-checked alternatives.
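Concretely, the "nullable field" representation the parent describes might look like this in Haskell (a sketch; the names are mine):

```haskell
data Ty = TyInt | TyBool deriving (Eq, Show)

-- The type annotation is a Maybe: Nothing until the type checker runs.
data Exp = Num Int (Maybe Ty)
         | Bool Bool (Maybe Ty)

-- Every consumer after type checking must re-check the invariant at runtime:
typeOf :: Exp -> Ty
typeOf (Num _ (Just t))  = t
typeOf (Bool _ (Just t)) = t
typeOf _                 = error "typeOf: node was never type-checked"
```

The `error` branch is the runtime assertion the comment mentions; nothing forces the programmer to have filled the field in before calling `typeOf`.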
Of course, for conveniently and safely manipulating the AST in memory in $programming_language, you're probably going to want to define some structs/ADTs/whatever that contain only the data a given compilation stage is actively working with.
I've been thinking that what I need is a system that allows me to quickly define different lower-level datatypes for representing different views of the conceptual types and automate, to some degree, translation between them, so then each part of the system can work with objects designed specifically to be processed by it with minimal fuss.
A technical reason for avoiding those specialized types might be that the computer then has to spend more time transforming from one schema to the next. I would think that in practice this isn't any worse than having to do a lot of null checks.
A more human reason is that it could mean a combinatorial explosion of AST types. I guess this is where my idea about lightweight variations comes in.
In TypeScript this kind of thing might not be so bad, since any object can be downcast with no cost to a type that contains a subset of the information, and variations on types can be easily defined without even necessarily being named, e.g. `ASTNode & HasResultType & HasSourceLocation`.
> Both GHC’s core IR, Ur/Web's core and Coq are explicitly typed in this way.
This is how GHC used to do it, but in 2018 GHC adopted the "trees that grow" pattern.
The paper https://www.jucs.org/jucs_23_1/trees_that_grow/jucs_23_01_00... (the paper is actually from 2017)
Light GHC docs on it: https://gitlab.haskell.org/ghc/ghc/-/wikis/implementing-tree... (I recall seeing more thorough docs somewhere). I will quote:
> The new HsSyn AST supporting the TTG idiom (from now on referred to as TTG HsSyn) is designed to subsume five different representations of Haskell syntax:
> AST GhcPs: the AST used in GHC's parsing phase
> AST GhcRn: the AST used in GHC's renaming phase
> AST GhcTc: the AST used in GHC's typechecking phase
> AST TH: the AST used in Template Haskell
> AST HSE: the AST used in an external tool such as Haskell-Src-Exts
This means that you have a single generic tree for all of those passes, but each pass has a different instantiation which has custom data.
Also, some discussion of applications https://www.reddit.com/r/haskell/comments/7zo9mb/any_applica... and slides here http://data.tmorris.net/talks/trees-that-grow/bc4ae902ca1d3b...
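A heavily cut-down sketch of the idiom (my own simplification, not GHC's actual HsSyn): a phase index selects per-constructor extension fields via type families, and an extension constructor XExp lets a phase add variants.

```haskell
{-# LANGUAGE TypeFamilies #-}
import Data.Void (Void)

-- Phase markers.
data Parsed
data Typechecked

-- One type family per extension point.
type family XNum p    -- extra field on Num in phase p
type family XXExp p   -- payload of the extension constructor in phase p

data Exp p = Num (XNum p) Int
           | XExp (XXExp p)

-- Instantiations per phase.
type instance XNum Parsed      = ()       -- nothing extra after parsing
type instance XNum Typechecked = String   -- e.g. a rendered type
type instance XXExp Parsed      = Void    -- no extra variants while parsing
type instance XXExp Typechecked = Void

numValue :: Exp p -> Maybe Int
numValue (Num _ n) = Just n
numValue _         = Nothing
```

Setting an extension point to `Void` makes the extra field or constructor unconstructible in that phase, which is how the single generic tree can still rule variants in or out per pass.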
The thing to observe here is that, in the blog post in the OP, when you added data to a given type, it was inserted as a concrete type (named Type here); that is, you went from
data Exp = Num Int
         | Bool Bool
         | Var Var
         | If Exp Exp Exp
         | Lambda Var Exp
         | App Exp Exp
to

type TExp = (TExp', Type)

data TExp' = TNum Int
           | TBool Bool
           | TVar Var
           | TIf TExp TExp TExp
           | TLambda Var TExp
           | TApp TExp TExp
But you could have kept it a generic type (here called a, rather than Type). This is a common pattern in both Haskell and Rust (and maybe other languages), and isn't the Trees That Grow pattern just yet, but it's like this:

type GenericTExp a = (GenericTExp' a, a)
data GenericTExp' a = TNum Int
                    | TBool Bool
                    | TVar Var
                    | TIf (GenericTExp a) (GenericTExp a) (GenericTExp a)
                    | TLambda Var (GenericTExp a)
                    | TApp (GenericTExp a) (GenericTExp a)
Now you can recover the original TExp with

type TExp = GenericTExp Type
And you can recover the bare Exp, with no added data, by putting () (the unit type) there:

type Exp = GenericTExp ()
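Put together as a runnable sketch (I've kept only a few constructors):

```haskell
data Type = TyInt | TyBool deriving (Eq, Show)

data GenericTExp' a = TNum Int
                    | TBool Bool
                    | TIf (GenericTExp a) (GenericTExp a) (GenericTExp a)

type GenericTExp a = (GenericTExp' a, a)

type TExp = GenericTExp Type  -- the typed tree
type Exp  = GenericTExp ()    -- the bare tree: the annotation is just ()

typed :: TExp
typed = (TNum 1, TyInt)

bare :: Exp
bare = (TNum 1, ())
```

One parameterized definition, two instantiations: the same traversal code can work over either tree as long as it is polymorphic in the annotation.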
Now, the paper also deals with the problems of how to add new variants to a data type (it has a "| Something" that is open ended; this is called an "extension constructor" in the GHC wiki, and denoted by XExpr), and how to add different data types to different nodes (instead of directly adding Type as the generic parameter, you add a marker and use type-level functions -- called type families in Haskell -- to select the data type you're adding). When you figure out all those complexities, you end up with type-level machinery that will inevitably be similar to the TTG idiom.

So it's not a grammar. "data Exp" might describe each internal node of the AST, where the first word is the node's variety and the following words describe the children nodes. But what does "data Type" mean, then?
You can think of them as enum variants that can also hold extra data.
The first definition (data Exp) is indeed a grammar. Thus "Num", "Bool", "If" are just the names we give to different cases we encounter in the grammar. As for what they are... you can think of them as functions that wrap some data inside a struct of the same name (function names and type names don't collide). As for the name, they are called data constructors.
As for what "data" means, we are defining a new type which can hold some data (again). You can think of it as a synonym for "enum". The reason the word "data" is used here is that there's also something called type synonyms. For example one can write:
type Email = String
type Password = String
Here the underlying representation for both Email and Password is String, but the compiler treats Email and Password as just other names for String. But if you were to use data constructors:
data Email = Email String
data Password = Password String
We are creating entirely new types. Even though the internal representation is still a String, if we were to supply a Password where an Email is expected, we get a compile error.
Essentially a wrapper class can do the same thing. But with data constructors we get the same benefit, plus pattern matching, plus exhaustiveness checks, plus cleaner syntax, plus additional low-level optimizations by the compiler.
No, it isn't. It's an AST datatype. That definition does not at all cover the relationship between text and structure, as a grammar does.
data Exp = Num Int
         | Bool Bool
         | Var Var
         | If Exp Exp Exp
         | Lambda Var Exp
         | App Exp Exp
An Expression is either:

- a Number literal, with an Integer value
- a Boolean literal, with a Boolean value
- a Variable declaration, with some identifier
- an If expression (probably), with a condition, taken, and not-taken Expressions
- a Lambda, which substitutes a Variable into an Expression
- an Apply, which is how we chain Expressions and apply the results of one expression onto the next
data Type = TyInt | TyBool | TyArrow Type Type
Our Types are TyInt, TyBool, or a function type (TyArrow) which goes from an input Type to an output Type.

If that's indeed Haskell, then those words that start with a capital letter are "data constructors". Well, I think a decent language will usually have one or two data constructors: cons for lists and something to make arrays. But in Haskell, for some reason... they allow you to define your own.
Functionally, the thing is preparation for the follow-up code that matches some bits of data against one of these "classes" and executes some bits of code if the "class" matches.
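In Haskell, that "match and run some code" step is pattern matching. A tiny example of my own, reusing a cut-down Exp from above:

```haskell
data Exp = Num Int
         | Bool Bool
         | If Exp Exp Exp

-- Each equation fires for exactly one constructor; the compiler warns
-- if a constructor is left unhandled.
describe :: Exp -> String
describe (Num n)    = "number " ++ show n
describe (Bool b)   = "boolean " ++ show b
describe (If _ _ _) = "conditional"
```

This is the exhaustiveness benefit mentioned elsewhere in the thread: add a constructor to Exp and the compiler points at every match that needs updating.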
I had no issue understanding it is an AST for lambda calculus, as it is clearly mentioned, however I know these kind of languages since Caml Light.
what's the difference between running a program (interpreting it), and compiling it?
where's (and what's) the line that distinguishes the program from the output of the program where both parts are just software?
--
the most interesting thing about Turing machines is the universal Turing machine, because it actually invents/defines/showcases software as we understand it nowadays; this makes me reflect on how the busy beaver problem just misses out on this realization
I think the problem is we write compilers like it's the 90s, and it's not the 90s.
But also, the field is full of data-structure solutions to this, even for static types: give nodes a simple autoincrementing id (if your language has no object maps), then put any extra information in a sidecar if you want. Think relations and joins. It's easy.
#define NUM 0
#define BOOL 1
#define VAR 2
#define IF 3
#define LAM 4
#define APP 5
#define MAX_NODES 4096 /* arena capacity */

int nodes[MAX_NODES][4];
int next_node;
int mkIf(int condExp, int thenExp, int elseExp) {
assert((condExp < next_node) && (thenExp < next_node) && (elseExp < next_node));
nodes[next_node][0] = IF;
nodes[next_node][1] = condExp;
nodes[next_node][2] = thenExp;
nodes[next_node][3] = elseExp;
return next_node++;
}
Need additional data to stick to a node? Declare another array of MAX_NODES length and put it there! That's how Real Programmers wrote compilers in the seventies, because Real Programmers write FORTRAN.

What are you even making fun of?