In a compiler I built, I had a simple ASTState type parameter on every node of the AST, indicating what information was available in the tree. But you could still traverse the entire AST generically, as long as you did not depend on any of the additional information.
You can basically lift the implementation from this paper: https://www.microsoft.com/en-us/research/uploads/prod/2016/1...
Obviously I understand that if this is the limitation, then there are ways to work around this, and it all depends on how annoying the workarounds are.
But I’m curious whether there is a trick that allows it. All the tricks I can think of involve boxing the ASTs you’re about to store in the list (which comes with memory/performance hits), and I’d love to be proven wrong.
That way if you have a type Expr a, you can have a list type [Some Expr].
You can even have HLists, but then you are also pretty far up the ivory tower. I don't think that's worth it.
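For reference, the boxing trick mentioned above can be sketched with an existential wrapper. This is a minimal sketch; `Some` and the toy `Expr` type here are my own illustrative names, not from the thread:

```haskell
{-# LANGUAGE GADTs #-}

-- An existential wrapper: 'Some' forgets the index 'a', so values of
-- Expr Int, Expr Bool, ... can all live in one list.
data Some f where
  Some :: f a -> Some f

-- A toy indexed expression type (illustrative only).
data Expr a where
  IntLit  :: Int  -> Expr Int
  BoolLit :: Bool -> Expr Bool

exprs :: [Some Expr]
exprs = [Some (IntLit 1), Some (BoolLit True)]
```

The cost is exactly the boxing the comment worries about: once a value is inside `Some`, the index can only be recovered by pattern matching on the payload.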
You just have a big dict of types, which are keyed by the AST node
It seemed a bit weird to me, but it certainly works. I guess I usually use dicts that are keyed by value rather than by the address of the instance.
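A sketch of that side-table approach, assuming each node carries an explicit id that the table is keyed by (all names here are mine, not from the comment):

```haskell
import qualified Data.Map.Strict as Map

type NodeId = Int

-- Each node carries its own id (illustrative AST).
data Exp = Num NodeId Int
         | Add NodeId Exp Exp

nodeId :: Exp -> NodeId
nodeId (Num i _)   = i
nodeId (Add i _ _) = i

-- A pass records its results in a side table instead of growing the AST.
data Ty = TyInt deriving (Eq, Show)

typeTable :: Map.Map NodeId Ty
typeTable = Map.fromList [(0, TyInt), (1, TyInt)]

typeOf :: Exp -> Maybe Ty
typeOf e = Map.lookup (nodeId e) typeTable
```

Keying by an id rather than by the node's address also survives tree rewrites, as long as ids are preserved.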
Being playful, I've generally implemented the tree grammars as literal constructions of the tree nodes themselves, so the grammars can self-check.
I've used this approach in Ruby and Java though, not in functional languages with pattern matching and destructuring, so I wasn't leaving those powerful tools on the table.
In particular I have given up trying to represent ASTs using classes. I don't think they deliver a lot of value. Behavior generally belongs outside the class, not inside: transformations should be collected together as an algorithm rather than distributed across multiple definitions. Double dispatch for type-safe tree navigation is unpleasant, and you lose control over the typing of context. You end up with explicit typecasts, you've lost what little type safety you had, and you still can't enforce constraints on how the tree changes from phase to phase.
https://rustc-dev-guide.rust-lang.org/part-3-intro.html
Additionally, many language implementations use a mid-level IR that sits above LLVM: Rust has MIR, Swift has SIL, Julia has an SSA-form IR, etc. The article mentions GHC's Core, which is also an example of an IR that comes after type checking but before LLVM.
In a toy C compiler I wrote, I had a pass that transformed structs and arrays, turning everything into pointer arithmetic -- that was the only thing the pass did. After that phase, the IR no longer needs to know about arrays or structs.
Though I found a downside with that method: if I encountered an error after parsing, since I had repeatedly transformed the program, it was impossible to give a meaningful error message. Which line in the source, or which source-level feature, corresponds to the broken AST node? That kind of information is just as tricky to carry around in the AST as type tagging.
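The kind of single-purpose lowering pass described above might look something like this sketch (my own toy IR, not the commenter's actual code):

```haskell
-- Before lowering, the IR has an Index node; after, only address arithmetic.
data IR = Const Int
        | Add IR IR
        | Mul IR IR
        | Load IR              -- read from a computed address
        | Index IR IR Int      -- base, index, element size (pre-lowering only)
        deriving (Eq, Show)

-- The one thing this pass does: turn Index into Load(base + index * size).
lowerIndex :: IR -> IR
lowerIndex (Index base idx sz) =
  Load (Add (lowerIndex base) (Mul (lowerIndex idx) (Const sz)))
lowerIndex (Add a b) = Add (lowerIndex a) (lowerIndex b)
lowerIndex (Mul a b) = Mul (lowerIndex a) (lowerIndex b)
lowerIndex (Load e)  = Load (lowerIndex e)
lowerIndex e         = e
```

Note that after this pass, nothing in the type system stops an `Index` node from sneaking back in; that is exactly the per-phase invariant problem the rest of the thread is about.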
Love seeing that even Rust compiler devs are forced to be verbose and don't bother breaking some of this code out into functions, because specifying the constraints is too much work.
Can someone who's more familiar with compilers ELI10 the problem? A layman like me is left wondering: why not add (say) a type field to each node?
It’s an example of a classic tradeoff between enforcing correctness and invariants “by construction” vs at runtime. Since the field is nullable, the AST is not necessarily correct by construction, as the programmer may have forgotten to include a type when required. The correctness of compilers using such representations involves either checked or unchecked assertions where the type field is used to ensure that it is set appropriately.
The article continues with various strategies to avoid needing to defer this correctness checking to the runtime by providing various statically-checked alternatives.
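Concretely, the "nullable field" representation the parent describes might look like this in Haskell (a sketch; the names are mine):

```haskell
data Ty = TyInt | TyBool deriving (Eq, Show)

-- The type annotation is a Maybe: Nothing until the type checker runs.
data Exp = Num Int (Maybe Ty)
         | Bool Bool (Maybe Ty)

-- Every consumer after type checking must re-check the invariant at runtime:
typeOf :: Exp -> Ty
typeOf (Num _ (Just t))  = t
typeOf (Bool _ (Just t)) = t
typeOf _                 = error "typeOf: node was never type-checked"
```

The `error` branch is the runtime assertion the comment mentions; nothing forces the programmer to have filled the field in before calling `typeOf`.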
Of course, for conveniently and safely manipulating the AST in memory in $programming_language, you're probably going to want to define some structs/ADTs/whatever that contain only the data a given compilation stage is actively working with.
I've been thinking that what I need is a system that allows me to quickly define different lower-level datatypes for representing different views of the conceptual types and automate, to some degree, translation between them, so then each part of the system can work with objects designed specifically to be processed by it with minimal fuss.
A technical reason for avoiding those specialized types might be that the computer then has to spend more time transforming from one schema to the next. I would think that in practice this isn't any worse than having to do a lot of null checks.
A more human reason is that it could mean a combinatorial explosion of AST types. I guess this is where my idea about lightweight variations comes in.
In TypeScript this kind of thing might not be so bad, since any object can be downcast with no cost to a type that contains a subset of the information, and variations on types can be easily defined without even necessarily being named, e.g. `ASTNode & HasResultType & HasSourceLocation`.
> Both GHC’s core IR, Ur/Web's core and Coq are explicitly typed in this way.
This is how GHC used to do it, but in 2018 GHC adopted the "trees that grow" pattern.
The paper https://www.jucs.org/jucs_23_1/trees_that_grow/jucs_23_01_00... (the paper is actually from 2017)
Light GHC docs on it: https://gitlab.haskell.org/ghc/ghc/-/wikis/implementing-tree... (I recall seeing more thorough docs somewhere). I will quote:
> The new HsSyn AST supporting the TTG idiom (from now on referred to as TTG HsSyn) is designed to subsume five different representations of Haskell syntax:
> AST GhcPs: the AST used in GHC's parsing phase
> AST GhcRn: the AST used in GHC's renaming phase
> AST GhcTc: the AST used in GHC's typechecking phase
> AST TH: the AST used in Template Haskell
> AST HSE: the AST used in an external tool such as Haskell-Src-Exts
This means that you have a single generic tree for all of those passes, but each pass has a different instantiation which has custom data.
Also, some discussion of applications https://www.reddit.com/r/haskell/comments/7zo9mb/any_applica... and slides here http://data.tmorris.net/talks/trees-that-grow/bc4ae902ca1d3b...
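A heavily cut-down sketch of the idiom (my own simplification, not GHC's actual HsSyn): a phase index selects per-constructor extension fields via type families, and an extension constructor XExp lets a phase add variants.

```haskell
{-# LANGUAGE TypeFamilies #-}
import Data.Void (Void)

-- Phase markers.
data Parsed
data Typechecked

-- One type family per extension point.
type family XNum p    -- extra field on Num in phase p
type family XXExp p   -- payload of the extension constructor in phase p

data Exp p = Num (XNum p) Int
           | XExp (XXExp p)

-- Instantiations per phase.
type instance XNum Parsed      = ()       -- nothing extra after parsing
type instance XNum Typechecked = String   -- e.g. a rendered type
type instance XXExp Parsed      = Void    -- no extra variants while parsing
type instance XXExp Typechecked = Void

numValue :: Exp p -> Maybe Int
numValue (Num _ n) = Just n
numValue _         = Nothing
```

Setting an extension point to `Void` makes the extra field or constructor unconstructible in that phase, which is how the single generic tree can still rule variants in or out per pass.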
The thing to observe here is that, in the blog post in the OP, when you added data to a given type, it was inserted as a concrete type (named Type here); that is, you went from
data Exp = Num Int
         | Bool Bool
         | Var Var
         | If Exp Exp Exp
         | Lambda Var Exp
         | App Exp Exp
to

type TExp = (TExp', Type)

data TExp' = TNum Int
           | TBool Bool
           | TVar Var
           | TIf TExp TExp TExp
           | TLambda Var TExp
           | TApp TExp TExp
But you could have kept it a generic type (here called a, rather than Type). This is a common pattern in both Haskell and Rust (and maybe other languages), and isn't the Trees That Grow pattern just yet, but it's like this:

type GenericTExp a = (GenericTExp' a, a)
data GenericTExp' a = TNum Int
                    | TBool Bool
                    | TVar Var
                    | TIf (GenericTExp a) (GenericTExp a) (GenericTExp a)
                    | TLambda Var (GenericTExp a)
                    | TApp (GenericTExp a) (GenericTExp a)
Now you can recover the original TExp with

type TExp = GenericTExp Type
And you can recover the bare Exp, with no added data, by putting () (the unit type) there:

type Exp = GenericTExp ()
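Put together as a runnable sketch (I've kept only a few constructors):

```haskell
data Type = TyInt | TyBool deriving (Eq, Show)

data GenericTExp' a = TNum Int
                    | TBool Bool
                    | TIf (GenericTExp a) (GenericTExp a) (GenericTExp a)

type GenericTExp a = (GenericTExp' a, a)

type TExp = GenericTExp Type  -- the typed tree
type Exp  = GenericTExp ()    -- the bare tree: the annotation is just ()

typed :: TExp
typed = (TNum 1, TyInt)

bare :: Exp
bare = (TNum 1, ())
```

One parameterized definition, two instantiations: the same traversal code can work over either tree as long as it is polymorphic in the annotation.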
Now, the paper also deals with the problems of how to add new variants to a data type (it has a "| Something" that is open ended; this is called an "extension constructor" in the GHC wiki, and denoted by XExpr), and how to add different data types to different nodes (instead of directly adding Type as the generic parameter, you add a marker and use type-level functions -- called type families in Haskell -- to select the data type you're adding). When you figure out all those complexities, you end up with type-level machinery that will inevitably be similar to the TTG idiom.

So it's not a grammar. "data Exp" might describe each internal node of the AST, where the first word is the node's variety and the following words describe the children nodes. But what does "data Type" mean, then?
You can think of them as enum variants that can also hold extra data.
The first definition (data Exp) is indeed a grammar. Thus "Num", "Bool", "If" are just the names we give to different cases we encounter in the grammar. As for what they are... you can think of them as functions that wrap some data inside a struct of the same name (function names and type names don't collide). As for the name, they are called data constructors.
As for what "data" means, we are defining a new type which can hold some data (again). You can think of it as a synonym for "enum". The reason the word "data" is used here is that there's also something called type synonyms. For example one can write:
type Email = String
type Password = String
Here the underlying representation for both Email and Password is String, but the compiler treats Email and Password as just other names for String. But if you were to use data constructors:
data Email = Email String
data Password = Password String
We are creating entirely new types. Even though the internal representation is still a String, if we were to supply a Password where an Email is expected, we get a compile error.
Essentially a wrapper class can do the same thing. But with data constructors we get the same benefit, plus pattern matching, plus exhaustiveness checks, plus cleaner syntax, plus additional low-level optimizations by the compiler.
No, it isn't. It's an AST datatype. That definition does not at all cover the relationship between text and structure, as a grammar does.
data Exp = Num Int
         | Bool Bool
         | Var Var
         | If Exp Exp Exp
         | Lambda Var Exp
         | App Exp Exp
An Expression is either:

- a Number literal, with an Integer value
- a Boolean literal, with a Boolean value
- a Variable declaration, with some identifier
- an If expression (probably), with a condition, taken, and not-taken Expressions
- a Lambda, which substitutes a Variable into an Expression
- an Apply, which is how we chain Expressions and apply the results of one expression onto the next
data Type = TyInt | TyBool | TyArrow Type Type
Our Types are TyInt, TyBool, or a function type (TyArrow) which goes from an input Type to an output Type.

If that's indeed Haskell, then those words that start with a capital letter are "data constructors". Well, I think a decent language will usually have one or two data constructors: cons for lists and something to make arrays. But in Haskell, for some reason... they allow you to define your own.
Functionally, the thing is preparation for the follow-up code that matches some bits of data against one of these "classes" and executes some bits of code if the "class" matches.
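In Haskell, that "match and run some code" step is pattern matching. A tiny example of my own, reusing a cut-down Exp from above:

```haskell
data Exp = Num Int
         | Bool Bool
         | If Exp Exp Exp

-- Each equation fires for exactly one constructor; the compiler warns
-- if a constructor is left unhandled.
describe :: Exp -> String
describe (Num n)    = "number " ++ show n
describe (Bool b)   = "boolean " ++ show b
describe (If _ _ _) = "conditional"
```

This is the exhaustiveness benefit mentioned elsewhere in the thread: add a constructor to Exp and the compiler points at every match that needs updating.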
I had no issue understanding it is an AST for lambda calculus, as it is clearly mentioned, however I know these kind of languages since Caml Light.
what's the difference between running a program (interpreting it), and compiling it?
where's (and what's) the line that distinguishes the program from the output of the program where both parts are just software?
--
the most interesting thing about Turing machines is the universal Turing machine, because it actually invents/defines/showcases software as we understand it nowadays; this makes me reflect on how the busy beaver problem just misses out on this realization
I think the problem is we write compilers like it's the 90s, and it's not the 90s.
But also, the field is full of data-structure solutions to this, even for static types: give nodes a simple autoincrementing id (if your language has no object maps), then put any extra information in a sidecar if you want. Think relations and joins. It's easy.
#define NUM 0
#define BOOL 1
#define VAR 2
#define IF 3
#define LAM 4
#define APP 5
#define MAX_NODES 4096 /* arena capacity */

int nodes[MAX_NODES][4];
int next_node;
int mkIf(int condExp, int thenExp, int elseExp) {
assert((condExp < next_node) && (thenExp < next_node) && (elseExp < next_node));
nodes[next_node][0] = IF;
nodes[next_node][1] = condExp;
nodes[next_node][2] = thenExp;
nodes[next_node][3] = elseExp;
return next_node++;
}
Need additional data to stick to a node? Declare another array of MAX_NODES length and put it there! That's how Real Programmers wrote compilers in the seventies, because Real Programmers write FORTRAN.

What are you even making fun of?