It works fine in handwritten recursive-descent parsers, which 99% of the time pass around a state object.
Assume (to simplify enough to write this out quickly for HN) you have the following:
1. A token stream with markings of which tokens changed on the last lex.
2. A handwritten recursive descent parser with a state object passed along.
3. A copy of the last AST generated, with tokens linked into the stream, and the AST being built from pieces by the RD parser as it goes.
4. A function called LA(N) that does lookahead, and a function called LB(N) that does lookbehind.
5. We assume that each RD function returns the portion of the AST it built and they get linked together bottom up.
6. We further assume that semantic decisions of whether to descend and of generating portions of the AST are deterministic, depend only on the tokens in some fashion, and use the LA/LB interface to check tokens. None of this is required to make it work; it just makes the HN version simpler.[1]
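To make the assumptions above concrete, here is a minimal sketch of the interfaces they imply. All names here (Token, ParseState, LA, LB) are my own for illustration, not from any real implementation:

```python
# Hypothetical sketch of the assumed interfaces: a token stream with
# change marks (1), a state object (2), and LA/LB lookahead/lookbehind (4).
from dataclasses import dataclass, field

@dataclass
class Token:
    text: str
    changed: bool = False  # marked by the lexer on the last relex

@dataclass
class ParseState:
    tokens: list           # the token stream
    pos: int = 0           # current token index
    # per-function min/max ranges of tokens looked at (filled in below)
    ranges: dict = field(default_factory=dict)

def LA(state: ParseState, n: int) -> Token:
    """Lookahead: peek at the nth token at or after the current position."""
    return state.tokens[state.pos + n - 1]

def LB(state: ParseState, n: int) -> Token:
    """Lookbehind: peek at the nth token before the current position."""
    return state.tokens[state.pos - n]
```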
We'll now separate this into the "first parse" and "every other parse" to make it simpler to describe, but it is easy enough to combine.
On first parse, every time LA(n) or LB(n) is called, we mark in the state object that our particular recursive descent function looked at that token (in practice this gets done by tracking the range of tokens looked at).
Overall, we track the min/max range of tokens looked at by a given function the same way you'd do DFS in/out numbers:
fun parse_this_part(state):
    starting_min = current_token_index
    do some LA/LB calls, recursive calls, other work, whatever
    for every recursive call, update child_min to the minimum over all
      child calls, and child_max to the maximum over all child calls
    state[parse_this_part].max = max(current_token_index, child_max)
    state[parse_this_part].min = min(starting_min, child_min)
    return the AST portion we built up from our calls/work
On every additional parse, we do that work plus the following: as we descend, before making a recursive call, check whether any token in the min/max range of the function about to be called changed. If so, keep descending and re-parse that portion. If not, return the cached AST, as it cannot have changed.
You will now re-parse only the portions that could possibly have been affected by a change, and since we check before any descent, we reparse as small a portion as our granularity of change detection allows (i.e., if we track ranges, we reparse the smallest portion possible within those ranges; if we track individual token changes, we reparse the smallest portion possible, period).
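Here is a runnable toy version of the whole scheme, using a grammar and names I made up for illustration (a "program" is a ';'-separated list of statements, each a run of word tokens). Each parse function caches the subtree it built along with the token range it examined; on reparse, a subtree whose range contains no changed token is reused without descending:

```python
# Toy incremental recursive-descent sketch (hypothetical grammar/names).
# cache maps a statement's starting token index -> (cached AST node,
# end-of-range index); changed[i] marks whether token i changed on relex.

def parse_stmt(tokens, pos, cache, changed):
    start = pos
    # "Check before any descent": if a cached node covers this start and
    # none of the tokens in its recorded range changed, reuse it.
    if start in cache:
        node, end = cache[start]
        if not any(changed[i] for i in range(start, end)):
            return node, end, True  # reused cached subtree
    words = []
    while pos < len(tokens) and tokens[pos] != ';':
        words.append(tokens[pos])
        pos += 1
    if pos < len(tokens):  # consume the ';'
        pos += 1
    node = ('stmt', words)
    cache[start] = (node, pos)  # remember node + token range [start, pos)
    return node, pos, False

def parse_program(tokens, cache, changed):
    pos, stmts, reused = 0, [], []
    while pos < len(tokens):
        node, pos, hit = parse_stmt(tokens, pos, cache, changed)
        stmts.append(node)
        reused.append(hit)
    return ('program', stmts), reused
```

On the first parse every statement is built fresh; after editing one token, only the statement whose range contains it is reparsed, and the rest come back from the cache. (Note this toy keys the cache by absolute token position, so it has exactly the limitation described in point 4 below about edits that add or remove tokens.)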
In practice, the vast majority of grammars don't get above n=3 lookahead when doing anything real. Most are n=1, some are n=2, and n=3 or above is very rare.
You can do better than this in a number of ways: 1. You don't have to keep before/after ASTs explicitly.
2. If you want explicit per-token tracking, you can make sets of the tokens that got looked at instead of ranges. You can use sparse bitsets and token ids to speed up checking (reducing the problem to whether the sparse bitmap intersection of all_changed_tokens and this_functions_looked_at_tokens is empty)
This is generally not worth the overhead in practice, but it is optimal.
3. Usually things like epoch numbers are added so you can tell which parts of the AST/tree came from which parses for debugging.
4. If you store the min/max ranges in the obvious way (in the parser state, using absolute token positions), you have to update them on each parse whenever an edit before you added or removed tokens. So in this naive implementation, you still have to do min/max updates as you recurse, but none of the other work. In less naive implementations, you can store the min/max ranges as relative offsets so you don't have to do this; it usually requires modifying the LA/LB functions to take more state, etc.
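The per-token tracking in option 2 reduces to a set-intersection emptiness test. A tiny sketch (my names, and using Python sets where a real implementation would use sparse bitsets):

```python
# Sketch of option 2 above: per-function sets of examined token ids
# instead of min/max ranges. A subtree needs reparsing iff the set of
# changed token ids intersects the set of ids the function looked at.
def needs_reparse(changed_token_ids: set, looked_at_token_ids: set) -> bool:
    # With real sparse bitsets this is a word-wise AND over the bitmaps;
    # Python's set.isdisjoint gives the same answer.
    return not changed_token_ids.isdisjoint(looked_at_token_ids)
```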
If you want to see how it looks in something like ANTLR 4 TypeScript (there is a regular ANTLR impl too, but I think the TypeScript version is easier to read), see https://github.com/dberlin/antlr4ts/blob/incremental/src/Inc... and the surrounding files.
I did not really try to optimize it; it was being used in a VS Code extension to parse G-code files, which are often very large for complex designs, though only small portions usually change in an editor. For that case, range tracking and the like was fine.
It should be pretty easy to read and see how you would apply this to a hand-written parser. In fact, I would guess you could reuse the vast majority of it pretty directly.
[1] To relax this restriction, you just need to similarly track whatever data your function and its called children based their decisions on (whether to make a call, or which AST to generate), along with a way to tell whether that data changed. You then check it before descending, the same way you check above whether the tokens changed.
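Concretely, the relaxed check could look something like this (names hypothetical): each cache entry records, alongside the token ids it examined, the non-token inputs its decisions depended on (say, a symbol-table lookup), and reuse requires both to be unchanged:

```python
# Sketch of the relaxation in [1]: a cache entry stores the token ids a
# subtree's parse looked at plus the decision inputs it depended on.
# Reuse is allowed only if neither the tokens nor those inputs changed.
def can_reuse(entry: dict, changed_token_ids: set, current_inputs: dict) -> bool:
    tokens_unchanged = changed_token_ids.isdisjoint(entry['token_ids'])
    inputs_unchanged = all(
        current_inputs.get(key) == value
        for key, value in entry['inputs'].items()
    )
    return tokens_unchanged and inputs_unchanged
```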