undefined | Better HN

0 pointsender79y ago0 comments

The first step in compilation is lexing -- converting a character stream to a stream of semantic "tokens", where a token might be "a number literal" or "the 'while' keyword" or a single character token like "{". This process is usually done via regexs.

0 comments

vidarh9y ago

It is rare for lexers to use actual regex engines, in my experience, as it's impractical if you want to do proper error reporting unless you write your own regex engine in which case people tend to opt for hand-writing lexer rules.

It's not unheard of, and I have the impression it's getting more common, but still only a small proportion of the compilers I've seen over the years explicitly use regexps.

rocket699y ago

So, what parses the regex in the lex, flex, Jison, etc...

vidarh9y ago

Most tools like that reduces any regexes to dfa's or similar, but most production compilers I've seen don't use tools like that because they're a pain to do proper error reporting for.

1 more reply

chrisseaton9y ago

I didn't think of lexer rules as regular expressions but I suppose they are of course.

But I can't imagine a lexer would ever be the performance bottleneck in a compiler.

adrianratnapala9y ago

According to textbooks, lexers are the bottleneck because they are the only part that has to literally look at every byte of the source (I am including hashing as part of the lexing process).

I am not sure if this is true any more. It probably depends on the language.

Even as CPUs grow faster, code bases grow bigger, so the number-of-bytes argument is still important. On the other hand, heavy optimisations and difficult languages like C++ will shift the bottlenecks to later stages.

readittwice9y ago

I think I read that too, but I am not sure that this is actually true. Please correct me if I am wrong, according to the Rust developers code generation seems to be the bottleneck. That's why they are working on incremental compilation to improve compilation times. They detect changes in the input files by comparing the AST to the old version. So parsing is always necessary but later stages can be cached. That seems to confirm that parsing is not the bottleneck, at least not for Rust.

OTOH that probably also depends on your use case, JS for example needs to get parsed at every page load. Some JS-VMs only parse the source code lazily, at first the parser only detects function boundaries. Only if this function is really executed, the function gets parsed completely.

4 more replies

titzer9y ago

Depending on how source programs are split up into files, parsing can be easily parallelized (think one thread per file), while other compilation tasks are harder due to interdependencies. E.g. semantic analysis requires building type representations, global namespaces, etc. Code generation is (usually) parallelizable as well, but there are a couple very serial steps in the middle, too.

My experience is that parsers for source languages can reach into the 1-10mb/s range, and depending on how complex the IRs and transformations are after that, code generation is usually around 0.5mb-5mb/s. The stuff in the middle (dealing with IRs) is harder to measure in terms of bytes.

ab129c871xy9y ago

Perhaps in an ancient one-pass C compiler. Modern compilers build many trees, usually one for each pass. Later stages do expensive operations on sets (even without optimization one has to do basic register allocation), so I'd say the lexing stage is entirely negligible.

jbangert9y ago

C++ compilers (at least on most current code bases that don't need modules) still need to parse and lex headers though -- so actual executable code is pretty rare. Most compilations are also debug builds which don't optimize(but creating debug information can also be slow; the really slow step is usually linking)

signa119y ago

> But I can't imagine a lexer would ever be the performance bottleneck in a compiler.

one rather trivial way to observe the effects of avoiding building a source file is to use ccache. ccache avoids the recompilation (even if you do a make clean, or some such), and it is not uncommon to observe speed ups of factor of 5 or so.

however, once you have crossed that barrier, you hit the linking wall. which is where you would end up spending a large portion of time. gold (https://en.wikipedia.org/wiki/Gold_(linker)) optimizes that i.e. supports incremental linking, but unfortunately, i haven't had any experience in large code-bases where it is being used.

pertymcpert9y ago

No chance, modern compilers spend most of their time in the optimization stages. Lexing is peanuts.

1 more reply

kjax9y ago

Lexer generators typically allow describing rules with regular expressions, but through Thompson's construction, subset construction, and DFA minimization, you can end up with a blazing fast/optimized DFA that processes your input at maximum speed.

paulddraper9y ago

I think they're usually done with a specific FSM, rather than general regular expressions.

titzer9y ago

Not all parsers do a lexing step. Such ones are called "scannerless parsers" and can be much, much faster than parsers with a separate lexing step.

duaneb9y ago

A lot of optimizations on arbitrary byte code often look for patterns in byte streams (or in reified assembly) in similar ways to regexes.

teraflop9y ago

For many years, an important stage of the GHC Haskell compiler consisted of a giant Perl script full of regexes.

http://code.haskell.org/ghc-scp/ghc/docs/comm/the-beast/mang...

j / k navigate · click thread line to collapse

0 comments

vidarh9y ago

It's not unheard of, and I have the impression it's getting more common, but still only a small proportion of the compilers I've seen over the years explicitly use regexps.

rocket699y ago

So, what parses the regex in the lex, flex, Jison, etc...

vidarh9y ago

Most tools like that reduces any regexes to dfa's or similar, but most production compilers I've seen don't use tools like that because they're a pain to do proper error reporting for.

1 more reply

chrisseaton9y ago

I didn't think of lexer rules as regular expressions but I suppose they are of course.

But I can't imagine a lexer would ever be the performance bottleneck in a compiler.

adrianratnapala9y ago

According to textbooks, lexers are the bottleneck because they are the only part that has to literally look at every byte of the source (I am including hashing as part of the lexing process).

I am not sure if this is true any more. It probably depends on the language.

readittwice9y ago

4 more replies

titzer9y ago

ab129c871xy9y ago

jbangert9y ago

signa119y ago

> But I can't imagine a lexer would ever be the performance bottleneck in a compiler.

pertymcpert9y ago

No chance, modern compilers spend most of their time in the optimization stages. Lexing is peanuts.

1 more reply

kjax9y ago

paulddraper9y ago

I think they're usually done with a specific FSM, rather than general regular expressions.

titzer9y ago

Not all parsers do a lexing step. Such ones are called "scannerless parsers" and can be much, much faster than parsers with a separate lexing step.

duaneb9y ago

A lot of optimizations on arbitrary byte code often look for patterns in byte streams (or in reified assembly) in similar ways to regexes.

teraflop9y ago

For many years, an important stage of the GHC Haskell compiler consisted of a giant Perl script full of regexes.

http://code.haskell.org/ghc-scp/ghc/docs/comm/the-beast/mang...

j / k navigate · click thread line to collapse