When you think more about regexes this way, you realize that a regex is just a tiny description of a virtual machine (or emulator) that can process the simplest of instructions (check for 'a', accept '0-9', etc.). Each step in the regex is just a piece of bytecode that can execute, and if you turn a regex on its side you can visualize it as just a simple assembly program.
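To make that concrete, here's a minimal sketch (names like CHAR/SPLIT/JMP/MATCH are illustrative, not from any real engine) of the regex a(b|c)*d hand-compiled to bytecode and run on a tiny backtracking VM:

```python
# Hand-compiled bytecode for the regex a(b|c)*d.
PROG = [
    ("CHAR", "a"),     # 0: consume 'a' or fail
    ("SPLIT", 2, 7),   # 1: try the (b|c) loop body, or exit to 7
    ("SPLIT", 3, 5),   # 2: try 'b' first, then 'c'
    ("CHAR", "b"),     # 3
    ("JMP", 1),        # 4: back to the loop head
    ("CHAR", "c"),     # 5
    ("JMP", 1),        # 6
    ("CHAR", "d"),     # 7
    ("MATCH",),        # 8
]

def run(prog, s, pc=0, i=0):
    while True:
        op = prog[pc]
        if op[0] == "CHAR":
            if i < len(s) and s[i] == op[1]:
                pc, i = pc + 1, i + 1
            else:
                return False
        elif op[0] == "JMP":
            pc = op[1]
        elif op[0] == "SPLIT":      # backtracking: try one arm, then the other
            if run(prog, s, op[1], i):
                return True
            pc = op[2]
        elif op[0] == "MATCH":
            return i == len(s)
```

run(PROG, "abcbd") matches; run(PROG, "abx") doesn't. A production engine in the Thompson/RE2 tradition runs the same instruction set breadth-first instead of backtracking, which avoids the exponential blowups this toy allows.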
Interestingly, regexes/FSMs are (IIRC) the most powerful class of machines for which equivalence is decidable. So if you give me any two regexes, I can tell you if they match on all the same strings, but this is not true for any more powerful grammar.
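For DFAs the decision procedure is short enough to sketch: walk the product automaton breadth-first and check that acceptance agrees at every reachable pair of states. (This toy assumes total transition functions; dfa_equivalent is my name, not a standard API.)

```python
from collections import deque

def dfa_equivalent(d1, d2, alphabet):
    """Each DFA is (start, accepting_states, transitions), where
    transitions is a total dict mapping (state, symbol) -> state."""
    (s1, acc1, t1), (s2, acc2, t2) = d1, d2
    seen = {(s1, s2)}
    queue = deque([(s1, s2)])
    while queue:                        # BFS over the product automaton
        p, q = queue.popleft()
        if (p in acc1) != (q in acc2):
            return False                # a distinguishing string exists
        for a in alphabet:
            nxt = (t1[(p, a)], t2[(q, a)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True                         # acceptance agrees everywhere reachable

# "even number of a's" vs. "length mod 4 is 0 or 2" -- same language
even = (0, {0}, {(0, "a"): 1, (1, "a"): 0})
mod4 = (0, {0, 2}, {(0, "a"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 0})
```

The product automaton has at most |Q1|x|Q2| states, so the search always terminates -- which is exactly why equivalence is decidable here and not for machines with unbounded storage.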
Wikipedia [1] shows there's a paper called "The equivalence problem for deterministic pushdown automata is decidable" that won the Gödel Prize in 2002. I haven't read the paper, nor do I currently have access to it, though.
[1] https://en.wikipedia.org/wiki/Deterministic_pushdown_automat...
Any particular pair of grammars from that class. I mean, if you showed me a yacc grammar for Pascal and another one for Java, I am sure I could prove they were inequivalent.
Trickier: if you gave someone clever a hand-written Pascal parser and that same yacc grammar, she could probably prove equivalence with some effort. But there is no guarantee that this is always possible.
which is _exactly_ what bpf does :)
You can get something practically equivalent to a Turing-complete language, because the tape of a Turing machine is, contrary to the idealized theory, limited by physical resources. CPUs are finite state automata, yet we use them to solve problems of sufficiently low complexity in NP space.
I am not an expert, please correct me if I am wrong.
Most people refuse to write one because it's so easy not to. Why bother?
It will make you a better coder for the rest of your life.
Let's make a list of "power projects" like this. A bytecode interpreter, a software rasterizer... What else?
* Compression (lossless, lossy, image, audio, texture, video)
* Languages (bytecode interpreter, AST interpreter, parser/lexer for a simple language, simple JIT, understanding instruction scheduling)
* DSP programming (writing programs for fast, branchless math)
* Comfort with binary and binary formats (start with packfiles .zip/.tar, move onto reverse engineering simple formats for e.g. games)
* Understanding the difference between RAM and address spaces (e.g. understanding virtual memory, mmap, memory-mapped IO, dynamic linking, the VDSO, page faulting, shared memory)
* Device drivers (easier on Linux, understanding userspace/kernel interaction, ioctls, how hardware and registers work, how to read spec sheets and hardware manuals)
* Graphics (modern software rasterizer that's not scanline-based, understanding 3D and projective transforms, GPU programming and shaders, basic lighting (reflection and illumination) models, what "GPU memory" is, what scanout is, how full scenes are accumulated all along the stack)
I could go heavily into depth for any one of these. Ask me questions if you're interested! They're all fun and I always have more to learn.
Also, the longer you get into any given track, the more you realize they all connect in the end.
If anyone is interested in creating an indie/hobby game engine, I'd recommend the following:
- game loop with independent render FPS and a constant simulation tick delta (otherwise it runs non-deterministically due to floating point)

- "entities, components, systems" architecture (ECS)

- data-driven content (load level and game object data from JSON or similar, and programmatically build your scene)

- basic event or messaging system
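The constant-tick game loop mentioned above can be sketched as a small accumulator loop (run_frames and the injectable clock are illustrative, not from any engine):

```python
def run_frames(clock, n_frames, update, render, tick=1.0 / 60.0):
    """Fixed-timestep loop: update() always steps by the same dt, so the
    simulation stays deterministic; render() gets an interpolation factor."""
    previous = clock()
    acc = 0.0
    for _ in range(n_frames):
        now = clock()
        acc += now - previous
        previous = now
        while acc >= tick:       # run as many fixed ticks as time allows
            update(tick)
            acc -= tick
        render(acc / tick)       # leftover fraction, for smooth drawing
```

In a real engine the frame count is replaced by a `while running` loop, the clock is a monotonic wall clock, and you'd clamp the accumulator after a long stall to avoid the "spiral of death" where updates can never catch up.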
Honestly, I can't think of a field that covers as wide a range of CS topics at the depth I've seen in lower level game development.
If you focus on understanding and implementing the popular patterns and algorithms recommended by the indie community, it's not the most daunting of projects. There's so much great information, and a few good books that break down and walk through a working implementation you can build on once you understand it from the ground up.
Do you think there's any value in going back to say, DOS based VGA programming? People in #osdev thought I was a bit strange for wanting to write a bootable kernel that only put pixels on the screen, but I really enjoy the idea of starting with plotting pixels, then moving on to more complex effects.
now that it is kind of done, i want to make it faster :) for example, having a home-grown vector_3d class kind of sucks (performance wise). it might be better to have vector_3d be actually based on, say, numpy.array ? once that is done, and i start seeing some real improvement, it might be possible to go the other route as well i.e. write the hot-spots in c++/c, and interface that with python.
or go with lua all the way ? or maybe try a hybrid approach (which would allow you to see how embeddable the language really is)
possibilities are endless, and as you so rightly said, gratification is instantaneous :)
* An MVC web framework.
* A GUI validation library.
* A JavaScript UI library.
* An iteratee implementation.
* An Erlang-style actors library in another language.
* Implementing for-yield, async-await, etc on top of delimited continuations.
* Interpreters, typecheckers, code generators.
* Some of an ECMAScript implementation.
* Concurrent data structures: futures/promises, queues, etc.
* A simple roguelike game.
Same category: probably not a great idea to use it and claim it's better than what's out there (it won't be), but boy howdy will it teach you about distributed systems and pitfalls that will absolutely help you when you are working on one at your day job.
Having just a rasterizer is like having a naked VM without a bytecode compiler.
So I decided to add HTML/CSS engines to it.
But if you have HTML/CSS, then why not add scripting to it? So I added a VM executing bytecode, a GC, and a compiler producing that bytecode. Having a VM, I thought it would be cool to have built-in persistence - you need to persist UI state somehow, right? So it got an integrated NoSQL database on board.
That's pretty much how http://sciter.com was born.
Some stats:
$ time vim -c 'edit kernel-total' -c q
real 0m8.162s
user 0m7.687s
sys 0m0.398s
$ vim kernel-total ^Z
$ ps aux | awk "/kernel-total$/ || NR == 1"
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
hawski 10467 2.7 17.1 805120 670988 pts/2 T 17:41 0:07 vim kernel-total
$ time emacs kernel-total -f kill-emacs
real 0m7.155s
user 0m6.869s
sys 0m0.237s
$ emacs kernel-total ^Z
$ ps aux | awk "/kernel-total$/ || NR == 1"
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
hawski 10825 43.8 14.8 857484 581152 pts/2 T 17:47 0:07 emacs kernel-total
With vim editing is quite smooth and with emacs it feels laggy. I only edited a bit, like entering a new line here and there, moving a few pages down, going to the end of the file and to the top again. Saving is sluggish.

* Floating point manipulation and numerical computation
* Rolling some own crypto (strictly for fun use)
With proper test vectors and some code review, crypto is in fact quite easy. More dangers lie in the use of crypto, especially when the API is flexible or complex. And of course good old exploitable bugs.
[1] https://en.wikipedia.org/wiki/IBM_Informix-4GL -- I am flabbergasted to see 4GL is still in use 30 years on.
Most people don't really realise that the more complex an inliner gets, the more it resembles a rabid animal you can just barely restrain on a leash.
My prof said he had laughed at my "how hard can it be?" attitude. I worked my ass off and ended up being asked whether he could use the inliner in a language often mentioned on HN today :)
The inliner was superseded by a much better one in 2003 written by a much smarter person than I, though.
The advantage of emulating a real system/chip is that if you can find old software for it, you avoid the step of having to invent your own programs: real, running programs already exist.
A great example of this mindset is the guy who bought a mainframe. [1]
Refuse to be placed in a silo. Work your way up and down the stack and you'll be much better placed to solve problems and learn from the patterns that repeat at all levels.
Sometimes you can use the user-facing errors to tell which component has the issue. For instance, if a web site gives a 502 error, the problem is likely with the load balancer or the lower network stack on the web server. A 404 would often be a file-system-level issue on the web server. A 500 is frequently a network issue between the web server and the database server. A 400 is a problem with the site presentation code, or maybe the database malforming addresses.
This. No matter how specialised you are (or want to be), always strive to have at least a basic understanding of the full stack and everything else that your work touches through a couple of levels of indirection (including the wetware: in commercial contexts, that means having a good understanding of your client's business even if you aren't close to being client-facing). It will help you produce much more useful/optimal output, and you can be a lot more helpful when your colleagues/compatriots/partners/what-have-you hit a technical problem. Heck, at the logical extreme, a little cross-discipline understanding could even lead you to discovering a better method of doing X that strips out the need for Y altogether, revolutionising how we do Z.
Of course don't go overboard unless you are truly a genius... Trying to keep up with everything in detail is a sure-fire route to mental burn-out!
1) Domain-specific languages (DSLs)
2) Virtual machines (or just explicit state machines more generally)
What I mean is, a lot of problems could be solved cleanly, elegantly, more safely, and more powerfully by using one (or both) of the above. The problem is that when people think DSL or VM, they think big (Scheme or the JVM) instead of thinking small (printf). A DSL or VM doesn't need to be complex; it can be incredibly simple and still be immensely more powerful than coding a solution directly in an existing language using its constructs and APIs.
Case in point: the BSD hexdump(1) utility. POSIX defines the od(1) utility for formatting binary data as text, and it takes a long list of complex command-line arguments. The hexdump utility, by contrast, uses a simple DSL to specify how to format output. hexdump can implement almost every conceivable output format of od and then some using its DSL. The DSL is basically printf format specifiers combined with looping declarations.
I got bored one day and decided to implement hexdump as a library (i.e. "one hexdump to rule them all"), with a thin command-line wrapper that emulates the BSD utility version. Unlike BSD hexdump(1) or POSIX od(1), which implement everything in C in the typical manner, I decided to translate the hexdump DSL into bytecode for a simple virtual machine.
http://25thandclement.com/~william/projects/hexdump.c.html
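The flavor of the approach, greatly simplified: compile a formatting spec into a flat opcode list once, then run that list over the input. This toy (compile_spec, run_dump, and the opcodes are all made up) only loosely imitates the real hexdump DSL:

```python
def compile_spec(count, fmt):
    """Compile "count iterations of fmt per line" into a flat opcode list."""
    code = []
    for _ in range(count):
        code += [("READ",), ("FMT", fmt)]   # read one byte, format it
    code.append(("NL",))                     # end the output line
    return code

def run_dump(code, data):
    out, pos = [], 0
    while pos < len(data):           # one pass over the opcode list per line
        val = 0
        for op in code:
            if op[0] == "READ":
                if pos >= len(data):
                    break            # input exhausted mid-line
                val, pos = data[pos], pos + 1
            elif op[0] == "FMT":
                out.append(op[1] % val)
            elif op[0] == "NL":
                out.append("\n")
    return "".join(out)
```

The payoff is the same as in the real thing: every output format is just different bytecode fed to one small loop, rather than a new pile of ad hoc C.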
The end result was that my implementation was about the same size as either of those, but 1) could be built as a shared library, command-line utility, or Lua module; 2) is more performant (formats almost 30% faster for the common outputs, thanks to a couple of obvious, easy, single-line optimizations the approach opened up) than either of the others; and 3) is arguably easier to read and hack on.

Granted, my little hexdump utility doesn't have much value. I still tend to rewrite a simple dumper in a couple dozen lines of code for different projects (I'm big on avoiding dependencies), and not many other people use it. But I really liked the experience and the end result. I've used simple DSLs, VMs, and especially explicit state machines many times before and since, but this one was one of the largest and most satisfying.
The only more complex VM I've written was for an asynchronous I/O SPF C library, but that one is more difficult to explain and justify, though I will if pressed.
1. Every function is a tiny VM already. Every macro is a layer on top of a "compiler" to let you redesign a language. LISP gives much more power, precisely because every program is its own DSL, and all power of the language within that DSL is available. http://www.paulgraham.com/avg.html
2. In SICP they show how to build anything from LISP constructs. The elegant thing is that LISP actually only needs very few machine primitives. https://mitpress.mit.edu/sicp/full-text/book/book-Z-H-30.htm...
The most powerful constructs I've seen is a combination of these 2: Clojure/async is a macro which transforms any program to state machines. I think that kind of power gives you 20-30x advantage in business type applications. In fact I've seen C++ programmers spending a decade on discovering similar things by themselves. I strongly believe everyone should know at least a little about LISP.
When I think DSL I think regular expressions (especially PCREs) or SQL, where the syntax is tailor-made for the specific task at hand.
The problem with in-program transformations is that you're still largely bound to the syntax of the host language. That's particularly the case with s-expression. That binding has amazing benefits (e.g. when auto-transforming programs into resumable state machines) but also puts constraints on the language syntax you can employ. That's not a problem when you're dealing with experienced engineers and devising technical solutions, but people tend to stay away from DSLs generally because they fear the burden imposed on others (including their future selves) having to learn a new language, how to use it, and how it integrates within a larger framework. You minimize that burden and make DSLs more cost-effective by maximizing the expressiveness of the DSL in the context of its purpose, and minimize switching costs by making the delineation between the host language and the DSL clear and obvious.
So, for example, in concrete terms you'd generally implement arithmetic expressions using infix notation. You can sorta implement infix notation in LISP, but you still have the s-expression baggage (such as it is; it's obviously not baggage from the expert's standpoint), which among other things makes it difficult to know where the host language ends and the DSL begins.
Lua had a head start on most popular languages with its LPeg PEG parser, which makes it trivial to parse custom DSLs into ASTs. For all the limitations of PEGs[1], they're just amazingly powerful. But to drive home my previous point: while expert LPeg users use the Snowball-inspired pattern where you instantiate PEG terms as Lua objects and build grammars using Lua's arithmetic operators (with operator overloading, "a * b" means concatenation instead of multiplication, and "a^1" means repeat 1 or more times), newbies tend to prefer the specialized syntax, which uses a notation similar to the original PEG paper and to typical regular expression syntax. (LPeg includes a small module to parse and compile that notation.)
Converting the AST into usable code is another matter, and that's an area where LISP shines for sure. And now that I think about it, the best of both worlds might be a PEG parser in LISP. But I'm completely ashamed to say I haven't used LISP or LISP derivatives.
[1] Yes, implementing left recursion can be taxing using a PEG. But the syntax is so intuitive and powerful that the alternatives just can't replace it. Regular expressions are limited, too, but PEGs (or anything more sophisticated) will never replace them, because regex expressiveness is just too cost-effective within certain domains. Instead we supplement regular expressions rather than replace them; and if PEGs continue catching on, I expect PEGs to be supplemented rather than replaced.
https://github.com/wahern/dns/blob/master/src/spf.rl
(The Ragel stuff is for parsing the SPF records and is a rather small detail. I always intended to remove it--it's a dependency that turns a lot of people off--but never got around to it. But, FWIW, Ragel is awesome when you don't mind the build-time dependency.)

Basically, what happened was that when I set out to write an async SPF library, I found myself constantly refactoring the internal API and devising ever more clever hacks for implementing continuations in C. The biggest problems were:
1) It had to support asynchronous I/O, and specifically non-blocking I/O.
2) I explicitly DID NOT want to use callbacks because one design goal of the library was that it be easy to integrate into other code and frameworks, regardless of event loop and even without an event loop. I had recently implemented a non-blocking DNS library after years of using and hacking on ADNS and c-ares taught me to hate callbacks in C.
Instead, you repeatedly call spf_check() until you get something other than EAGAIN. When you get EAGAIN, you use spf_pollfd(), spf_events(), and spf_timeout() for the polling parameters. (The file descriptor can change dynamically, e.g. if the DNS query had to switch to TCP from UDP. The events can be POLLIN or POLLOUT, though usually POLLIN, and the timeout is necessary because DNS send/recv timeouts can be much smaller than the timeout for the whole SPF check operation.)
3) SPF is recursive--policies can include other policies, which can include other policies. So you're already dealing with a logical stack situation, but you can't rely on the language's runtime stack. Also, for each rule there can be an iteration over DNS records with intermediate lookups, and it's especially ugly in C to pause and resume loops with inner function calls.
4) I had never implemented SPF before. I kept running into areas where my approach turned out to be deficient as I became more familiar with the spec. And because of constraints 1-3, this was very costly in time and effort. Were I to implement an SPF library again I probably wouldn't use a VM; instead I'd go back to using a basic state machine, function jump table, and a simpler state stack. But at the time it made sense because it was a more flexible and conceptually simpler approach given the unknowns. Plus, it was an excuse to try out a concept I had long had, which was implement a VM for non-blocking I/O operations.
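The call-until-done convention from point 2 looks roughly like this, sketched in Python with a hypothetical Resolver standing in for the real spf_check()/spf_pollfd()/spf_events()/spf_timeout() API:

```python
class Resolver:
    """Toy stand-in for a non-blocking operation driven by the caller."""
    def __init__(self, steps):
        self._steps = steps      # pretend each step is one DNS round-trip
        self.result = None

    def check(self):
        if self._steps > 0:
            self._steps -= 1
            return "EAGAIN"      # not done: caller must poll, then call again
        self.result = "pass"
        return "DONE"

r = Resolver(steps=3)
polls = 0
while r.check() == "EAGAIN":
    polls += 1   # real code: poll() on r's descriptor/events/timeout here
```

The point of the inversion is that the library never calls you back; your event loop (or lack of one) stays in charge, which is exactly what makes the library easy to embed anywhere.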
So basically what it does is decompose the terms in an SPF policy into logical opcodes. The "ip4:1.2.3.4" term, for example, compiles into an opcode which pushes a typed address (1.2.3.4) onto the stack, plus the opcode (OP_IP4) for matching the address at the top of the stack against the address of the connecting host (which is part of the context of an instantiated SPF processor object).
The term "mx:foo.bar", by contrast, requires querying MX records for foo.bar, iterating over each host and recursively querying the A or AAAA records, and then matching the IP address like above. The "mx:" term is first emitted like the "ip4:" term--push the domain string followed by OP_MX. But when the OP_MX opcode is executed, it dynamically generates more bytecode to do the query. I don't remember why I did it this way--probably because it seemed simpler/easier at the time.
It works well, but objectively it is over-engineered. At the very least the VM is _too_ general. Nonetheless, AFAIK it's the most comprehensive async I/O SPF library in C that doesn't use callbacks. It only passes ~80% of the SPF test suite, but that's mostly because of the policy parser--IIRC the ABNF grammar for SPF is broken because the proof-of-concept was implemented using a naive regular expression to chop up the string (the ABNF translation of the pattern is ambiguous); and I never bothered or needed to handle some of the corner cases necessary to match the behavior of, e.g., libspf2. But my implementation handles many millions (if not billions) of queries every day, as it's used in several commercial and cloud products.
I've had use for that knowledge nearly fifteen years later - most of the interesting learning from building one has been about the inner loop.
The way you build a good interpreter is upside-down compared to most tech - the simpler system often runs faster than anything more complicated.
Because of working on that, then writing my final paper about the JVM, contributing to Perl6/Parrot and then moving onto working on the PHP bytecode with APC, my career went down a particular funnel (still with the JVM now, but a logical level above it).
Building interpreters makes you an under-techtitect, if that's a word. It creates systems from the inner loop outwards rather than leaving the innards of the system for someone else to build - it produces a sort of double-vision between the details and the actual goals of the user.
[1] - "Design of the Portable.net interpreter"
I haven't implemented those in my system yet, and also have no idea how Python or PHP does it.
Is PHP's VM a stack based one? I do read the Zend/ directory of PHP's source, but it is really hard to follow and there is virtually no documentation on the VM
That's relatively easy when implementing an assembler for your opcodes. Just keep track of symbolic labels and their associated jump points (as a simple array or linked list) and process (i.e. finalize or "link") the jump points when the label address becomes known. My "assemblers" often have constructs like:
    L0
    ...
    J1
    ...
    J0
    ...
    L1

where L? registers a symbolic jump destination (i.e. x.label[0].offset = label_offset) and J? emits an unconditional jump and registers a link request (i.e. push(x.label[1].from, jump_opcode_offset)). When a block is finished all the offsets are known; you just process things like:

    for (i = 0; i < x.nlabel; i++) {
        for (j = 0; j < x.label[i].nfrom; j++) {
            patch_in_offset(x.label[i].from[j], x.label[i].offset);
        }
    }
Knowing when to emit symbolic label and jump instructions from the AST is a little more involved, but no more than analyzing the AST for anything else.

Supporting computed gotos would require much more bookkeeping, I'd imagine, and I'm not surprised few languages support that construct. Or maybe not... I haven't really thought it through.
One cool thing about this whole exercise is that it helps to demonstrate why generating some intermediate representation can be easier (conceptually and mechanically) than directly generating runnable code in a single pass. It seems more complex but it really makes things easier.
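The same label/backpatch bookkeeping can be sketched as a tiny Python assembler (the class and method names are made up; the scheme mirrors the L?/J? description above):

```python
class Assembler:
    def __init__(self):
        self.code = []
        self.label_offset = {}   # label id -> resolved code offset
        self.label_from = {}     # label id -> operand slots awaiting patching

    def label(self, n):
        """L?: record where label n lives."""
        self.label_offset[n] = len(self.code)

    def jump(self, n):
        """J?: emit a jump with a placeholder operand, to be linked later."""
        self.code.append("JMP")
        self.label_from.setdefault(n, []).append(len(self.code))
        self.code.append(None)   # placeholder operand

    def emit(self, op):
        self.code.append(op)

    def link(self):
        """Once the block is finished, patch every recorded jump."""
        for n, slots in self.label_from.items():
            for slot in slots:
                self.code[slot] = self.label_offset[n]

a = Assembler()
a.label(0)
a.emit("NOP")
a.jump(1)       # forward reference: label 1 isn't known yet
a.emit("NOP")
a.jump(0)       # backward reference: label 0 is already known
a.label(1)
a.link()
```

Forward references are the whole reason the link step exists; backward ones could be resolved immediately, but handling both the same way keeps the assembler dumb and simple.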
It's interesting that he implemented function calls via a jump. In my VM a function is just mapped to a name (variable), so functions are first class. When the VM gets to a CALL instruction, it loads the bytecode from the hash table (via a lookup of the name).
Since this is a procedural language where statements can be executed outside of a function, implementing the functions as a jump would be difficult because there would need to be multiple jumps between the function definition and statements that aren't in a function.
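That CALL-by-name scheme can be sketched in a few lines of Python (the opcode names and the `functions` table are illustrative):

```python
# Name -> bytecode table: functions are first class, looked up at run time.
functions = {
    "double": [("PUSH_ARG",), ("PUSH", 2), ("MUL",), ("RET",)],
}

def execute(code, arg=None):
    stack = []
    for op in code:
        if op[0] == "PUSH":
            stack.append(op[1])
        elif op[0] == "PUSH_ARG":
            stack.append(arg)
        elif op[0] == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op[0] == "CALL":
            # Look the callee's bytecode up by name and run it recursively,
            # instead of jumping to a fixed address in one flat code stream.
            stack.append(execute(functions[op[1]], stack.pop()))
        elif op[0] == "RET":
            return stack.pop()
    return stack.pop() if stack else None

# Top-level statements outside any function work fine under this scheme:
top_level = [("PUSH", 21), ("CALL", "double")]
```

Because the top-level code and each function body are separate bytecode sequences, there's no need for jumps threading around function definitions.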
I really wish my CS program had a compilers class, but unfortunately they don't, so I had to learn everything on my own.
I recently threw something together sort of like this, just for fun (I like your interpreter's name better though): https://github.com/briansteffens/bemu
It's crazy how much these little projects can clarify your understanding of concepts that seem more complicated or magical than they really are.
Given it has some parts already written in the interest of time...
(You basically implement every layer starting with the CPU and finishing with a working Tetris game)
[1] this one, IIRC https://sourceforge.net/projects/circuit/
It is very close to a bytecode interpreter, except that it comes with a specification that is actually the opcode list for the MOS 6502 (plus a few details you need to take into account when implementing that CPU).
Besides, there are cross-compilers that allow you to generate 6502 code from C for your specific VM (see cc65).
ORG START
START: ST_B 10
LOOP: ST_B 10
ADD_B ;;value will go back on stack
LD_B V1
SM_B V1 ;;value we use next loop
SM_B V1 ;;value we compare
SM_B V1 ;;value we echo to console
TRP 21 ;;writes to the console
ST_S '',13,10,$
TRP 21 ;;writes to the console
CMP_B 50 ;;compares stack to the constant
JNE LOOP
ST_S 'The End',13,10,$
TRP 21 ;;writes to the console
END
Similarly, annotating, transforming, folding, pattern matching on, CPS transforming etc. the produced AST is pretty trivial in a language that supports these constructs. And again, a nightmare in C.
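For example, a bottom-up constant fold over a tuple-encoded AST is a few lines in any language with decent data types (sketched here in Python; in Haskell or another pattern-matching language each case would be one equation):

```python
def fold(node):
    """Constant-fold an AST of ("add"|"mul", l, r) / ("num", n) / ("var", name)."""
    op = node[0]
    if op in ("add", "mul"):
        l, r = fold(node[1]), fold(node[2])   # fold children first
        if l[0] == "num" and r[0] == "num":
            val = l[1] + r[1] if op == "add" else l[1] * r[1]
            return ("num", val)               # both constant: fold the node
        return (op, l, r)                     # otherwise rebuild with folded children
    return node                               # leaves pass through unchanged
```

The same shape (recurse, rebuild, special-case some patterns) covers annotation, CPS transforms, and most other AST passes; it's the shape that's painful to express in C, not any one transform.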
That leaves codegen, but using the right abstractions it turns into a very manageable task as well.
Here's a compiler written in Haskell for LLVM [0].
I've written several lexers in C-like languages; it's not that painful. I wouldn't dare write a parser, though.
http://www.lukas-renggli.ch/blog/petitparser-1
http://www.themoosebook.org/book/internals/petit-parser
This includes the dynamic generation of blocks and arrow-style things...
http://www.multigesture.net/articles/how-to-write-an-emulato...
Okay I guess technically I used a parser generator that I then modified to build an AST and convert it into assembly-like code that fed the interpreter.
I'd add reading and implementing a protocol from RFC - it's a great way to start thinking about design, especially if you read the original RFCs and work forward through the revisions and see what was kept vs deprecated.