> pop 1 st and 2nd value out of local stack and swap them
You don't need to pop at all for swap. The stack isn't actually changing size and can be modified in place. You also should only have one pop for 'add', etc. A lot of people don't seem to realise this.
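To make that concrete, here's a minimal sketch in C (hypothetical stack layout, `sp` pointing one past the top; names are mine, not from any particular VM):

```c
/* Hypothetical operand stack: sp points one past the top element. */
typedef struct { int data[64]; int sp; } Stack;

/* swap: no pops needed -- the stack size is unchanged, so modify in place. */
void op_swap(Stack *s) {
    int tmp = s->data[s->sp - 1];
    s->data[s->sp - 1] = s->data[s->sp - 2];
    s->data[s->sp - 2] = tmp;
}

/* add: one pop, then overwrite the new top in place. */
void op_add(Stack *s) {
    int rhs = s->data[--s->sp];   /* single pop */
    s->data[s->sp - 1] += rhs;    /* result replaces the old top */
}
```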
Also, this article may be of interest:
https://blogs.msdn.microsoft.com/ericlippert/2011/11/28/why-...
TL;DR - a stack machine is conceptually simpler than a register machine, and it doesn't matter if it's slower since you are JIT'ing it anyway.
Pike and Winterbottom (http://www.vitanuova.com/inferno/papers/hotchips.html) used a register-based VM for Plan9/Inferno because it's much faster and easier to write a high-quality JIT from a register-based VM to a register-based CPU (and all mainstream CPUs are register based) than from a stack-based bytecode VM to a register-based CPU.
This is probably the same reason Android chose a register-based VM design for their Java-based platform.
Jitting is not free - an increase in JIT complexity is paid for by lower performance of the code being jitted. Java's HotSpot does wonders, but it takes a lot of latency and high memory use to get the really high-quality results.
It makes sense to do as much work as possible where you have time and CPU cycles to spare (during compilation to VM instruction set) and not on the device when every millisecond and every byte matters.
Lower complexity of stack-based VM is a moot point - when you're jitting, you're doing the hard stuff anyway.
The argument that does make sense to me is, if you want a somewhat faster interpreter at some expense in code size, then a register VM wins. (Perhaps by not as much as commonly thought, because of optimizations like stack caching and superinstructions: http://savannah.gnu.org/projects/vmgen/ But it does seem to win still.) This has the disadvantage of having to implement spilling or some other such complication in the frontend when the number of virtual registers needed exceeds 256 or whatever.
What did you think of the last part of Eric Lippert's post?
> The code is much smaller and much easier to understand. A stack machine is a very simple way to describe a complex computation; by being able to write code for such a simple machine, it lowers the cost of making a compiler. And not only is it easier to write compilers and jitters that target simple stack machines, it is easier to write other code analysis tools as well. The IL verifier, for example, can quickly determine when there is a code path through a method that, say, misaligns the stack, or passes the wrong types of arguments to a method.
https://github.com/Conceptual-Inertia/Conceptum/blob/master/...
See https://news.ycombinator.com/item?id=13155259 for an example of what I mean.
It totally matters. See [1] for previous work on this topic. You need to perform expensive data flow optimisations to generate information the stack format doesn't need to encode, but that a register format does need. Better to do this analysis ahead of time in the compiler than at runtime.
[1] https://www.usenix.org/events/vee05/full_papers/p153-yunhe.p...
I am curious as to where you heard this. Modifying the stack in place seems like an implementation detail. I mean, even if swap wasn't a primitive (which again seems really far fetched), you are still going to have to modify part of the stack in place eventually. It's just that in the naive way you decrement the stack pointer twice then increment it twice, while in the way I outline you don't bother, since the length of the stack is the same after the operation is completed.
Probably easier to write less prose and more pseudocode - I am assuming a downward-growing stack here:
# First approach
temp_a = pop(stack)
temp_b = pop(stack)
push(stack, temp_a)
push(stack, temp_b)
# second approach
temp_a = stack[0]
stack[0] = stack[1]
stack[1] = temp_a

Your skepticism is justified: https://github.com/frjalex/Conceptum/blob/master/src/main.c#...
(note that this doesn't implement a swap by any means)
Something like "reg1 = reg2 + 125" is a single instruction in almost every register ABI, while at the very, very least a stack machine would need two instructions: "push 125, add". If reg1 isn't immediately accessible as the stack head, and reg2 isn't consumed as the head, then you also need load/store/swap/roll/etc stuff wrapping the addition.
That's up to 6 (or even more) stack instructions to perform what a register machine can do in 1.
Also, when talking about interpreters, there is a not-insignificant overhead of fetching and dispatching the next instruction. That overhead is continually compounded when you need multiple instructions to perform what you could do in a single operation otherwise. The simpler the instructions are, the higher percentage of time is "wasted" performing instruction dispatch instead of carrying out core instruction processing. This can be parallelized in hardware instruction dispatch, but is pretty firmly stuck being serial in software interpreters.
Stack machines have very concise source code representations, and their individual instructions are simple and fast. But they can be very bloated in terms of the number of native instructions executed to carry out an overall program's work.
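The dispatch overhead being described can be sketched as a switch-threaded interpreter loop in C (a hypothetical minimal instruction set, not any real VM's; the `for`/`switch` is the per-instruction fetch-and-dispatch cost):

```c
#include <stdint.h>

enum { OP_PUSH, OP_ADD, OP_HALT };

/* Hypothetical switch-dispatched interpreter: the fetch and the switch
 * branch are pure dispatch overhead, paid once per instruction executed.
 * Splitting work across more, simpler instructions multiplies that cost. */
int run(const int32_t *code) {
    int32_t stack[64];
    int sp = 0, pc = 0;
    for (;;) {
        switch (code[pc++]) {                        /* fetch + dispatch */
        case OP_PUSH: stack[sp++] = code[pc++]; break;
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case OP_HALT: return stack[sp - 1];
        }
    }
}
```

Running `{OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT}` executes four dispatches for one addition; a register encoding could do the same work in fewer dispatches.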
Similarly, there's no reason why register-based VMs wouldn't have push/pop instructions.
Is this the graph coloring register allocation optimization problem?
The stack-based virtual machine represents instructions as two-field structs that are essentially predecoded (and larger in memory), while the register VM uses what boils down to an array of words, which are then presumably decoded at run time. (The paper even contains some kind of argument for why the code is represented as an array of single-field structs, but completely neglects to mention what would happen if the structs contained some kind of pre-decoded instructions.)
The function call mechanism of their stack-based VM seems to be essentially modeled after how Forth works, which is not how a typical stack-based VM works when implementing languages that do not directly expose the VM semantics. A typical stack-based VM (e.g. Smalltalk, the JVM, CPython...) allocates a new argument stack for each function invocation (often using alloca() or such, i.e. on the native control stack) and does not directly use the VM argument stack to pass parameters and results between functions.
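A rough sketch of that per-invocation frame layout in C (hypothetical field names; real VMs like the JVM carry an operand stack and locals in each frame rather than sharing one global Forth-style stack):

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical call frame: each invocation gets its own operand stack
 * and locals, linked back to the caller's frame. */
typedef struct Frame {
    struct Frame *caller;   /* return link to the calling frame */
    int32_t *locals;        /* arguments copied in at call time */
    int32_t *opstack;       /* fresh operand stack for this call */
    int sp;                 /* operand stack pointer, starts empty */
} Frame;

Frame *push_frame(Frame *caller, int nlocals, int maxstack) {
    Frame *f = malloc(sizeof *f);
    f->caller  = caller;
    f->locals  = calloc(nlocals, sizeof *f->locals);
    f->opstack = calloc(maxstack, sizeof *f->opstack);
    f->sp = 0;
    return f;
}
```

Results cross frames by being copied between the callee's and caller's operand stacks, not by both functions sharing one stack.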
I imagine in the real world, they use something like
union instruction_packet {
    int64_t value;
    int8_t instructions[8];
};

Which would be less space efficient (when you hit a push instruction you'd need to move to the next packet, even if it was the first element of the array) but gets rid of the decoding/encoding problem.

Instruction decoding for the entire JVM instruction set save tableswitch and lookupswitch amounted to about 50 lines of code, not counting the enum listing every opcode.
Does it use the VM stack as well, instead of the native stack?
I might be wrong.
In any case I'm very interested to learn about what register-based Forth VMs are out there!
However, I'd also point out that the register-based VM was far more difficult to debug, and the compiler was much more long-winded to account for register allocation. Plus, by the time you add on function dispatch overhead and real-world usage patterns, the gap where the register machine is technically better (i.e. beyond just doing fibonacci) narrows somewhat.
https://en.wikipedia.org/wiki/Stack_machine#Performance_disa...
Of course the advantages such as easy code generation and high code density make stack machines a good intermediate compilation target, which then gets compiled to a register machine for actual execution.
Are you referring to a specific language here, or are you saying all register machines always compile to stack machines first?
Can you elaborate why is that? What would those advantages be in practical terms?
It concluded that register based was faster than stack based, but stack based had better code density. The reason seemed to be the branch prediction cost per instruction: fewer but more powerful instructions reduced the number of times the instruction-decode operation occurred.
The article has detailed discussions about the topic.
> generally sucks at the hardware level because most CPUs are register-based
Whoa, hold on there folks. This is not a CPU register VM. It's a 'register-based VM' and it isn't even that.
    stack VM      register VM
    iconst 1      set t3, t1
    iconst 2      set t4, t2
    iadd          add t3, t4, t5
t1 through t5 are allocated in the stack frame. They are NOT magically somehow mapped to x86_64 registers R8-R12.

Actually, calling this a register-based VM is just wrong. It's lazy thinking, but really it's just wrong. The value of t1 will be in memory.
Yes, it's standard terminology which easily lets people misunderstand what's happening under the hood, as in the above quoted case. The article doesn't even say virtual registers which would have helped. It could. It should. It doesn't.
BTW, a good interpreter will cache the top of stack in a CPU register.
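For instance, a top-of-stack-caching loop might look like this (a hypothetical sketch, same toy instruction set as above; `tos` is a local variable the C compiler can keep in a CPU register):

```c
#include <stdint.h>

enum { OP_PUSH, OP_ADD, OP_HALT };

/* Hypothetical TOS-caching interpreter: the top of stack lives in `tos`,
 * so OP_ADD does one memory access instead of three (two loads, one store). */
int32_t run_tos(const int32_t *code) {
    int32_t stack[64];
    int sp = 0, pc = 0;
    int32_t tos = 0;                      /* cached top of stack */
    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH: stack[sp++] = tos; tos = code[pc++]; break;
        case OP_ADD:  tos += stack[--sp]; break;
        case OP_HALT: return tos;
        }
    }
}
```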