consider this simple assembly-language subroutine i wrote in october (http://canonical.org/~kragen/sw/dev3/tetris.S, screencast at https://asciinema.org/a/622461):
@@ Set sleep for iwait to r0 milliseconds.
@@ (r0 must be under 1000)
.thumb_func
waitis: ldr r2, =wait @ struct timeval
movs r3, #0 @ 0 sec
str r3, [r2] @ .tv_sec = 0
ldr r1, =1000 @ multiplier for ms
mul r0, r1
str r0, [r2, #4] @ set .tv_usec
bx lr
.bss
wait: .fill 8 @ the struct timeval
.text
these are all perfectly normal register-machine instructions; you could translate them one-to-one to almost any register machine. on a few of those machines you could drop one instruction, writing something like str #0, [wait]. the whole function is straight-line execution, a single basic block, and seven instructions long. this is almost a best case for a stack machine; it looks like this:
lit(0) ; load immediate constant
lea(wait) ; load address of wait onto stack
! ; store 0 in wait
lit(1000) ; another constant
* ; multiply argument by constant
lea(wait) ; load address again
lit(4) ; offset of .tv_usec
+ ; calculate address of wait.tv_usec
! ; store product in tv_usec
ret ; return from subroutine
that's 10 instructions, about 50% longer than the 6 or 7 of the two-operand register machine. but the typical case is worse. and basically the reason is that the average number of stack manipulations or constant pushes that you need to do to get the right operands on top of the stack for your memory accesses and computational operations is roughly 1. sometimes you'll have a * or ! (store) or + that's not preceded by a stack manipulation or a load-immediate or a load-address operation, and that time the stack machine wins, but other times it'll be preceded by two or three of them, and that time the stack machine loses. so it averages out to about two stack instructions per register-machine instruction. call and return are faster on the stack machine, but passing arguments and return values in registers on the register machine can take away most of that advantage too
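to check the counting, here's a minimal sketch of a toy stack machine in python that runs the ten-instruction listing above and tallies pushes separately from the instructions that do real work. the instruction names follow the listing; everything else (run, PROGRAM, WAIT, using a dict as memory, passing the millisecond argument on the data stack) is made up for the sketch and isn't how any real forth chip works:

```python
# toy stack machine: just enough to run the ten-instruction example.
# WAIT is a hypothetical address standing in for the struct timeval.
WAIT = 0x100

PROGRAM = [
    ("lit", 0), ("lea", WAIT), "!",      # wait.tv_sec = 0
    ("lit", 1000), "*",                  # argument * 1000
    ("lea", WAIT), ("lit", 4), "+", "!", # wait.tv_usec = product
    "ret",
]

def run(program, arg):
    stack = [arg]   # assumption: the argument arrives on the data stack
    memory = {}     # address -> value
    pushes = 0      # lit/lea: only there to set up operands
    ops = 0         # !, *, +, ret: the instructions that do the work
    for insn in program:
        name, operand = insn if isinstance(insn, tuple) else (insn, None)
        if name in ("lit", "lea"):
            stack.append(operand)
            pushes += 1
            continue
        ops += 1
        if name == "!":            # ( value addr -- )
            addr = stack.pop()
            value = stack.pop()
            memory[addr] = value
        elif name == "*":
            b = stack.pop()
            a = stack.pop()
            stack.append(a * b)
        elif name == "+":
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif name == "ret":
            break
        else:
            raise ValueError(f"unknown instruction {name!r}")
    return memory, stack, pushes, ops

memory, stack, pushes, ops = run(PROGRAM, 250)
print(memory[WAIT], memory[WAIT + 4])  # 0 250000
print(pushes, ops)                     # 5 5
```

five pushes for five working instructions is the "roughly 1" ratio, and this is the friendly case: there are no dup, swap, or over instructions at all because every value is used exactly once, in the order it was produced.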
the rtx-2000 did some tricks to sometimes do more than a single stack operation per cycle, but it didn't really escape from this
this doesn't necessarily mean that stack machines like the rtx-2000 are a bad design approach! the design rationale is that you get very short path lengths, so you can clock the design higher than you could clock a design that had register-file muxes in the critical path, and you also avoid the branching penalty that pipelines impose on you, and you use less silicon area. mainstream computation took a different path, but plausibly the two-stack designs like the rtx-2000, the novix nc4000, the shboom, the mup21, and the greenarrays f18a could have been competitive with the mainstream risc etc. approach. but you do need a higher clock rate because each instruction does less
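to put a rough number on "higher clock rate": a back-of-the-envelope sketch, assuming both machines retire one instruction per cycle unless told otherwise (that's an assumption, not a measurement of any real chip, and breakeven_clock_ratio is just a name made up here):

```python
def breakeven_clock_ratio(stack_insns, reg_insns, stack_ipc=1.0, reg_ipc=1.0):
    """how many times faster the stack machine must be clocked to finish
    the same work in the same time as the register machine.
    ipc = instructions retired per cycle (assumed 1.0 by default)."""
    stack_cycles = stack_insns / stack_ipc
    reg_cycles = reg_insns / reg_ipc
    return stack_cycles / reg_cycles

# the subroutine above: 10 stack instructions against 7 (or 6) register ones
print(f"{breakeven_clock_ratio(10, 7):.2f}")  # 1.43
print(f"{breakeven_clock_ratio(10, 6):.2f}")  # 1.67
# the "about two per register-machine instruction" typical case
print(f"{breakeven_clock_ratio(2, 1):.2f}")   # 2.00
```

so under those assumptions the stack machine needs somewhere around 1.4x to 2x the clock to break even, less if it manages more than one stack operation per cycle some of the time.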
i don't remember if dr. ting wrote a book about the rtx-2000, but he did write a very nice book about its predecessor the nc4000, going into some of these tricks: https://www.forth.org/OffeteStore/4001-footstepsFinal.pdf