buescher 1y ago

I remember the bragging point on the RTX2000 in the very late eighties was "a MIP per megahertz".

my limited experience with stack machines is that a stack mip is about half a register-machine mip :-(

consider this simple assembly-language subroutine i wrote in october (http://canonical.org/~kragen/sw/dev3/tetris.S, screencast at https://asciinema.org/a/622461):

            @@ Set sleep for iwait to r0 milliseconds.
            @@ (r0 must be under 1000)
            .thumb_func
    waitis: ldr r2, =wait           @ struct timeval
            movs r3, #0             @ 0 sec
            str r3, [r2]            @ .tv_sec = 0
            ldr r1, =1000           @ multiplier for ms
            mul r0, r1
            str r0, [r2, #4]        @ set .tv_usec
            bx lr

            .bss
    wait:   .fill 8                 @ the struct timeval
            .text

these are all perfectly normal register-machine instructions; you could translate them one-to-one to almost any register machine. on a few machines you could drop one instruction, writing something like str #0, [wait]. the whole function is straight-line execution, a single basic block, seven instructions long. this is almost a best case for a stack machine; it looks like this:

    lit(0)      ; load immediate constant
    lea(wait)   ; load address of wait onto stack
    !           ; store 0 in wait
    lit(1000)   ; another constant
    *           ; multiply argument by constant
    lea(wait)   ; load address again
    lit(4)      ; offset of .tv_usec
    +           ; calculate address of wait.tv_usec
    !           ; store product in tv_usec
    ret         ; return from subroutine

that's 10 instructions, about 50% longer than the 6 or 7 of the two-operand register machine. but the typical case is worse. basically the reason is that, on average, you need roughly 1 stack manipulation or constant push to get the right operands on top of the stack for each memory access or computational operation. sometimes you'll have a * or ! (store) or + that's not preceded by a stack manipulation, a load-immediate, or a load-address operation, and that time the stack machine wins, but other times it'll be preceded by two or three of them, and that time the stack machine loses

so it averages out to about two stack instructions per register-machine instruction. call and return are faster on the stack machine, but passing arguments and return values in registers on the register machine can take away most of that advantage too
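
to make the counting concrete, here is a minimal python sketch of a toy stack machine running the ten-instruction sequence above. it is not a model of the rtx-2000 or any real chip: the address given to wait, the dict used as memory, and the split of instructions into "setup" (lit, lea) and "work" (everything else, including ret) are all assumptions made for illustration

```python
# toy stack-machine model of the ten-instruction sequence above.
# the instruction names mirror the listing; the address of `wait`, the
# dict-as-memory, and the setup/work split are assumptions for illustration.

WAIT = 0x100  # hypothetical address of the 8-byte struct timeval

PROGRAM = [
    ("lit", 0), ("lea", WAIT), ("!", None),
    ("lit", 1000), ("*", None),
    ("lea", WAIT), ("lit", 4), ("+", None), ("!", None),
    ("ret", None),
]

SETUP_OPS = {"lit", "lea"}  # instructions that only push an operand


def run(program, arg):
    """Execute program with arg on the stack; return (memory, executed count)."""
    stack, memory, executed = [arg], {}, 0
    for op, operand in program:
        executed += 1
        if op in SETUP_OPS:
            stack.append(operand)
        elif op == "!":            # forth-style store: ( value addr -- )
            addr = stack.pop()
            memory[addr] = stack.pop()
        elif op == "*":
            stack.append(stack.pop() * stack.pop())
        elif op == "+":
            stack.append(stack.pop() + stack.pop())
        elif op == "ret":
            break
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return memory, executed


memory, executed = run(PROGRAM, arg=250)   # ask for a 250 ms sleep
setup = sum(1 for op, _ in PROGRAM if op in SETUP_OPS)
work = len(PROGRAM) - setup

print(memory)                               # tv_sec at WAIT, tv_usec at WAIT + 4
print(executed, setup, work, setup / work)  # instruction counts and setup per work op
```

under these assumptions five of the ten instructions are setup, one per work instruction, which is the "roughly 1" above. the register version spends three of its seven instructions loading constants and addresses (ldr r2, movs r3, ldr r1)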

the rtx-2000 did some tricks to sometimes do more than a single stack operation per cycle, but it didn't really escape from this

this doesn't necessarily mean that stack machines like the rtx-2000 are a bad design approach! the design rationale is that you get very short path lengths, so you can clock the design higher than you could clock a design that had register-file muxes in the critical path, and you also avoid the branching penalty that pipelines impose on you, and you use less silicon area. mainstream computation took a different path, but plausibly the two-stack designs like the rtx-2000, the novix nc4000, the shboom, the mup21, and the greenarrays f18a could have been competitive with the mainstream risc etc. approach. but you do need a higher clock rate because each instruction does less
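
a back-of-the-envelope sketch of that last point, assuming both machines retire exactly one instruction per cycle (a simplification; neither real chip is modeled, and the 10 MHz register-machine clock is a made-up number):

```python
# break-even clock estimate, assuming one instruction per cycle on both
# machines. the 10 MHz register-machine clock is hypothetical.

def break_even_clock_mhz(register_clock_mhz, stack_instrs, register_instrs):
    """Clock a stack machine needs to finish the same work in the same time."""
    return register_clock_mhz * stack_instrs / register_instrs


# the waitis example: 10 stack instructions versus 7 register instructions
print(break_even_clock_mhz(10.0, 10, 7))

# the rough typical case: about 2 stack instructions per register instruction
print(break_even_clock_mhz(10.0, 2, 1))
```

so with the typical two-to-one instruction ratio, the stack machine has to run at about twice the clock just to break even, before counting any savings on calls, branches, or silicon area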

i don't remember if dr. ting wrote a book about the rtx-2000, but he did write a very nice book about its predecessor the nc4000, going into some of these tricks: https://www.forth.org/OffeteStore/4001-footstepsFinal.pdf

kragen 1y ago
