undefined | Better HN

0 pointswk_end2y ago0 comments

Sure. Here's a couple of functions to iterate over an array of "Ball" objects, as you might do in a Breakout-style game that has a multi-ball powerup. I didn't do anything to make it particularly 6502-amenable; it's how I'd write it for a modern machine, probably. Even compiled with -O2: as expected, the code is slow and enormous - a single addition compiles to something like 30 instructions.

[0] https://godbolt.org/z/f6Ysv7nve

0 comments

johnwbyrd2y ago

Seems like your problem is more with the venerable 6502 itself rather than the compiler. Most of that assembly code is spent calculating the offsets inside the Ball struct, which must be done at 16 bits of resolution in every case. The compiler's using the indirect indexed (zero page address with Y offset) 6502 addressing mode to get at all the fields in your struct. It has placed all the variables in zero page, so no instruction is more than two bytes long; additionally, the code in question is entirely linear, with no JSRs or other subroutines. Note in particular how it efficiently uses DEY/INY pairs of one byte instructions to get at low and high bytes of 16-bit memory. Hand-written assembly might be speedier, but not by much and still deal with all the corner cases that your generated code does. "While writing Apple BASIC for a 6502 microprocessor I repeatedly encountered a variant of Murphy's Law. Briefly stated, any routine operating on 16 bit data will require at least twice the code that it should." -Steve Wozniak

dezgeg2y ago

I think that was partially his point - on 6502, typical-looking C code will be horrifically inefficient, at least when compared to other architectures more suitable for C.

For 6502, to get the optimum assembly you'd have to structure your data in structure-of-arrays instead of arrays-of-structures and use indices instead of pointers as much as possible (at least when amount of Ball objects would be < 256).

wk_endOP2y ago

Yes, exactly. Were I hand-writing the assembly for the 6502, I'd make all sorts of decisions that the C code doesn't - and that a compiler can't - to make it more efficient.

Instead of passing in a pointer to two separate functions, I'd write a single UpdateBalls procedure that operated on global data. This data is going to be core to my game logic and physics, so I'd put it all on the ZP. As you suggested, "structure-of-arrays". I'd choose a fixed number of balls so I don't need an argument; maybe I'd set my loop to iterate backwards so I get a free zero check with the decrement, maybe I'd unroll the loop ("dead" balls can be placed off-screen with a dx/dy of 0). I'd probably decide that I don't need 16-bit precision for the deltas (how fast could the balls move, really?), and a 16-8 addition is going to be quicker than a 16-16 one.

The compiler isn't going to make these optimizations; that's not a slight against the compiler. In fact, I just checked - the output [0] when I write my C code this way is pretty close to what I'd hand-write. It's roughly a third the number of instructions and - I'm not going to cycle count, so this is a stab in the dark - would take maybe an order of magnitude fewer cycles to run. semu wasn't written with performance on the 6502 in mind, it's not going to have taken considerations like this, so it's going to inevitably be slow when compiled.

[0] https://godbolt.org/z/WYKKeh9b7

mysterymath2y ago

I'd actually like the llvm-mos to do an automated AoS to SoA analysis and rewrite, but haven't gotten around to it yet. There aren't any intrinsic theoretical obstacles I'm aware of though; it's just difficult code to write.

Now that this has come up again as the stock reason "you can't do C well on the 6502", replacing the stack, the zero page, and the register set, I'm probably going to reprioritize it and put the register allocator on pause.

diydsp2y ago

I found it interesting it didnt utilize the X reg. Maybe instead of updating Y, X could have been used. Then I scrutinized the Y reg use:

On line 14, it uses Y, then decrements it to 0, uses it, increments it, uses decrements, uses it, then increment again.. why not perform the indirect load on lines 18 and 26 without the Y index and eliminate lines 16, 21, and 25?

here's my pseudocode:

rc2 <= base of struct

rc4 = rc2 + 4 // addr of dx

rc5 = rc5 + 0 // addr of x

rc6 = *(&rc2+4)

rc4 = *(&rc4+1) // get low byte

rc5 = rc6 + *(&rc2) // add high byte

rc4 = rc4 + *(&rc2+1) // add low byte

rc2 = rc5 // store high result

*(&rc2+1) = rc4 // store low result

I believe it could have done more to do the work in place, but my batt is about to die :(

j / k navigate · click thread line to collapse