For the 6502, to get optimal assembly you'd have to lay out your data as structure-of-arrays instead of array-of-structures and use indices instead of pointers as much as possible (at least when the number of Ball objects is < 256).
Instead of passing in a pointer to two separate functions, I'd write a single UpdateBalls procedure that operated on global data. This data is going to be core to my game logic and physics, so I'd put it all on the ZP. As you suggested, "structure-of-arrays". I'd choose a fixed number of balls so I don't need an argument; maybe I'd set my loop to iterate backwards so I get a free zero check with the decrement, maybe I'd unroll the loop ("dead" balls can be placed off-screen with a dx/dy of 0). I'd probably decide that I don't need 16-bit precision for the deltas (how fast could the balls move, really?), and a 16-8 addition is going to be quicker than a 16-16 one.
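To make that concrete, here's a minimal sketch of the layout described above. The array names, NUM_BALLS and its value are all made up for illustration, and the zero-page placement isn't shown since how you request it depends on the toolchain. Positions are 16-bit, split into low/high byte arrays; deltas are signed 8-bit.

```c
#include <stdint.h>

#define NUM_BALLS 8  /* assumed fixed count; under 256 so an 8-bit index is enough */

/* Structure-of-arrays: one array per field, indexed by ball number.
   (Hypothetical names; on a real 6502 build these would go on the zero page.) */
uint8_t ball_x_lo[NUM_BALLS], ball_x_hi[NUM_BALLS];
uint8_t ball_y_lo[NUM_BALLS], ball_y_hi[NUM_BALLS];
int8_t  ball_dx[NUM_BALLS],   ball_dy[NUM_BALLS];

void UpdateBalls(void)
{
    uint8_t i = NUM_BALLS;
    uint16_t sum;

    /* Count down so the loop test is just "did the decrement hit zero". */
    do {
        --i;

        /* 16-bit position += sign-extended 8-bit delta, one byte at a time. */
        sum = (uint16_t)(ball_x_lo[i] + (uint8_t)ball_dx[i]);
        ball_x_lo[i] = (uint8_t)sum;
        ball_x_hi[i] = (uint8_t)(ball_x_hi[i] + (sum >> 8)
                                 + (ball_dx[i] < 0 ? 0xFF : 0x00));

        sum = (uint16_t)(ball_y_lo[i] + (uint8_t)ball_dy[i]);
        ball_y_lo[i] = (uint8_t)sum;
        ball_y_hi[i] = (uint8_t)(ball_y_hi[i] + (sum >> 8)
                                 + (ball_dy[i] < 0 ? 0xFF : 0x00));
    } while (i != 0);
}
```

A "dead" ball just keeps dx/dy at 0, so the loop needs no liveness check. Whether a given compiler turns this into indexed loads and stores is something you'd have to check in its output.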
The compiler isn't going to make these optimizations; that's not a slight against the compiler. In fact, I just checked - the output [0] when I write my C code this way is pretty close to what I'd hand-write. It's roughly a third the number of instructions and - I'm not going to cycle count, so this is a stab in the dark - would take maybe an order of magnitude fewer cycles to run. semu wasn't written with performance on the 6502 in mind and won't have taken considerations like this into account, so it's inevitably going to be slow when compiled.
Now that this has come up again as the stock reason "you can't do C well on the 6502", replacing the stack, the zero page, and the register set, I'm probably going to reprioritize it and put the register allocator on pause.
On line 14, it uses Y, then decrements it to 0, uses it, increments it, uses it, decrements it, uses it, then increments it again. Why not perform the indirect loads on lines 18 and 26 without the Y index and eliminate lines 16, 21, and 25?
here's my pseudocode:
rc2 <= base of struct
rc4 = rc2 + 4 // addr of dx
rc5 = rc2 + 0 // addr of x
rc6 = *(&rc2+4)
rc4 = *(&rc4+1) // get low byte
rc5 = rc6 + *(&rc2) // add high byte
rc4 = rc4 + *(&rc2+1) // add low byte
*(&rc2) = rc5 // store high result
*(&rc2+1) = rc4 // store low result
I believe it could have done more to do the work in place, but my batt is about to die :(
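For reference, a rough C rendering of the same in-place add. This is only a sketch under assumptions: x at offset 0 and dx at offset 4 as in the pseudocode, both 16-bit and stored low byte first (the usual 6502 convention), and the function name is made up. The low bytes are added first so the carry can feed into the high-byte add.

```c
#include <stdint.h>

/* Hypothetical helper: base points at the struct; x is at offset 0, dx at offset 4
   (assumed layout), both 16-bit, low byte first. */
void add_dx_in_place(uint8_t *base)
{
    /* Add the low bytes first and keep the carry in bit 8 of the sum. */
    uint16_t lo = (uint16_t)(base[0] + base[4]);

    base[0] = (uint8_t)lo;                               /* store low result */
    base[1] = (uint8_t)(base[1] + base[5] + (lo >> 8));  /* high bytes plus carry */
}
```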