You said you didn't explicitly use SIMD, but did you do anything to help the optimizer autovectorize, like float chunking?
I only used traits to more easily implement the scenes; a Scene needs to implement a new(), a start() and an update(), so that I can put them in an array and call them like scenes[current_scene_idx].update() from the main loop.
Also, I used some short and simple closures to avoid repeating the same code in many places (like a scope-local write() closure for the menus that wraps drawtext() with some default parameters).
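For anyone curious what that kind of scope-local closure might look like, here is a minimal sketch. The `Framebuffer` type, the `drawtext()` signature and the default values are all made up for illustration; the real ones aren't shown in this thread.

```rust
// Hypothetical stand-in for the real framebuffer; it just records draw calls.
struct Framebuffer {
    lines: Vec<String>,
}

// Hypothetical signature: the real drawtext() is not shown in the thread.
fn drawtext(buf: &mut Framebuffer, text: &str, x: i32, y: i32, color: u32, scale: u32) {
    buf.lines.push(format!("{x},{y} #{color:06x} x{scale}: {text}"));
}

fn draw_menu(buf: &mut Framebuffer) {
    // Scope-local closure: pins x, color and scale so each call site
    // only has to supply the text and the row.
    let mut write = |text: &str, row: i32| drawtext(buf, text, 16, 16 + row * 12, 0xffffff, 1);

    write("Start", 0);
    write("Options", 1);
    write("Quit", 2);
}
```

The closure borrows `buf` mutably only for as long as `write` is in use, so nothing outlives the function and no lifetime annotations are needed.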
The vast majority of the time is spent in the triangle filling code, where probably some autovectorization is going on when mixing colors. I tried some SIMD there on x86 and didn't see visible improvements.
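To make "float chunking" concrete: the usual trick is to process fixed-size chunks so the compiler sees a constant trip count in the inner loop. The sketch below is not the renderer's code; `mix_colors`, the `LANES` value and the flat `f32` layout are assumptions, and whether it actually vectorizes depends on the compiler version, target and optimization level.

```rust
/// Blend `src` into `dst` by factor `t` (0.0 keeps `dst`, 1.0 gives `src`).
/// Both slices hold one f32 per color channel and must have equal length.
fn mix_colors(dst: &mut [f32], src: &[f32], t: f32) {
    assert_eq!(dst.len(), src.len());

    // Assumed chunk width; 8 f32s fill one 256-bit register.
    const LANES: usize = 8;

    let mut d_chunks = dst.chunks_exact_mut(LANES);
    let mut s_chunks = src.chunks_exact(LANES);

    // Fixed-size chunks: the inner loop always runs exactly LANES times,
    // which gives the optimizer an easy candidate for vectorization.
    for (d, s) in (&mut d_chunks).zip(&mut s_chunks) {
        for (x, y) in d.iter_mut().zip(s) {
            *x += (*y - *x) * t;
        }
    }

    // Scalar tail for the elements that did not fill a whole chunk.
    for (x, y) in d_chunks.into_remainder().iter_mut().zip(s_chunks.remainder()) {
        *x += (*y - *x) * t;
    }
}
```

Checking the generated assembly is the only reliable way to tell whether the loop was vectorized on a given build.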
Apart from the obvious low-hanging fruit (keeping structs simple, keeping the cache happy, not passing data around needlessly) I didn't do anything interesting. And TBH profiling shows a lot of cache misses, but I didn't dig any further.
drawmeshindexed(m: &Mesh, mat: &Mat4x4, tex: &Image, uv_off: &TexCoord, li: &LightingInfo, cam: &Camera, buf: &mut Framebuffer)
so there is also no global state/objects. All state is passed down into the functions.
There were some cases where RefCells came in handy (like having an array of references to all models in the scene), and the compiler suggested lifetimes for some other similar functions, but I ended up not using that specific code. To be clear, I have nothing against those (on the contrary), it just happened that I didn't need them.
One small exception: I have a Vec of Boxes for the scenes, as SceneCommon is an interface and you can't just have an array of it, obviously.
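A rough sketch of that pattern, assuming a trait named `SceneCommon` with `start()` and `update()`. The concrete scenes, the `bool` return value of `update()` and the `run_frames` helper are invented for the example. Constructors are shown as inherent `new()` functions here, since a `new()` returning `Self` inside the trait would need a `where Self: Sized` bound to keep the trait usable as `dyn`.

```rust
trait SceneCommon {
    fn start(&mut self);
    // Hypothetical convention: returns true once the scene has finished.
    fn update(&mut self) -> bool;
}

struct Intro {
    frames: u32,
}

struct Credits;

impl Intro {
    fn new() -> Self {
        Intro { frames: 0 }
    }
}

impl SceneCommon for Intro {
    fn start(&mut self) {
        self.frames = 0;
    }
    fn update(&mut self) -> bool {
        self.frames += 1;
        self.frames >= 3 // the intro lasts three frames in this sketch
    }
}

impl SceneCommon for Credits {
    fn start(&mut self) {}
    fn update(&mut self) -> bool {
        false // the last scene never finishes
    }
}

/// Runs a toy main loop and returns the index of the scene that is active at the end.
fn run_frames(frame_count: u32) -> usize {
    // Trait objects have no fixed size, so each scene is boxed.
    let mut scenes: Vec<Box<dyn SceneCommon>> = vec![Box::new(Intro::new()), Box::new(Credits)];
    let mut current_scene_idx = 0;
    scenes[current_scene_idx].start();

    for _ in 0..frame_count {
        let finished = scenes[current_scene_idx].update();
        if finished && current_scene_idx + 1 < scenes.len() {
            current_scene_idx += 1;
            scenes[current_scene_idx].start();
        }
    }
    current_scene_idx
}
```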
Here's a counterpoint: every time you write a for loop in Rust, you are using iterators.
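To illustrate, the two functions below do the same thing. The second is roughly what the compiler turns the first into (the real desugaring uses `loop` and `match`, but the idea is the same); the function names are just for this example.

```rust
fn sum_with_for(values: &[i32]) -> i32 {
    let mut total = 0;
    for v in values {
        total += *v;
    }
    total
}

fn sum_desugared(values: &[i32]) -> i32 {
    let mut total = 0;
    // `for` calls IntoIterator::into_iter on the expression...
    let mut iter = IntoIterator::into_iter(values);
    // ...and then keeps calling next() until it returns None.
    while let Some(v) = iter.next() {
        total += *v;
    }
    total
}
```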