It's a really deep problem actually! Doing y then x is better but it's far from "optimal".
These parallel programming course notes have an example of using z-order curves which is slightly better yet: http://ppc.cs.aalto.fi/ch2/v7/
And the best solution requires something like Halide https://halide-lang.org/ to find the best traversal order.