You (or your compiler) write the instructions and data into unified memory (up to 192 GB) and have each core jump to its first instruction (usually the start of a loop). GPU and ANE cores are not fundamentally different from CPU cores; they just have fewer transistors (gates) each and therefore more limitations on what a register can address and on which data types and instructions a core can execute. Some cores can only execute the same instruction as their neighboring cores in a team, but on different data, or at a different time, synchronized with those neighbors. They are still Turing-complete processors, so in essence they are the same as their cousins, the CPU cores. Sometimes a core's input or output addresses are part of a pipeline between cores, which limits the address offsets that core can use.
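To make the "same instruction, different data" idea concrete, here is a toy sketch in Python. It is not how any Apple GPU or ANE is actually programmed; the function name run_team and the two opcodes are made up for illustration. It only models a team of lanes sharing one instruction stream while each lane keeps its own register.

```python
# Toy model: a "team" of lanes runs one shared instruction stream in lockstep.
# All names here (run_team, the opcode strings) are hypothetical, for illustration only.

def run_team(program, lane_data):
    """Execute `program` on every lane; each lane keeps its own accumulator."""
    acc = list(lane_data)             # one private register per lane
    for op, operand in program:       # a single program counter for the whole team
        for lane in range(len(acc)):  # every lane executes the SAME instruction
            if op == "add":
                acc[lane] += operand
            elif op == "mul":
                acc[lane] *= operand
            else:
                raise ValueError(f"unknown opcode: {op}")
    return acc

program = [("mul", 2), ("add", 1)]    # same instructions for all lanes
result = run_team(program, [1, 2, 3, 4])
print(result)                         # [3, 5, 7, 9]
```

The inner loop over lanes stands in for hardware that does the work in parallel; the point is that there is only one program counter for the whole team, so no lane can wander off and run a different instruction.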
macOS's only role is to allocate and protect the instruction and data memory regions for the GPU and ANE processors.