That's pretty much how that works, and they sync threads inside warp, I have been groping in the dark for a while, and always hoped someone can write up some details to help me to get the idea straight.
Intel, AMD, Google(TPU) all have their own way to schedule 'kernel's which are very different from CUDA, there are no details about them, I was just curious like 'how do they work across CPU|GPU'?
Thanks for the reply.