Even the interleaved execution introduced in Volta can still only execute one type of instruction at a time [1]. This feature wasn't meant to accelerate code, but to allow more composable programming models [2].
Going off the diagram, it looks equivalent to rapidly switching between predicates, not executing two different operations at once.
if (threadIdx.x < 4) {
  A;
  B;
} else {
  X;
  Y;
}
Z;
The diagram shows how this executes in the following order:

Volta:
->|    ->X    ->Y    ->Z|->
->|->A    ->B    ->Z    |->

pre Volta:
->|      ->X->Y|->Z
->|->A->B      |->Z

The SIMD equivalent of pre Volta is:

vslt mask, vid, 4
vopA ..., mask
vopB ..., mask
vopX ..., ~mask
vopY ..., ~mask
vopZ ...

The Volta model is:

vslt mask, vid, 4
vopA ..., mask
vopX ..., ~mask
vopB ..., mask
vopY ..., ~mask
vopZ ...
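
To make the "switching between predicates" reading concrete, here is a minimal host-side C++ sketch of the two issue orders above. It is a toy model, not how the hardware is implemented: the 8-lane width and the names Pred, Issue, run, pre_volta_schedule and volta_schedule are all made up for illustration. Each issue slot applies exactly one op to whichever lanes the predicate enables, and each lane records the ops it executed.

```cpp
#include <array>
#include <string>
#include <vector>

constexpr int kLanes = 8;  // toy width (assumption); a real warp has 32 lanes

// Which lanes an issue slot applies to: mask, ~mask, or unpredicated.
enum class Pred { Mask, NotMask, All };

// One issue slot: a single op, applied under a single predicate.
struct Issue { char op; Pred pred; };

using Trace = std::array<std::string, kLanes>;

// Run a schedule and return, per lane, the ops that lane executed in order.
Trace run(const std::vector<Issue>& schedule) {
    std::array<bool, kLanes> mask{};
    for (int lane = 0; lane < kLanes; ++lane) mask[lane] = lane < 4;  // vslt mask, vid, 4

    Trace trace;
    for (const Issue& issue : schedule) {          // one op type per slot
        for (int lane = 0; lane < kLanes; ++lane) {
            bool active = issue.pred == Pred::All ||
                          (issue.pred == Pred::Mask) == mask[lane];
            if (active) trace[lane] += issue.op;
        }
    }
    return trace;
}

// Pre-Volta order from the comment: A B X Y Z.
std::vector<Issue> pre_volta_schedule() {
    return {{'A', Pred::Mask}, {'B', Pred::Mask},
            {'X', Pred::NotMask}, {'Y', Pred::NotMask}, {'Z', Pred::All}};
}

// Volta order from the comment: A X B Y Z (the two sides interleaved).
std::vector<Issue> volta_schedule() {
    return {{'A', Pred::Mask}, {'X', Pred::NotMask},
            {'B', Pred::Mask}, {'Y', Pred::NotMask}, {'Z', Pred::All}};
}
```

Calling run(pre_volta_schedule()) and run(volta_schedule()) gives the same per-lane traces: lanes 0 to 3 see A, B, Z and lanes 4 to 7 see X, Y, Z. Both schedules take five issue slots, so in this model the interleaving changes the order but not the amount of work.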

[1] https://chipsandcheese.com/i/138977322/shader-execution-reor...
[2] https://stackoverflow.com/questions/70987051/independent-thr...