Why wouldn’t you be able to run them in parallel using CUDA? You shouldn’t be limited by memory-transfer speed when grouped convolution layers are part of a bigger net.
Note that pointwise 1x1 convolutions are a special case of group convolutions, and I actually think they might be specially optimized in PyTorch (I’d have to run some benchmarks to verify that, though).
Pointwise isn’t a special case of grouped conv; they’re orthogonal ideas.
You can fuse grouped convs (depthwise is a special case of grouped convs) into preceding or following layers. Maybe JAX can do this already? No idea whether any library offers such an optimization out of the box.
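To make the fusion idea concrete, here is a minimal numpy sketch (not any library’s actual optimization pass — all function names are made up for illustration) of one such case: a depthwise conv followed by a 1x1 pointwise conv collapses into a single ordinary full convolution, because z[o] = sum_c p[o,c] * (x[c] ⋆ d[c]) = sum_c x[c] ⋆ (p[o,c] * d[c]).

```python
import numpy as np

def correlate2d_valid(x, k):
    # Plain 2-D cross-correlation, stride 1, no padding.
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def depthwise_then_pointwise(x, d, p):
    # x: (C, H, W); d: (C, kH, kW) depthwise kernels; p: (O, C) pointwise weights.
    y = np.stack([correlate2d_valid(x[c], d[c]) for c in range(x.shape[0])])
    return np.einsum('oc,chw->ohw', p, y)

def fused_full_conv(x, d, p):
    # Fold the pointwise weights into the depthwise kernels:
    # w[o, c] = p[o, c] * d[c], then run one ordinary (ungrouped) convolution.
    w = p[:, :, None, None] * d[None]  # shape (O, C, kH, kW)
    return np.stack([
        np.sum([correlate2d_valid(x[c], w[o, c]) for c in range(x.shape[0])], axis=0)
        for o in range(w.shape[0])
    ])
```

Both paths produce identical outputs; whether the fused single conv is actually faster depends on channel counts and the backend, which is exactly the kind of thing a compiler like XLA could decide.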
Sorry, yes, I was replying to the post about depthwise convolution, and that’s what I meant (though the naming is poor), i.e. the special case of group convolutions where the number of groups equals the number of channels.
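For anyone following along, here’s a minimal numpy sketch (the function name is mine, not a library API) of a grouped 2-D convolution, which makes the terminology concrete: depthwise = groups == number of channels with k×k kernels, pointwise = an ordinary (groups=1) conv with a 1x1 kernel.

```python
import numpy as np

def grouped_conv2d(x, w, groups):
    # x: (C_in, H, W); w: (C_out, C_in // groups, kH, kW); stride 1, no padding.
    C_in, H, W = x.shape
    C_out, cpg_in, kH, kW = w.shape
    assert C_in % groups == 0 and C_out % groups == 0
    assert cpg_in == C_in // groups
    cpg_out = C_out // groups
    out = np.zeros((C_out, H - kH + 1, W - kW + 1))
    for g in range(groups):
        # Each group sees only its own slice of the input channels.
        xg = x[g * cpg_in:(g + 1) * cpg_in]
        for oc in range(g * cpg_out, (g + 1) * cpg_out):
            for i in range(out.shape[1]):
                for j in range(out.shape[2]):
                    out[oc, i, j] = np.sum(xg[:, i:i + kH, j:j + kW] * w[oc])
    return out

x = np.random.randn(4, 5, 5)

# Depthwise: groups == channels, each 3x3 kernel sees exactly one channel.
w_dw = np.random.randn(4, 1, 3, 3)
y_dw = grouped_conv2d(x, w_dw, groups=4)   # shape (4, 3, 3)

# Pointwise: 1x1 kernel mixing all channels, groups=1 — no grouping involved.
w_pw = np.random.randn(8, 4, 1, 1)
y_pw = grouped_conv2d(x, w_pw, groups=1)   # shape (8, 5, 5)
```

This also shows why the two ideas are orthogonal: the kernel size (1x1 vs k×k) and the group count are independent knobs, which is what the correction above was getting at.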