Monokernel deep dive (GPU Engineering): http://blog.kog.ai/building-a-single-kernel-latency-optimize...
Delayed Tensor Parallelism (research): http://blog.kog.ai/delayed-tensor-parallelism-for-faster-tra...
To try the speed on the playground: http://playground.kog.ai