"99% written by DeepSeek-R1" according to the author.
To be fair, torch didn't try very hard to optimize for CPU either.
[0] 200MB is actually a very generous number. I tried to download some AI thing via pip3 the other day and it wanted 600MB or so of CUDA stuff. Meanwhile, I don't even have an Nvidia GPU.
It is indeed pretty silly that CPU-only isn't the default and that you have to go to https://pytorch.org/get-started/locally/ and copy the argument `--index-url https://download.pytorch.org/whl/cpu` to install CPU-only torch. But the alternative would be having the world's scientists wondering why they can't use their GPUs after `pip install torch`, so /shrug
FWIW, there are always many attempts at optimizing code (assembly etc.). This is good! It's great to try new techniques. However, you get what you constrain. I've seen optimized code that drops checks that the compiler authors say the standard requires. If you don't explicitly tell your optimizer "this is a case I care about, and this is the desired output", it will ignore that case.
Did we find a faster implementation than the compiler creates? Well, I mean, sure, if you don't know why the compiler is doing what it is doing.
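To make the "you get what you constrain" point concrete, here is a toy Python sketch. Everything in it is hypothetical (the function names, the "spec", the test cases) and it has nothing to do with the actual project being discussed. It just shows a "faster" candidate that drops a check and still passes as long as nobody lists the dropped case as something they care about.

```python
def reference_sign(x: int) -> int:
    """Reference behaviour (hypothetical spec): -1, 0, or 1."""
    if x == 0:
        return 0
    return 1 if x > 0 else -1


def optimized_sign(x: int) -> int:
    """Hypothetical 'optimized' candidate: one branch fewer,
    but it silently drops the zero case the spec requires."""
    return 1 if x > 0 else -1


def passes(candidate, cases) -> bool:
    """Accept a candidate if it matches the reference on every listed case."""
    return all(candidate(x) == reference_sign(x) for x in cases)


# Constraint set that never mentions zero: the broken candidate is accepted.
loose_cases = [-5, -1, 1, 7]
# Same set plus the case we actually care about: the candidate is rejected.
strict_cases = loose_cases + [0]

loose_ok = passes(optimized_sign, loose_cases)
strict_ok = passes(optimized_sign, strict_cases)

print("accepted with loose constraints: ", loose_ok)
print("accepted with strict constraints:", strict_ok)
```

With the loose case list the shortcut looks like a free win; it only gets caught once zero is part of what you check. Same idea, I'd assume, when the "optimizer" is an LLM or a superoptimizer instead of a person.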