
0 points · CamperBob2 · 1y ago · 0 comments

2x speed increase for ggml by optimizing SIMD: https://github.com/ggml-org/llama.cpp/pull/11453

"99% written by DeepSeek-R1" according to the author.


rfoo · 1y ago

This says more about how much low-hanging fruit remains in "NOOOOO I DON'T WANT TO DOWNLOAD 200MiB PYTORCH I'D BETTER REINVENT THE WHEEL"-gang inference stacks.

To be fair, torch didn't try very hard to optimize for CPU either.

badsectoracula · 1y ago

FWIW, as someone who "NOOO DOESN'T WANT TO DOWNLOAD 200MB[0] PYTORCH"s, I'm glad there are people who make alternative minimal/no-dependency stacks based on C/C++, like ggml.

[0] 200MB is actually a very generous number. I tried to download some AI thing via pip3 the other day and it wanted 600MB or so of CUDA stuff. Meanwhile, I do not even have an Nvidia GPU.

rfoo · 1y ago

The wheel of CPU-only PyTorch 2.6.0 for Python 3.12 is ~170MiB in size.

It is indeed pretty silly that this is not the default, and that you have to go to https://pytorch.org/get-started/locally/ and copy the argument `--index-url https://download.pytorch.org/whl/cpu` to install CPU-only torch. But the alternative would be having the world's scientists wondering why they can't use their GPUs after `pip install torch`, so /shrug
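For anyone who wants that install step spelled out, here is a minimal sketch. The helper name `cpu_torch_install_cmd` is made up for illustration; only the index URL comes from the comment above. It builds the pip command without running it, so nothing gets downloaded by accident.

```python
import shlex
import sys

# Index URL quoted in the comment above (from pytorch.org/get-started/locally).
CPU_INDEX_URL = "https://download.pytorch.org/whl/cpu"


def cpu_torch_install_cmd(package="torch"):
    """Hypothetical helper: build (but do not run) a pip command for CPU-only wheels."""
    return [sys.executable, "-m", "pip", "install", package,
            "--index-url", CPU_INDEX_URL]


if __name__ == "__main__":
    # Print the command so it can be inspected or pasted into a shell.
    print(shlex.join(cpu_torch_install_cmd()))
```

To actually run it, the list could be passed to `subprocess.run(cmd, check=True)`. Which wheel pip ends up picking still depends on your platform and Python version.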

wrsh07 · 1y ago

But as a response to the parent saying "LLMs will be great at ts/js slop but not for infra" it's quite reasonable to say: here's an example of someone applying it to backend optimizations today.

Fwiw, there are always many attempts at optimizing code (assembly etc). This is good! Great to try new techniques. However, you get what you constrain. I've seen optimized code that drops checks which the compiler authors say the standard requires. So if you don't explicitly tell your optimizer "this is a case I care about, this is the desired output", it will ignore that case.
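A rough illustration of "you get what you constrain". This is a hypothetical sketch, not code from the PR: `dot_reference`, `dot_chunked` and `mismatching_lengths` are invented names. The chunked version mimics a 4-wide SIMD loop and silently drops the tail when the length is not a multiple of 4, so a test suite that only uses multiple-of-4 lengths would never notice.

```python
def dot_reference(a, b):
    """Straightforward dot product; handles any length."""
    return sum(x * y for x, y in zip(a, b))


def dot_chunked(a, b, width=4):
    """Hypothetical 'optimized' version: processes `width` elements per step,
    like a SIMD lane group, but forgets the leftover tail elements."""
    total = 0
    full = len(a) - len(a) % width  # number of elements covered by whole chunks
    for i in range(0, full, width):
        total += sum(a[i + j] * b[i + j] for j in range(width))
    return total  # bug: elements from index `full` onward are never added


def mismatching_lengths(max_len=9):
    """Return the input lengths (below max_len) at which the two versions disagree."""
    bad = []
    for n in range(max_len):
        a = list(range(1, n + 1))
        b = [2] * n
        if dot_reference(a, b) != dot_chunked(a, b):
            bad.append(n)
    return bad


if __name__ == "__main__":
    print(mismatching_lengths())  # prints [1, 2, 3, 5, 6, 7]
```

The two versions agree at lengths 0, 4 and 8 and disagree everywhere else, which is the point: the "faster" version is only correct on the cases someone thought to check.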

Did we find a faster implementation than the compiler creates? Well, I mean, sure, if you don't know why the compiler is doing what it is doing.
