I didn't do a pure-Python version; mine uses numpy, and although I haven't benchmarked it carefully, it runs the stories15M model much faster than 1.3 tok/sec on my 2018 MacBook. You should try swapping numpy matrix multiplication in for the hand-rolled matmul, either `np.matmul` or the `@` operator (which turns out to be native Python syntax since 3.5, via PEP 465, dispatching to numpy for arrays), and see what changes.
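To illustrate the kind of swap I mean, here's a sketch. The pure-Python matmul below is hypothetical (I haven't looked at the exact loop in that implementation), but the numpy replacement is a drop-in for anything shaped like it:

```python
import numpy as np

# Hypothetical pure-Python matmul of the kind a single-file
# implementation might use: W is a list of rows, x is a list of floats.
def matmul_pure(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

# The same operation via numpy: @ dispatches to numpy's BLAS-backed
# matmul for ndarrays, which is where most of the speedup comes from.
def matmul_np(W, x):
    return np.asarray(W) @ np.asarray(x)
```

Both return the same values; the numpy version just pushes the inner loops down into compiled code.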
7B and 13B are already quite performant with mlc-llm (which uses an Apache TVM Vulkan/Metal backend). Llama.cpp has the potential to perform well too.
These "single file" implementations aren't meant to be optimized or feature-rich, I don't think.