I'm sharing a blog post, https://mobiusml.github.io/low-rank-llama2/, on our approach to pruning the Llama2 model by leveraging low-rank structures.
In a nutshell, we've managed to reduce the model's parameter count by up to 50%, double the training speed, and speed up inference by 1.25×.
For those interested in the technical details, or looking to replicate our results, the code is openly available for community use and contributions.
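To give a feel for the core idea, here is a minimal sketch (not the post's exact method, and the layer sizes are hypothetical): a dense weight matrix W is approximated by two low-rank factors A and B via truncated SVD, so a linear layer y = W x becomes y = A (B x) with far fewer parameters.

```python
import numpy as np

# Illustrative sketch: low-rank approximation of a dense layer weight.
rng = np.random.default_rng(0)
d_out, d_in, rank = 512, 512, 128  # hypothetical sizes, rank = d/4

W = rng.standard_normal((d_out, d_in))

# Truncated SVD: keep the top `rank` singular directions.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * S[:rank]  # shape (d_out, rank)
B = Vt[:rank, :]            # shape (rank, d_in)

# Parameter count: rank * (d_out + d_in) vs d_out * d_in.
params_dense = W.size                # 262144
params_lowrank = A.size + B.size     # 131072 -> 50% reduction
print(params_dense, params_lowrank)
```

At rank d/4 the factorized layer holds exactly half the parameters of the dense one, which is the regime where a ~50% reduction like the one reported becomes possible; the actual quality/size trade-off depends on how much of the spectrum each layer's weights concentrate in the top singular values.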