I'm sharing a blog post, https://mobiusml.github.io/low-rank-llama2/, on our approach to pruning the Llama2 model by leveraging low-rank structures.
In a nutshell, we've managed to reduce the model's parameter count by up to 50%, double the training speed, and speed up inference by 1.25×.
For those interested in the technical details, or looking to replicate our results, the code is openly available for community use and contributions.
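To give a feel for the core idea, here is a minimal sketch (not the post's exact method, and the layer sizes are hypothetical): a dense weight matrix W is approximated by two low-rank factors A and B via truncated SVD, so a linear layer y = W x becomes y = A (B x) with far fewer parameters.

```python
import numpy as np

# Illustrative sketch: low-rank approximation of a dense layer weight.
rng = np.random.default_rng(0)
d_out, d_in, rank = 512, 512, 128  # hypothetical sizes, rank = d/4

W = rng.standard_normal((d_out, d_in))

# Truncated SVD: keep the top `rank` singular directions.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * S[:rank]  # shape (d_out, rank)
B = Vt[:rank, :]            # shape (rank, d_in)

# Parameter count: rank * (d_out + d_in) vs d_out * d_in.
params_dense = W.size                # 262144
params_lowrank = A.size + B.size     # 131072 -> 50% reduction
print(params_dense, params_lowrank)
```

At rank d/4 the factorized layer holds exactly half the parameters of the dense one, which is the regime where a ~50% reduction like the one reported becomes possible; the actual quality/size trade-off depends on how much of the spectrum each layer's weights concentrate in the top singular values.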