Try writing a matmul operation in C++ and profile it against the same thing done in Numpy/Pytorch/TensorFlow/Jax. You’ll be surprised.
As soon as you step out of the happy path and need to do any calculation that isn't at least n^2 work for every single Python call, you're looking at order-of-magnitude speed differences.
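To make the per-call-overhead point concrete, here's a rough benchmark sketch (the matrix size and timing harness are my own choices, not from the thread): a naive triple-loop matmul in pure Python against numpy's BLAS-backed `@` operator.

```python
# Sketch: pure-Python matmul vs numpy's native matmul, to show per-call/loop overhead.
import time
import numpy as np

def py_matmul(a, b):
    """Naive triple-loop matrix multiply over lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    out = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            row = b[k]
            for j in range(p):
                out[i][j] += aik * row[j]
    return out

n = 200
x = np.random.rand(n, n)
lists = x.tolist()

t0 = time.perf_counter()
py_matmul(lists, lists)
t_py = time.perf_counter() - t0

t0 = time.perf_counter()
x @ x
t_np = time.perf_counter() - t0

print(f"pure Python: {t_py:.3f}s  numpy: {t_np:.5f}s  ratio ~{t_py / max(t_np, 1e-9):.0f}x")
```

On typical hardware the ratio here is a couple orders of magnitude; the exact number depends on the machine and BLAS build.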
Years ago now (so I'm a bit fuzzy on the details), a friend asked me to help optimize some Python code that took a few days to do one job. I got something like a 10x speedup using numpy, then a further 100x speedup (on the entire program) by porting one small function from optimized numpy to completely naive Rust (I'm sure C or C++ would have been similar). The bottleneck was something like generating a bunch of random numbers, where the distribution of each one depended on the previous numbers, which you just couldn't represent nicely in numpy.
What took 2 days now took 2 minutes. Eyeballing the profiles, I remember thinking you could almost certainly get down to 20 seconds by porting the rest to Rust.
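For readers wondering why numpy couldn't help there, here's a hypothetical sketch of that kind of loop (the actual distribution and parameters are invented, not from the original project): each draw's parameters depend on the previous draw, so the loop carries a data dependency and can't be collapsed into a single vectorized numpy call.

```python
# Sketch of a sequentially-dependent sampling loop that resists vectorization.
# Each iteration pays Python-level call overhead, which is what a Rust/C port removes.
import numpy as np

def dependent_draws(n, seed=0):
    rng = np.random.default_rng(seed)
    out = np.empty(n)          # preallocate once, outside the hot loop
    prev = 1.0
    for i in range(n):
        # hypothetical dependence: the scale of each draw comes from the last value
        prev = rng.normal(loc=0.0, scale=abs(prev) + 0.1)
        out[i] = prev
    return out

samples = dependent_draws(100_000)
print(samples[:5])
```

Because `out[i]` depends on `out[i-1]`, there's no array-at-a-time formulation; the per-iteration interpreter and call overhead dominates, exactly the case where a naive native port wins big.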
There's nothing here for a DB to really help with; the data access patterns are both trivial and optimal. IIRC it was also more like a billion rows, so I'd have some scaling questions (a big enough instance could certainly handle it, but the hardware actually being used was a cheap laptop).
Even if there were, though, I would have been very hesitant to do so. The not-a-full-time-programmer PhD student whose project this was really needed to be able to understand and modify the code. I was pretty hesitant to even introduce a second programming language.
Call overhead and loop overhead are pretty big in Python, though. The way to work around that is to use C-based "primitives", like the stuff from itertools and the builtins for set/list/dict processing (thus avoiding the n^2 case in pure Python). And when memory is an issue (preallocating large data structures can be slow as well), iterators! (E.g. compare range() in newer Python with list(range()).)
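A small sketch of those three tricks together (the specific sizes are arbitrary): lazy `range`, a C-implemented set for membership tests, and itertools for slicing a generator without materializing it.

```python
# Sketch: leaning on C-implemented builtins/itertools instead of pure-Python loops.
import sys
from itertools import islice

# range() is a lazy sequence; list(range(...)) materializes every element up front.
lazy = range(10**8)
print(sys.getsizeof(lazy))        # a few dozen bytes, regardless of length

# Membership tests: a set (a C hash table) turns an O(n) scan into an O(1) lookup,
# avoiding the quadratic pure-Python pattern of `x in some_list` inside a loop.
needles = set(range(0, 10**6, 7))
hits = sum(1 for x in range(1000) if x in needles)
print(hits)

# itertools can take a slice of a generator without ever building the full list.
first_five = list(islice((x * x for x in range(10**8)), 5))
print(first_five)
```

All the heavy lifting above happens inside CPython's C implementations of `range`, `set`, and `islice`; the Python-level loop bodies stay tiny.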
And if I recall correctly there was no allocation in the hot loop; a single large array was initialized via numpy beforehand to store the values. Certainly that's one of the first things I would think to fix.
I was strongly convinced at the time that there was no significant improvement left in Python: >99% of the time was being spent in this one function, and there was no way to move the loop into native code given the primitives available from numpy. Admittedly I could have been wrong, and I'm not about to revisit the code now. It has been years and it is no longer in use, so everything I'm saying is based on years-old memories.
The code I've written can complete 1.7 million evaluations per core, per second, on older hardware, and is used to evaluate things to 1e-6 accuracy, which is pretty neat for what I'm working on.
https://github.com/numpy/numpy/blob/main/numpy/core/src/mult...
Your assertion was that numpy etc. will be faster than something else despite being Python:
> Try writing a matmul operation in C++ and profile it against the same thing done in Numpy/Pytorch/TensorFlow/Jax. You’ll be surprised.
I mean, TensorFlow is C++/CUDA!
https://www.boost.org/doc/libs/1_75_0/libs/numeric/ublas/doc...
I think BLAS (a C version, not the Boost one) is also the library numpy is using; numpy itself is not written in Python. That's why it is fast: it is C.
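You can check this directly; numpy ships a `show_config()` helper that reports the BLAS/LAPACK build it links against (the output format varies by numpy version and build):

```python
# Inspect which native BLAS/LAPACK implementation this numpy build uses,
# then confirm a matmul dispatches to it and agrees with an explicit dot product.
import numpy as np

np.show_config()                  # prints the linked BLAS/LAPACK libraries

a = np.random.rand(500, 500)
b = np.random.rand(500, 500)
c = a @ b                         # the @ operator calls into that native BLAS
print(c.shape)
```

Depending on the wheel you installed, this typically reports OpenBLAS, MKL, or Accelerate; the Python layer is just a thin dispatcher over those native routines.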