Try writing a matmul operation in C++ and profile it against the same thing done in Numpy/Pytorch/TensorFlow/Jax. You’ll be surprised.
As soon as you step out of the happy path and need to do any calculation that isn't at least n^2 work for every single Python call, you're looking at order-of-magnitude speed differences.
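To make the per-call-overhead point concrete, here's a rough benchmark sketch (the matrix size and timing harness are my own choices, not from the thread): a naive triple-loop matmul in pure Python against numpy's BLAS-backed `@` operator.

```python
# Sketch: pure-Python matmul vs numpy's native matmul, to show per-call/loop overhead.
import time
import numpy as np

def py_matmul(a, b):
    """Naive triple-loop matrix multiply over lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    out = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            row = b[k]
            for j in range(p):
                out[i][j] += aik * row[j]
    return out

n = 200
x = np.random.rand(n, n)
lists = x.tolist()

t0 = time.perf_counter()
py_matmul(lists, lists)
t_py = time.perf_counter() - t0

t0 = time.perf_counter()
x @ x
t_np = time.perf_counter() - t0

print(f"pure Python: {t_py:.3f}s  numpy: {t_np:.5f}s  ratio ~{t_py / max(t_np, 1e-9):.0f}x")
```

On typical hardware the ratio here is a couple orders of magnitude; the exact number depends on the machine and BLAS build.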
Years ago now (so I'm a bit fuzzy on the details), a friend asked me to help optimize some Python code that took a few days to do one job. I got something like a 10x speedup using numpy, then a further 100x speedup (on the entire program) by porting one small function from optimized numpy to completely naive Rust (I'm sure C or C++ would have been similar). The bottleneck was something like generating a bunch of random numbers, where the distribution of each one depended on the previous numbers, which you just couldn't represent nicely in numpy.
What took 2 days now took 2 minutes. Eyeballing the profiles, I remember thinking you could almost certainly get down to 20 seconds by porting the rest to Rust.
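For readers wondering why numpy couldn't help there, here's a hypothetical sketch of that kind of loop (the actual distribution and parameters are invented, not from the original project): each draw's parameters depend on the previous draw, so the loop carries a data dependency and can't be collapsed into a single vectorized numpy call.

```python
# Sketch of a sequentially-dependent sampling loop that resists vectorization.
# Each iteration pays Python-level call overhead, which is what a Rust/C port removes.
import numpy as np

def dependent_draws(n, seed=0):
    rng = np.random.default_rng(seed)
    out = np.empty(n)          # preallocate once, outside the hot loop
    prev = 1.0
    for i in range(n):
        # hypothetical dependence: the scale of each draw comes from the last value
        prev = rng.normal(loc=0.0, scale=abs(prev) + 0.1)
        out[i] = prev
    return out

samples = dependent_draws(100_000)
print(samples[:5])
```

Because `out[i]` depends on `out[i-1]`, there's no array-at-a-time formulation; the per-iteration interpreter and call overhead dominates, exactly the case where a naive native port wins big.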
There's nothing here for a DB to really help with; the data access patterns are both trivial and optimal. IIRC it was also more like a billion rows, so I'd have some scaling questions (a big enough instance could certainly handle it, but the hardware actually being used was a cheap laptop).
Even if there were, though, I would have been very hesitant to do so. The not-a-full-time-programmer PhD student whose project this was really needed to be able to understand and modify the code. I was pretty hesitant to even introduce a second programming language.
Call overhead and loop overhead are pretty big in Python, though. The way to work around that is to use C-based "primitives", like the stuff from itertools and the builtins for set/list/dict processing (thus avoiding the n^2 case in pure Python). And when memory is an issue (preallocating large data structures can be slow as well), iterators! (E.g. compare range() in newer Python with list(range()).)
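A small sketch of those three tricks together (the specific sizes are arbitrary): lazy `range`, a C-implemented set for membership tests, and itertools for slicing a generator without materializing it.

```python
# Sketch: leaning on C-implemented builtins/itertools instead of pure-Python loops.
import sys
from itertools import islice

# range() is a lazy sequence; list(range(...)) materializes every element up front.
lazy = range(10**8)
print(sys.getsizeof(lazy))        # a few dozen bytes, regardless of length

# Membership tests: a set (a C hash table) turns an O(n) scan into an O(1) lookup,
# avoiding the quadratic pure-Python pattern of `x in some_list` inside a loop.
needles = set(range(0, 10**6, 7))
hits = sum(1 for x in range(1000) if x in needles)
print(hits)

# itertools can take a slice of a generator without ever building the full list.
first_five = list(islice((x * x for x in range(10**8)), 5))
print(first_five)
```

All the heavy lifting above happens inside CPython's C implementations of `range`, `set`, and `islice`; the Python-level loop bodies stay tiny.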
And if I recall correctly there was no allocation in the hot loop; a single large array was initialized via numpy beforehand to store the values. Certainly that's one of the first things I would think to fix.
I was strongly convinced at the time that there was no significant improvement left in Python: >99% of the time was being spent in this one function, and there was no way to move the loop into native code given the primitives available from numpy. Admittedly I could have been wrong, and I'm not about to revisit the code now. It has been years and it is no longer in use, so everything I'm saying is based on years-old memories.
The code I've written can complete 1.7 million evaluations per core, per second, on older hardware, and is used to evaluate things to 1e-6 accuracy, which is pretty neat for what I'm working on.
https://github.com/numpy/numpy/blob/main/numpy/core/src/mult...
Your assertion was that numpy etc. will be faster than something else despite being Python:
> Try writing a matmul operation in C++ and profile it against the same thing done in Numpy/Pytorch/TensorFlow/Jax. You’ll be surprised.
I mean, TensorFlow is C++/CUDA!
https://www.boost.org/doc/libs/1_75_0/libs/numeric/ublas/doc...
I think BLAS (a C version, not the Boost one) is also the library numpy is using; numpy itself is not written in Python. That's why it is fast: it is C.
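You can check this directly; numpy ships a `show_config()` helper that reports the BLAS/LAPACK build it links against (the output format varies by numpy version and build):

```python
# Inspect which native BLAS/LAPACK implementation this numpy build uses,
# then confirm a matmul dispatches to it and agrees with an explicit dot product.
import numpy as np

np.show_config()                  # prints the linked BLAS/LAPACK libraries

a = np.random.rand(500, 500)
b = np.random.rand(500, 500)
c = a @ b                         # the @ operator calls into that native BLAS
print(c.shape)
```

Depending on the wheel you installed, this typically reports OpenBLAS, MKL, or Accelerate; the Python layer is just a thin dispatcher over those native routines.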