A little tip, you don't have to compare actual distances, you can compare squared distances just as well. Then in `norm < max_dist`, you don't have to do a `sqrt()` for every `norm`. Saves a few CPU ticks as well.
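A quick numpy sketch of the squared-distance trick (array names here are made up):

```python
import numpy as np

points = np.random.rand(1000, 2)
target = np.array([0.5, 0.5])
max_dist = 0.3

# Compare squared distances against max_dist**2: no sqrt per point.
diff = points - target
sq_dist = (diff * diff).sum(axis=1)
mask = sq_dist < max_dist ** 2

# Same boolean result as the sqrt version, just cheaper.
assert (mask == (np.linalg.norm(diff, axis=1) < max_dist)).all()
```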
I once rewrote a GDI+ point transformation routine in pure C# and got a 200x speedup, just because the routine was riddled with needless virtual constructors, copying type conversions, and something called CreateInstanceSlow. Ten years later, I gathered a few of these anecdotes and wrote the Geometry for Programmers book (https://www.manning.com/books/geometry-for-programmers) with its main message: when you know the geometry behind your tools, you can either use them efficiently or rewrite them completely.
He uses the same data structure in both the Python and Rust code, so I imagine that he can get an extra 4x speedup at least if he rewrites his code with memory layout in mind.
I couldn't get into this in the article (it would be too long), but this is a great point and the original library does this in a lot of places.
One problem in our use case is that the actual struct members are pretty big and we need to group/regroup them a lot.
The fastest approach for us was to do something like in the article for the initial filtering, then build a hashmap of SoAs with the needed data, and do the heavier math on that.
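A rough Python sketch of that filter-then-regroup-into-SoA pattern (all names hypothetical; the real code is presumably Rust):

```python
import numpy as np
from collections import defaultdict

# Stand-in for "big structs": only group/pos/weight matter for the math,
# the real objects carry much more.
class Item:
    def __init__(self, group, pos, weight):
        self.group, self.pos, self.weight = group, pos, weight

items = [Item(i % 3, np.random.rand(2), float(i)) for i in range(100)]

# 1) cheap initial filter, 2) regroup only the needed fields per key
soa = defaultdict(lambda: {"pos": [], "weight": []})
for it in items:
    if it.weight > 10.0:  # stand-in for the initial filtering step
        soa[it.group]["pos"].append(it.pos)
        soa[it.group]["weight"].append(it.weight)

# densify so the heavy math runs on contiguous arrays
dense = {k: {f: np.asarray(v) for f, v in d.items()} for k, d in soa.items()}
totals = {k: d["weight"].sum() for k, d in dense.items()}
```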
It's easy to make intuitive guesses about how things are working that seem completely reasonable. But you have to benchmark because modern CPUs are so complex that reasoning and intuition mostly don't work.
Programs used for testing are below. I ran everything twice because my system wasn't always idle, so take the lower of the 2 runs.
[jim@mbp ~]$ sh -x x
+ cat x1.c
#include <stdio.h>
#define NUM 1000000000
struct {
    int x;
    int y;
} p[NUM];
int main() {
    int i, s;
    for (i = 0; i < NUM; i++) {
        p[i].x = i;
        p[i].y = i;
    }
    s = 0;
    for (i = 0; i < NUM; i++) {
        s += p[i].x + p[i].y;
    }
    printf("s=%d\n", s);
}
+ cc -o x1 x1.c
+ ./x1
s=1808348672
real 0m12.078s
user 0m7.319s
sys 0m4.363s
+ ./x1
s=1808348672
real 0m9.415s
user 0m6.677s
sys 0m2.685s
+ cat x2.c
#include <stdio.h>
#define NUM 1000000000
int x[NUM];
int y[NUM];
int main() {
    int i, s;
    for (i = 0; i < NUM; i++) {
        x[i] = i;
        y[i] = i;
    }
    s = 0;
    for (i = 0; i < NUM; i++) {
        s += x[i] + y[i];
    }
    printf("s=%d\n", s);
}
+ cc -o x2 x2.c
+ ./x2
s=1808348672
real 0m9.753s
user 0m6.713s
sys 0m2.967s
+ ./x2
s=1808348672
real 0m9.642s
user 0m6.674s
sys 0m2.902s
+ cat x3.c
#include <stdio.h>
#define NUM 1000000000
struct {
    int x;
    int y;
} p[NUM];
int main() {
    int i, s;
    for (i = 0; i < NUM; i++) {
        p[i].x = i;
    }
    for (i = 0; i < NUM; i++) {
        p[i].y = i;
    }
    s = 0;
    for (i = 0; i < NUM; i++) {
        s += p[i].x;
    }
    for (i = 0; i < NUM; i++) {
        s += p[i].y;
    }
    printf("s=%d\n", s);
}
+ cc -o x3 x3.c
+ ./x3
s=1808348672
real 0m13.844s
user 0m11.095s
sys 0m2.700s
+ ./x3
s=1808348672
real 0m13.686s
user 0m11.038s
sys 0m2.611s
+ cat x4.c
#include <stdio.h>
#define NUM 1000000000
int x[NUM];
int y[NUM];
int main() {
    int i, s;
    for (i = 0; i < NUM; i++)
        x[i] = i;
    for (i = 0; i < NUM; i++)
        y[i] = i;
    s = 0;
    for (i = 0; i < NUM; i++)
        s += x[i];
    for (i = 0; i < NUM; i++)
        s += y[i];
    printf("s=%d\n", s);
}
+ cc -o x4 x4.c
+ ./x4
s=1808348672
real 0m13.530s
user 0m10.851s
sys 0m2.633s
+ ./x4
s=1808348672
real 0m13.489s
user 0m10.856s
sys 0m2.603s

So much scientific computing code suffers from having its core packages split away from the core language - at what point do we stop and abandon Python for languages that actually make sense? Obviously Julia is the big example here, but its interest, development and ecosystem don't seem to be growing at a serious pace. Given that the syntax is moderately similar and the performance benefits are often 10x, what's stopping people from switching???
Also, few scientific programmers have any notion of the C or Fortran under the hood. Most are happy to stand on the shoulders of giants and do work with their specialized datasets, which for the vast majority of researchers are not big data. A one-time calculation that takes 12 seconds instead of 0.1 seconds is not a problem worth optimizing.
The same could be said about CPAN and NPM. Yet Perl is basically dead, and JavaScript isn't used for any machine learning tasks as far as I'm aware. WebAssembly did help bring a niche array of audio and video codecs to the ecosystem[1][2], something I've yet to see from Python.
I don't use Python, but with what little exposure I've had to it at work, its overall sluggish performance and need to set up a dozen virtualenvs -- only to dockerize everything in cursed ways when deploying -- makes me wonder how or why people bother with it at all beyond some 5-line script. Then again, Perl used to be THE glue language in the past and mod_perl was as big as FastAPI, and Perl users would also point out how CPAN was unparalleled in breadth and depth. I wonder if Python will follow a similar fate as Perl. One can hope :-)
During my PhD I was running some simulations using poorly written Python code. Initially it would take several hours. In that time I could go to the lab, run some wetlab experiments, and the results of my simulations would be there when I got back to the office. It was only by taking Python "home" and building some of my own projects that I learned how to 1. write more Pythonic code and 2. write more performant code. Now I work for a software company.

If I'd stayed in academia I would probably still be writing quick and dirty code and not worrying about the runtime, because as a researcher there is always something else you can be doing.
* PythonCall.jl - https://github.com/cjdoris/PythonCall.jl
* NodeCall.jl - https://github.com/sunoru/NodeCall.jl
* RCall.jl - https://github.com/JuliaInterop/RCall.jl
I tend to use Julia for most things and then just dip into another language’s ecosystem if I can’t find something to do the job and it’s too complex to build myself
The use case btw. is often also very different. In most of academia, writing code is basically just a fancy mode of documentation for what is basically a glorified calculator. Readability trumps efficiency by a large margin every time.
If my script takes 3s to run and 5m to write in Python, vs 0.1s to run and 3h to write in C, I finish first with Python. I can try more ideas with Python.
Everything is just a proof of concept and no one expects anything more than that.
centers = np.array([p.center for p in ps])
norm(centers - point, axis=1)
They were just using numpy wrong. You can be slow in any language if you use the tools wrong.

Though I do feel like I see this a lot with these kinds of "we rewrote it in Rust and everything is fast" posts; comparing to a language with GC is often the scenario.

On one hand, I feel like you should just learn how to use your stuff properly. On the other hand, it is interesting to see that people who can't write fast code or use libraries properly are now writing fast code - fast code for the masses, almost, hah. Though maybe they'll just run into the same issue when they misuse a library in Rust.

But also, I don't see any problem there; I think the Python + C++/Rust idiom is actually pretty nice. I have a billion libs to choose from on either side: great usability on the Python side, and unbeatable performance on the C++ side.
The immature AoT capabilities are a huge pain to deal with when writing large code packages or even when trying to make command line applications. Things have to be recompiled each time the Julia runtime is shut down. The current strategy in the community to get around this seems to be "keep the REPL alive as long as possible" [3][4][5], but this isn't a viable option for all use cases.
Until Julia has better AoT compilation support, it's going to be very difficult to develop large scale programs with it. Version 1.9 has better support for caching compiled code, but I really wish there were better options for AoT compiling small, static, standalone executables and libraries.
[1]: https://julialang.github.io/PackageCompiler.jl/dev/
[2]: https://github.com/tshort/StaticCompiler.jl
[3]: https://discourse.julialang.org/t/ann-the-ion-command-line-f...
[4]: https://discourse.julialang.org/t/extremely-slow-execution-t...
[5]: https://discourse.julialang.org/t/extremely-slow-execution-t...
[6]: https://www.reddit.com/r/Julia/comments/ytegfk/size_of_a_hel...
- the development experience is hampered by the slow start time;
- the ecosystem is quite brittle;
- the promised performance is quite hard to actually reach; profiling only gets you so far;
- the ecosystem is pretty young, and it shows (lack of docs, small community, ...)
> what's stopping people from switching???
All of the above, plus inertia; perfect is the enemy of good enough; the alternatives are far away from Python's ecosystem & community; and performance is often not a showstopper.
It's all machine code under the hood. Everything else on top is essentially a description of more and more complex patterns of that code. So it's a no-brainer that a language that lets you describe those complex but repeating patterns in the most direct way is the most popular. When you use Python, you are effectively using a framework on top of C to describe what you need, and then if you want to do something specialized for performance, you go back to the core fundamentals and write it in C.
Also read about SIMD instructions like AVX2. They are often used under the hood when possible, but just knowing what they require can help "trigger" them, depending on which language you use. In C++, for example, the compiler really looks for opportunities to use those instructions. You can tell whether the compiler did it by checking the assembly for XMM or YMM registers (these are the names of the SIMD registers).
Numpy's tutorial for broadcasting is also a good starting point.
You can look at various tutorials to see how it works. For example: https://jakevdp.github.io/PythonDataScienceHandbook/02.05-co...
> It’s worth noting that converting parts of / everything to vectorized numpy might be possible for this toy library, but will be nearly impossible for the real library while making the code much less readable and modifiable, and the gains are going to be limited (here’s a partially vectorized version, which is faster but far from the results we are going to achieve).
def find_close_polygons(
    polygon_subset: List[Polygon], point: np.array, max_dist: float
) -> List[Polygon]:
    close_polygons = []
    for poly in polygon_subset:
        if np.linalg.norm(poly.center - point) < max_dist:
            close_polygons.append(poly)
    return close_polygons
And after replacing it with this Rust code, it is taking "an avg of 23.44ms per iteration":

use pyo3::prelude::*;
use ndarray_linalg::Norm;
use numpy::PyReadonlyArray1;

#[pyfunction]
fn find_close_polygons(
    py: Python<'_>,
    polygons: Vec<PyObject>,
    point: PyReadonlyArray1<f64>,
    max_dist: f64,
) -> PyResult<Vec<PyObject>> {
    let mut close_polygons = vec![];
    let point = point.as_array();
    for poly in polygons {
        let center = poly
            .getattr(py, "center")?
            .extract::<PyReadonlyArray1<f64>>(py)?
            .as_array()
            .to_owned();
        if (center - point).norm() < max_dist {
            close_polygons.push(poly)
        }
    }
    Ok(close_polygons)
}
Why is the Rust version 13x faster than the Python version?

from typing import List
import numpy as np

n_sides = 30
n_polygons = 10000

class Polygon:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        self.center = np.array([self.x, self.y]).mean(axis=1)

def find_close_polygons(
    polygon_subset: List[Polygon], point: np.array, max_dist: float
) -> List[Polygon]:
    close_polygons = []
    for poly in polygon_subset:
        if np.linalg.norm(poly.center - point) < max_dist:
            close_polygons.append(poly)
    return close_polygons

polygons = [Polygon(*np.random.rand(2, n_sides)) for _ in range(n_polygons)]
point = np.array([0, 0])
max_dist = 0.5

%timeit find_close_polygons(polygons, point, max_dist)
(I've made up the number of sides and number of polygons to get to the same order of magnitude of runtime; I've also pre-computed centers, as they are cached anyway in their code.) This takes about 40ms to run on my machine. If we just change the function to:

def find_close_polygons(
    polygon_subset: List[Polygon], point: np.array, max_dist: float
) -> List[Polygon]:
    centers = np.array([polygon.center for polygon in polygon_subset])
    mask = np.linalg.norm(centers - point[None], axis=1) < max_dist
    return [
        polygon
        for polygon, is_pass in zip(polygon_subset, mask)
        if is_pass
    ]

then the same computation takes 4ms on my machine. Doing a Python loop of numpy operations is a _bad_ idea... The new code hardly even takes more space than the original one.
(as someone else mentioned in the comments, you can also directly use the sum of the squares rather than `np.linalg.norm` to avoid taking square roots and save a few microseconds more, but well, we're not in that level of optimization here)
https://levelup.gitconnected.com/python-performance-showdown...
The benchmarking methodology in the link is not good. The author should use timeit() or cProfile or similar. A 0.01s difference is mostly due to fluctuation. The order of execution also matters: say you want to test functions A and B; you actually need to run A, B, B, A to see if the ordering makes a difference.
for poly in polygon_subset:
    if np.linalg.norm(poly.center - point) < max_dist:
        close_polygons.append(poly)

I don't think the entire feature set of the Python runtime is involved in this.

No need to use Golang or Rust from the start - no need for those resources until you absolutely need the speed improvement. Sounds like a dream to a lot of people who find it much easier to develop in Python.
- Some code doesn’t have obvious optimization hotspots, and is instead just generally slow everywhere.
- Most FFI boundaries incur their own performance cost. I’m not sure about Python, but I wouldn’t be surprised if FFI to rust in a hot loop is often slower than just writing the same code in Python directly. And it’s not always easy to refactor to avoid this.
- A lot of programs in languages like Python are slow because the working set size contains a lot of small objects, and the GC struggles. You can optimize code like this by moving large parts of the object graph into rust. But it can become a mess if the objects rust retains then need references to Python objects, in turn.
The optimization described in this blog post is the best case scenario for this sort of thing - the performance hotspot was clear, small, and CPU bound. When you can make optimizations like this you absolutely should. But your mileage may vary when you try this out on your own software.
The more ML I do, the more disappointed I get.
IMO/IME much better to go with a language where you don't have this dichotomy in the first place - e.g. Java or C#.
You don't have to move all the computation; just the `for` loops will help a lot.
My first attempt in Python was both prohibitively slow and more complicated than necessary, because I tried to use vectorized numpy, where possible.
Since this was only a small standalone script, I rewrote it in Julia in a day. The end result was ca. 100x faster and the code a lot cleaner, because I could just implement the core logic for one tetrahedron and then use Julia's broadcast to apply it to the array of tetrahedrons.
Anyway, Julia's long startup time often prohibits it from being used inside other languages (even though the Python/Julia interoperability is good). On the contrary the Rust/Python interop presented here seems to be pretty great. Another reason I should finally invest the time to learn Rust.
Instead of:
for p in ps: norm(p.center - point)
You should do:
centers = np.array([p.center for p in ps])
norm(centers - point, axis=1)
You’ll get your same speed up in 2 lines without introducing a new dependency
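A self-contained version of that two-line fix (the `Polygon` stand-in here is hypothetical, not the article's class):

```python
import numpy as np
from numpy.linalg import norm

class Polygon:
    def __init__(self, center):
        self.center = np.asarray(center, dtype=float)

rng = np.random.default_rng(0)
ps = [Polygon(c) for c in rng.random((1000, 2))]
point = np.array([0.5, 0.5])
max_dist = 0.25

# One vectorized norm over all centers instead of a norm() call per polygon.
centers = np.array([p.center for p in ps])
mask = norm(centers - point, axis=1) < max_dist
close = [p for p, keep in zip(ps, mask) if keep]
```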
There's also a "v1.5" version which is 6x faster, and uses "vectorizing" (doing more of the work directly in numpy). This version is much harder to optimize further.
[0] https://github.com/ohadravid/poly-match

On Google Colab:
import numpy as np
import time

vals = np.random.randn(1000000, 2)
point = np.array([.2, .3])

s = time.time()
for x in vals:
    np.linalg.norm(x - point) < 3
a = time.time() - s

s = time.time()
np.linalg.norm(vals - point, axis=1) < 3
b = time.time() - s

print(a / b)
~296x faster - significantly faster than the solution in the article.

Having to write in a subset of a language in order for it to perform decently is a big deal. Having no feedback given to the programmer when you deviate from the fast path makes it even harder to learn what the fast path is. The result is not that you get the ease of Python and the speed of C without having to understand much; the result is that you have to be a fairly deep expert in Python, understand the C bindings intimately, and learn how to avoid doing the natural thing in Python - the Python covered in all the tutorials - or you end up writing your code to run at native Python speeds without even realizing it.
It's a feasible amount of knowledge to have, it's not like it's completely insane, but it's still rather a lot.
My career just brushed this world and I'm glad I bounced off of it. It would drive me insane to walk through this landmine field every day, and then worse, have to try to guide others through it all the while they are pointing at all the "common practices" that are also written by people utterly oblivious to all this.
https://github.com/pyutils/line_profiler
You can literally see the hot spot of your code, then you can grind different algorithms or change the whole architecture to make it faster.
For example, replace short for loops with list comprehensions, vectorize all numpy operations (vectorizing only partially doesn't help the issue), use 'not any()' instead of 'all()' for booleans, etc.
Doing this for like 2 weeks, basically you can automatically recognize most bad code patterns at a glance.
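One of those patterns as a trivial before/after:

```python
from timeit import timeit

def squares_loop(n):
    out = []
    for i in range(n):
        out.append(i * i)   # attribute lookup + method call per iteration
    return out

def squares_comp(n):
    return [i * i for i in range(n)]  # same result, tighter bytecode

# identical results, so this is a pure-performance refactor
assert squares_loop(1000) == squares_comp(1000)

# timeit makes the difference visible (the comprehension is usually faster)
t_loop = timeit(lambda: squares_loop(1000), number=1000)
t_comp = timeit(lambda: squares_comp(1000), number=1000)
```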
Python is much, much easier to learn, and Rust is notoriously difficult. This obviously feeds into the inferiority complex. But -- while I do know there's a number of applications where perf is crucial -- I think it's well worth doing an ego check before moving from what's a lower friction path and gives you access to myriad developers, including ones trained in very hard disciplines.
Also: does it ever make sense to write something like a CRUD+ backend in Rust? Maybe 100X Rustaceans can do it with one hand tied behind their backs; but imagine what these ubermenschen could be achieving in Python?
For now, we're getting away with Python-only optimizations and relying on cPython improvements. Still, some processes take a few minutes and a lot of that is spent just because Python is quite slow. In that sense, yes, it happens for CRUD+ applications (which I consider our CRM+ERP to be).
Obviously the answer is "because I'm faster with Python" but then part of my brain is "well get good idk".
Well... given 2 frameworks equally pleasant to work with, why not using the one with the best performance ? (Whatever performance means... less cpu ? less memory ? better i/o ?).
As a Django user, working on problems where Django shines, I have yet to see such a solution in Rust, but that doesn't mean it won't happen one day.
What Amazon charges for, say, a t2.nano must factor in the cost of the hardware, power, cooling, and system administration, so it is technically possible to evaluate the running cost of a piece of software.
If I, as the ruthless capitalist that I am pretending to be, made the running cost of the solution an integral part of annual remuneration, through bonus calculation say, there will be a tipping point when that cost has an impact on techology strategy.
I know it sounds remote but a similar scheme has/was implemented in some investment banks, where bonus calculations introduced a vesting schedule and factored in the back office cost of maintaining the trade over its lifetime rather than being based entirely on revenue.
Let's start with a totally reasonable calculation made by a non-technical user, with absolutely no understanding of the problem at hand - and by that I mean completely unfair and arbitrary, so pretty much the real world.
The TechEmpower multiple request benchmarks tell me that Rust is the performance king. Go offers about 2/3 that performance, and python 1/3, that bit doesn't sound too unreasonable, perhaps a little unkind to Go, but I digress; given these relative performance numbers, if the running cost bonus pool is $x, the maximum amount you will receive for a python based solution is 1/3x.
At what point does the financial impact influence our technology strategy in a way that perhaps our otherwise virtuous proselytising about the need to reduce our carbon footprint doesn't?
FWIW I am continually surprised that Jon Blow, Casey Muratori, Mike Acton et al. aren't beating exactly this drum. It's not that their arguments about writing high performance software aren't compelling in their own right, but appealing to people's otherwise orthogonal concerns / passions seems to be a logical next step.
Sorry to ramble on!!
Is the problem the Oracle involvement? Or is it not that fast as advertised or problems with the ecosystem (C libraries)?
“At this point, the Python runtime is made available for experimentation and curious end-users. “
Take from Rust Algebraic types, ahead of time compilation and strong types, functional features, a borrow / escape checker that automatically turns shared data into Rc or Arc, as necessary, instead of tormenting me to rewrite performance irrelevant code;
Take from Python the simple syntax, default pass by reference of all non-numeric types, simplified string handling, unified slice and array syntax - and any other simplifying feature possible.
... and give me a fast, safe and powerful language that gets out of my way while maximizing the power of the compiler to prevent bugs. Golang was a commendable attempt, but they made '70s design decisions that condemned the language: mandatory garbage collection, nullables, (void*) masquerading as interface{} casts, under-powered compiler etc.
On a side note, python is strictly pass by value. For non-primitives, their references are passed by value.
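A quick demo of that model (sometimes called call-by-object-sharing):

```python
def mutate(lst):
    lst.append(1)   # the copied reference still points at the caller's list

def rebind(lst):
    lst = [99]      # rebinding the local name doesn't affect the caller

a = []
mutate(a)
rebind(a)
assert a == [1]   # mutation was visible, rebinding was not
```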
Thanks!
centers = get_centers(polygons)  # M x 3 array
close_idx = np.where(np.linalg.norm(centers - point, axis=1) < max_dist)[0]
close_polygons = polygons[close_idx, :]
That's one reason I prefer to use arrays for polygons, rather than abstracting them into a Python object. Fundamentally geometries are sequences of points, and with some zero-padding to account for irregular point counts, you can still keep them in a nice, efficient array representation.
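A sketch of that zero-padded layout (shapes and names made up):

```python
import numpy as np

# Polygons with irregular vertex counts, packed into one zero-padded array.
polys = [np.random.rand(n, 2) for n in (3, 5, 4)]
max_n = max(len(p) for p in polys)
packed = np.zeros((len(polys), max_n, 2))
counts = np.array([len(p) for p in polys])
for i, p in enumerate(polys):
    packed[i, :len(p)] = p

# Zero padding doesn't change the sum, so centers come from one vectorized op.
centers = packed.sum(axis=1) / counts[:, None]
assert np.allclose(centers[0], polys[0].mean(axis=0))
```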
poly = Polygon(vertices)
I would bet you can achieve just as much of a speedup with numba or Cython using this form.

Where does this nonsense come from? No. It isn't. It's just a stupid fashion. Something that should be discouraged, not encouraged by trying to make it work when it's obviously broken.
As someone who does work with researchers who do use Python a lot, I see the everyday painful experiences of people who use it. And this pain doesn't need to be there. It's just masochism. And the only real reason is that they don't know any better. The only other thing they know is Matlab, and that's even worse.
Python is just a bad language. Popular, but awful. Ironically, while researchers are supposed to be at the forefront of discovery and technology... well, they aren't. Industry outpaced research. So much so that today there are government programs to onboard researchers into more automated and more automatically verified ways to do research. And we aren't talking about making an elite force here. These programs are meant for people in research who copy data from Excel sheets one data point at a time into another spreadsheet. It's that kind of bad.
My wife happened to work in such a government center, and that's how I know about what's going on inside these programs. And it's very sad that decisions about the preferred tools for research automation are made by people who, unlike most of their peers, had some exposure to what happens in the industry, but had no deeper understanding of the reasons any particular technology ended up in any particular niche, nor any independent ability to assess the capabilities of any particular tech. It's really sad what's going on there.
You couldn't be more wrong. Python is the language going forward. There is a reason why bleeding-edge ML stuff is done through Python, as well as it being the backend of several very popular web platforms and the second most used language on GitHub behind JS - and that's solely because JS is hard-tied to the web.

I have a feeling the hate for Python just comes from paradigms that are taught in extremely poor CS curricula in schools. If you think that Python is bad because of dynamic typing, you haven't been paying attention to the direction compute is going.
Language going forward... I hadn't had such a laugh in a while... It was dead in the '70s, long before it was born. Are you under the influence or something?
> There is a reason why bleeding edge ML stuff is done through Python
No, there isn't, and bleeding-edge ML isn't done in Python. It's done mostly in C++ and more seldom in C - which are also garbage languages, but that's just the reality we live in. Python is just a tiny fraction of what's being done, the tip of the tip of the iceberg.
And, no Python is not the backend of popular platforms, it is, again, a tiny little bit of what's going on in those "popular platforms", and, unlike you, I actually worked for those "popular platforms", so, would know how that is.
But, even, imagine that in your fairy-tale world Python is somehow so super-important and successful? Didn't I already explain how this is possible while still being a garbage language? -- I bet I did. You just rushed to spit your despair and frustration at me, because I offended something that you like... well, you think that you like, but really, you just don't know any better, just like other people using Python :( And that's really the sad part. You lack critical thinking to be really able to tell the quality of your tools. You substituted the ability to judge the quality of something by looking at what the majority is doing and following the herd.
And, Python wasn't yet taught in school in my day; the language of choice for intro to CS was Visual Basic :| It's also garbage... and the whole idea that the intro to CS has to be done by learning a random fashionable language of the day is just dumb. It's a misunderstanding of SICP, which to someone who didn't understand what the book is about looks as if the author tried to teach the students Scheme; later intro to CS classes were modeled on this book, but replaced Scheme with another language, completely forgetting that Scheme was used in the first place as a language with "little syntax", to avoid spending time on learning the language.
You are again confused when you use the expression "dynamic typing"; you don't even know what that is. But you will never accept it because, again, you don't have the ability to think critically; you follow the herd. You read a "definition" somewhere that says Python is a "dynamically typed language", and you believed this nonsense, even though you never really thought about what that might mean...
One thing we agree on though: the direction the compute is going is to utilize more low-skill programmers, and that means, beside other things, following fads more than doing any sensible work. Python is the current fad, and so, for a while, we are stuck with this garbage. And, the way things are going, at best, I can hope for another equally idiotic language to come to replace it. But, probably not very soon.
But specific fields often have their own, bespoke solutions. I've only ever dealt with math, but it has plenty of its own niche languages that are much better than, say, Sage. My personal choice was Maxima, but that's because I like Common Lisp.
Furthermore, it's just the situation today. It doesn't mean that this is what it has to be tomorrow. Any of the languages I used in this domain have their issues, and could still be improved. We are nowhere near a place where it's hard to imagine something better than what we have. So, I'd say, if you really want a very good language, you might as well start building one now -- you have a very good chance yours will be the best one so far.
(e.g. it doesn't need to be "as fast as possible", just fast enough to no longer be a workflow bottleneck)
I literally clicked in to read the article to see if they'd mention this:) But... unless I missed it, there wasn't really an answer? I thought numpy does do the heavy lifting in native code, so why is this faster? Does this version just push more of the logic into native code than numpy did?
Just that small change already gives 10x faster performance without ever leaving python/numpy land.
Scipy should have already implemented such a thing. Scikit-learn also - k-nearest-neighbors search is doing exactly this kind of work.
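Right - SciPy's cKDTree already handles this kind of radius query (a sketch, assuming the centers fit in one array):

```python
import numpy as np
from scipy.spatial import cKDTree

centers = np.random.rand(10000, 2)
point = np.array([0.2, 0.3])
max_dist = 0.1

tree = cKDTree(centers)                        # build once, query many times
idx = tree.query_ball_point(point, r=max_dist)  # indices within max_dist

# matches the brute-force scan
brute = np.where(np.linalg.norm(centers - point, axis=1) <= max_dist)[0]
assert sorted(idx) == list(brute)
```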
>It’s worth noting that converting parts of / everything to vectorized numpy might be possible for this toy library, but will be nearly impossible for the real library while making the code much less readable and modifiable...
diff --git a/poly_match_v1.py b/poly_match_v1.py
index 675c88a..4293a46 100644
--- a/poly_match_v1.py
+++ b/poly_match_v1.py
@@ -1,4 +1,5 @@
 from functools import cached_property
+from itertools import compress
 from typing import List, Tuple
 import numpy as np
 from dataclasses import dataclass
@@ -56,11 +57,8 @@ def generate_example() -> Tuple[List[Polygon], List[np.array]]:
 def find_close_polygons(
     polygon_subset: List[Polygon], point: np.array, max_dist: float
 ) -> List[Polygon]:
-    close_polygons = []
-    for poly in polygon_subset:
-        if np.linalg.norm(poly.center - point) < max_dist:
-            close_polygons.append(poly)
-
+    dist = np.linalg.norm([poly.center for poly in polygon_subset] - point, axis=1)
+    close_polygons = list(compress(polygon_subset, dist < max_dist))
     return close_polygons
10x faster than the original, without resorting to native code, and without a substantial increase in the complexity of the code base.

From a maintenance view, I much prefer folks write vectorized data frames vs numpy or low level bindings, but that comes from having lived with the alternative for a lot longer. All of our exceptions are pretty much slotted for deletion. (Our fast path is single or multi-GPU dataframes in Python.) Here's to hoping that one day we'll have dependent types in mypy!
That was a big part of my curiosity, because I'm not quite a good enough programmer to know whether the problem is just that the original code is bad or does something super inefficient, such that rewriting it in any language would let you make massive improvements.
Piqued my curiosity as my background almost lines up, and I'm interested in that kind of role... But I live near one of their offices only hiring for skill sets I don't really have.
This sort of hybrid is super common, you typically spend 90%+ of your time in computationally intensive problems in a very small subset of your code, typically the innermost loops. Optimizing those will have very good pay-off.
Traditionally we'd do this with the high level stuff in one language and then assembly for the performance critical parts; these days it is more likely a combination of a scripting language for the high level part and a compiled language for the low level parts (C, Rust, whatever). Java and such are less suitable for such optimization purposes, both because they come with a huge runtime and because they are hard to interface with other languages unless those happen to use the same underlying VM - but then there usually isn't much performance gain.
Another nice way to optimize computationally intensive code is by finding out if the code is suitable for adaptation to the GPU, which can give you many orders of magnitude speed improvement in some cases.
I wasn't able to see the numba comparison. Anyone know how much worse it was?
If you’re going to benchmark scripts or executables, use hyperfine.
"It is big and complex and very business critical and highly algorithmic, so that would take ~months of work, ..."
That's doing spatial data processing by exhaustive search, which is inherently slow. There are better algorithms. If the number of items to be searched is large, the spatial indices of MySQL could help.
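A toy uniform-grid index in pure Python to show the idea (a KD-tree or R-tree is what you'd actually use):

```python
import numpy as np
from collections import defaultdict

def build_grid(points, cell):
    # bucket each point index by its (integer) grid cell
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[(int(p[0] // cell), int(p[1] // cell))].append(i)
    return grid

def query_radius(points, grid, cell, q, r):
    # only inspect cells that can intersect the query disc
    cx, cy = int(q[0] // cell), int(q[1] // cell)
    reach = int(np.ceil(r / cell))
    hits = []
    for gx in range(cx - reach, cx + reach + 1):
        for gy in range(cy - reach, cy + reach + 1):
            for i in grid.get((gx, gy), ()):
                d = points[i] - q
                if d @ d <= r * r:
                    hits.append(i)
    return hits

pts = np.random.rand(5000, 2)
grid = build_grid(pts, cell=0.1)
found = query_radius(pts, grid, 0.1, np.array([0.5, 0.5]), 0.1)
```

Instead of scanning all 5000 points, each query only touches a handful of cells.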
You need a more experienced Python engineer.
Hasn't he heard of ctypes? You can wrap C structs as Python objects, and have been able to since forever.
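e.g. a minimal ctypes sketch of an array-of-structs (names made up):

```python
import ctypes

class Point(ctypes.Structure):
    _fields_ = [("x", ctypes.c_double), ("y", ctypes.c_double)]

# A contiguous C array of structs, usable from Python and passable to C code.
points = (Point * 3)()
points[1] = Point(1.5, 2.5)

assert points[1].x == 1.5
assert ctypes.sizeof(points) == 3 * ctypes.sizeof(Point)
```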
I assume, haven’t really messed with numpy for anything but I can’t imagine it wouldn’t work that way.