And yet here we are again. Shouldn't this be part of some timing test suite at CPU vendors by now?
During dynamic linking, glibc picks a memcpy implementation which seems most appropriate for the current machine. We have about 13 different implementations just for x86-64. We could add another one for current(ish) AMD CPUs, select a different existing implementation for them, or change the default for a configurable cutover point in a parameterized implementation.
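You can observe the result of that dispatch from Python with ctypes. This is only a sketch: the IFUNC selection itself is glibc-internal, but the `memcpy` symbol you load from libc already points at whichever of those implementations was chosen for the current CPU.

```python
import ctypes
import ctypes.util

# Load libc; the memcpy you get here has already been resolved (via IFUNC
# on glibc) to the variant selected for this machine at dynamic-link time.
libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
libc.memcpy.restype = ctypes.c_void_p

src = ctypes.create_string_buffer(b"dispatched!", 12)
dst = ctypes.create_string_buffer(12)
libc.memcpy(dst, src, ctypes.c_size_t(12))
print(dst.value)  # b'dispatched!'
```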
More broadly compatible routines will still work on newer CPUs; they just won't yield the best performance.
It still would be nice if such central routines could just be compiled to the REP-prefixed instructions and would deliver (near-)optimal performance so we could stop worrying about that particular part.
I'm not surprised the conclusion had something to do with the way that native code works. Admittedly I was surprised at the specific answer - still a very interesting article despite the confusing start.
Edit: The conclusion also took me a couple of attempts to parse. There's a heading "C is slower than Python with specified offset". To me, as a native English speaker, this reads as "C is slower (than Python) with specified offset" i.e. it sounds like they took the C code, specified the same offset as Python, and then it's still slower than Python. But it's the opposite: once the offset from Python was also specified in the C code, the C code was then faster. Still very interesting once I got what they were saying though.
It's surprising that something as simple as reading a file is slower in the Rust standard library than in the Python standard library. Even knowing that a Python standard library call like this is written in C, you'd still expect the Rust standard library call to be of a similar speed; so you'd expect either that you're using it wrong, or that the Rust standard library has some weird behavior.
In this case, it turns out that neither were the case; there's just a weird hardware performance cliff based on the exact alignment of an allocation on particular hardware.
So, yeah, I'd expect a filesystem read to be pretty well optimized in Python, but I'd expect the same in Rust, so it's surprising that the latter was so much slower, and especially surprising that it turned out to be hardware and allocator dependent.
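A minimal sketch of the comparison under discussion, timing a whole-file read in Python. The article's benchmark is more careful than this; it just shows the shape of the test (the scratch file and 32 MiB size are arbitrary choices here):

```python
import os
import tempfile
import time

# Create a scratch file to read back.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * (32 << 20))  # 32 MiB
    path = f.name

# Time the whole-file read, the operation compared across languages.
start = time.perf_counter()
with open(path, "rb") as fh:
    data = fh.read()
elapsed = time.perf_counter() - start

os.unlink(path)
print(f"read {len(data)} bytes in {elapsed:.4f}s")
```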
If I write Python and my code is fast, to me that sounds like Python is fast, I couldn't care less whether it's because the implementation is in another language or for some other reason.
When you see an interpreted language faster than a compiled one, it's worth looking at why, because most of the time it's because there's some hidden issue causing the other to be slow (which could just be a different and much worse implementation).
Put another way, you can do a lot to make a Honda Civic very fast, but when you hear one goes up against a Ferrari and wins your first thoughts should be about what the test was, how the Civic was modified, and if the Ferrari had problems or the test wasn't to its strengths at all. If you just think "yeah, I love Civics, that's awesome" then you're not thinking critically enough about it.
For me, coding is almost exclusively using Python libraries like numpy to call out to other languages like C or Fortran. To me, it feels silly to say I'm not coding in Python.
On the other hand, if you're writing those libraries, coding to you is mostly writing Fortran and C optimizations. It probably feels silly to say you're coding in Python just because that's where your code is called from.
It's completely fair to say that's not python because it isn't - any language out there can FFI to C and it has the same problems mentioned above.
Pretty much any language can wrap C/Rust code.
Why does it matter?
1. Having to split your code across 2 languages via FFI is a huge pain.
2. You are still writing some Python. There's plenty of code that is pure Python. That code is slow.
Also, when we talk about "faster" and "slower," the order of magnitude isn't clear.
Maybe an analysis of actual code execution would shed more light than a simplistic explanation that the Python interpreter is written in C. I don't think the BASIC interpreter in my first computer was written in BASIC.
What's there to understand? When it's fast it's not really Python, it's C. C is fast. Python can call out to C. You don't have to care that the implementation is in another language, but it is.
99% of my use cases are easily, maintainably solved with good, modern Python. The Python execution is almost never the bottleneck in my workflows. It’s disk or network I/O.
I’m not against building better languages and ecosystems, and compiled languages are clearly appropriate/required in many workflows, but the language parochialism gets old. I just want to build shit that works and get stuff done.
Now why would you expect that?
What happened to OP is pure chance. CPython's C code doesn't even care about const-consistency. It's full of dynamic memory allocations and a bunch of helper / convenience calls... Even stuff like arithmetic does dynamic memory allocation...
Normally, you don't expect CPython to perform well, not if you have any experience working with it. Whenever you want to improve performance you want to sidestep all the functionality available there.
Also, while Python doesn't have a standard library, since it doesn't have a standard... the library that's distributed with it is mostly written in Python. Of course, some of it comes written in C, but there's also a sizable fraction of that C code that's essentially Python code translated mechanically into C (a good example of this is Python's binary search implementation which was originally written in Python, and later translated into C using Python's C API).
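The bisect module is a concrete example of that pattern: the pure-Python and C versions are interchangeable, and (if I remember the stdlib layout right) `Lib/bisect.py` attempts `from _bisect import *` and silently keeps the Python definitions when the C extension is unavailable.

```python
import bisect

# Whichever implementation is active (C extension or pure Python),
# the behavior is identical: insert 4 into a sorted list.
a = [1, 3, 5, 7]
bisect.insort(a, 4)
print(a)  # [1, 3, 4, 5, 7]
```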
What one would expect is that functionality that is simple to map to operating system functionality has a relatively thin wrapper. I.e. reading files wouldn't require much in terms of binding code because, essentially, it goes straight into the system interface.
I have, several, and it's far from trivial.
The basics are seriously optimized for typical use cases, take a look at the source code for the dict type.
On the other hand… so what? It’s kind of fun.
However, I am more interested in / concerned about another part: how the issue is reported and recorded, and how the communications are handled.
Reporting is done over Discord, a proprietary environment that is not indexed or searchable, and that will not be archived.
Communications and deliberations are done over Discord and Telegram, the latter of which is probably worse than Discord in this context.
This blog post and the GitHub repository are the lingering remains of them. If Xuanwo had not blogged this, it would be lost to the timeline.
Isn't this fascinating?
You can provide a public log of them not because they are not proprietary, but because they have an API that allows logging. Telegram also has such an API, and FWIW our discussion group does have a searchable log that you can access here: https://luoxu-web.vercel.app/#g=1264662201 It is not publicly indexable more out of privacy concerns, again not because the platform is proprietary.
The only thing that makes this bug, and the process of debugging it, visible is this blog post.
Another point is that I don't think IRC or any instant-messaging app is the correct place for these kinds of discussions. Unless important points are logged to some bug-reporting tool, or perhaps a mailing list, or to a blog post like this one, they are useless for historical purposes.
That's why I don't accept the response "but there's Discord now" whenever I moan about USENET's demise. Back in the days before it, every post was nicely searchable by DejaNews (later Google).
We need to get back to open standards for important communications (e.g. all open source projects that are important to the Internet/WWW stack and core programming and libraries).
The accepted fix would not be trivial for anyone not already experienced with the kernel. More importantly, it isn't obvious what the right way to enable the workaround is. The best way is probably to measure at boot time; otherwise, how do you know which models and steppings are affected?
If the vendor won't patch it, then a workaround is the next best thing. There shouldn't be many places to patch - that's why all the copying code is in just a handful of functions.
https://internals.rust-lang.org/t/jemalloc-was-just-removed-...
I am curious if this is something that everyone can do to get free performance, or if there are caveats. Can C codebases benefit from this too? Is this performance that is simply being left on the table currently?
* https://github.com/jemalloc/jemalloc/issues/387#issuecomment...
* https://gitlab.haskell.org/ghc/ghc/-/issues/17411
Apparently now `jemalloc` will call `MADV_DONTNEED` 10 seconds after `MADV_FREE`: https://github.com/JuliaLang/julia/issues/51086#issuecomment...
So while this "fixes" the issue, it'll introduce a confusing time delay between you freeing the memory and you observing that in `htop`.
But according to https://jemalloc.net/jemalloc.3.html you can set `opt.muzzy_decay_ms = 0` to remove the delay.
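In practice that `opt.muzzy_decay_ms` knob is set through the MALLOC_CONF environment variable that jemalloc reads at startup. A sketch of wiring it up for a child process; `./my_program` is a placeholder, not a real binary:

```python
import os

# jemalloc reads tunables from MALLOC_CONF; "muzzy_decay_ms:0" corresponds
# to the opt.muzzy_decay_ms = 0 setting mentioned above, making MADV_FREE'd
# pages get purged immediately instead of after a delay.
env = dict(os.environ, MALLOC_CONF="muzzy_decay_ms:0")
# import subprocess; subprocess.run(["./my_program"], env=env)  # placeholder
print(env["MALLOC_CONF"])
```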
Still, the musl author has some reservations against making `jemalloc` the default:
https://www.openwall.com/lists/musl/2018/04/23/2
> It's got serious bloat problems, problems with undermining ASLR, and is optimized pretty much only for being as fast as possible without caring how much memory you use.
With the above-mentioned tunables, this should be mitigated to some extent, but the general "theme" (focusing on e.g. performance vs memory usage) will likely still mean "it's a tradeoff" or "it's no tradeoff, but only if you set tunables to what you need".
Example of this: https://github.com/prestodb/presto/issues/8993
And this is not a one-off: https://hackernoon.com/reducing-rails-memory-use-on-amazon-l... https://engineering.linkedin.com/blog/2021/taming-memory-fra...
jemalloc also has extensive observability / debugging capabilities, which can provide a useful global view of the system, it's been used to debug memleaks in JNI-bridge code: https://www.evanjones.ca/java-native-leak-bug.html https://technology.blog.gov.uk/2015/12/11/using-jemalloc-to-...
If you want to gauge whether your system is memory-limited look at the PSI metrics instead.
Rust used to use jemalloc by default but switched as people found this surprising as the default.
It turns out jemalloc isn't always best for every workload and use case. While the system allocator is often far from perfect, it at least has been widely tested as a general-purpose allocator.
does tend to use more ram tho
> With the new Zen3 CPUs, Fast Short REP MOV (FSRM) is finally added to AMD’s CPU functions analog to Intel’s X86_FEATURE_FSRM. Intel had already introduced this in 2017 with the Ice Lake Client microarchitecture. But now AMD is obviously using this feature to increase the performance of REP MOVSB for short and very short operations. This improvement applies to Intel for string lengths between 1 and 128 bytes and one can assume that AMD’s implementation will look the same for compatibility reasons.
https://www.igorslab.de/en/cracks-on-the-core-3-yet-the-5-gh...
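Whether a given machine advertises the feature can be checked from userspace; on Linux the kernel exposes the CPUID feature bits, including "fsrm", in /proc/cpuinfo. A Linux-only sketch:

```python
def cpu_flags():
    """Return the x86 feature flags the kernel reports, e.g. {"fsrm", ...}.
    Returns an empty set on non-Linux or non-x86 systems."""
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

print("has FSRM:", "fsrm" in cpu_flags())
```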
Note that for rep stores to be better, they must overcome the cost of the initial latency and then catch up to the 32-byte vector copies, which, yes, generally have not-as-good perf vs DRAM speed, but they aren't that bad either. Thus for small copies... just don't use string stores.
All this is not even considering non-temporal loads/stores; many larger copies would see better perf by not trashing the L2 cache, since the destination or source is often not inspected right after. String stores don't have a non-temporal option, so this has to be done with vectors.
    with open('myfile') as f:
        data = f.read()

I'm not much of a C programmer myself, but I at least reported part of the issue to Python: https://bugs.python.org/issue45944

This is the fastest way to read a file in Python that I've found, using only 3-4 syscalls (though os.fstat() doesn't work for some special kernel files like those in /proc/ and /dev/):

    def read_file(path: str, size=-1) -> bytes:
        fd = os.open(path, os.O_RDONLY)
        try:
            if size == -1:
                size = os.fstat(fd).st_size
            return os.read(fd, size)
        finally:
            os.close(fd)

Maybe I don't need to query the file size at all?
Having a hook to get people to want to read the article is reasonable in my opinion; after all, if you could fit every detail in the size of a headline, you wouldn't need an article at all! Clickbait inverts this by _only_ having enough substance that you could get all the info in the headline, but instead it leaves out the one detail that's interesting and then pads it with fluff that you're forced to click and read through if you want the answer.
> In conclusion, the issue isn't software-related. Python outperforms C/Rust due to an AMD CPU bug.
Slack is allocating 1132 GB of virtual memory on my laptop right now. I don't know if they are using mmap but that's 1100 GB more than the physical memory.
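The distinction shows up directly in /proc on Linux: VmSize is reserved virtual address space (the 1132 GB number), while VmRSS is what actually sits in physical memory. A Linux-only sketch for inspecting your own process:

```python
import os

def vm_stats():
    # Parse VmSize (virtual address space) and VmRSS (resident memory),
    # both in kB, from the kernel's per-process status file.
    stats = {}
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(("VmSize:", "VmRSS:")):
                key, value = line.split()[:2]
                stats[key.rstrip(":")] = int(value)
    return stats

if os.path.exists("/proc/self/status"):
    print(vm_stats())
```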
Seems it's not without perils on Windows:
"In an ideal world, that would be all we have to say about the new solution. But for Windows users, there's a special quirk. On most operating systems, we can use a special flag to signal that we don't really care if the system has 32 GiB of real memory. Unfortunately, Windows has no convenient way to do this. Dolphin still works fine on Windows computers that have less than 32 GiB of RAM, but if Windows is set to automatically manage the size of the page file, which is the case by default, starting any game in Dolphin will cause the page file to balloon in size. Dolphin isn't actually writing to all this newly allocated space in the page file, so there are no concerns about performance or disk lifetime. Also, Windows won't try to grow the page file beyond the amount of available disk space, and the page file shrinks back to its previous size when you close Dolphin, so for the most part there are no real consequences... "
I'm impressed by your perseverance, how you follow through with your investigation to the lowest (hardware) level.
If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll follow the rules in the future. They're here: https://news.ycombinator.com/newsguidelines.html.
Rather, the performance issue only occurs when using `rep movsb` on AMD CPUs with certain page/data alignment.
Pymalloc just happens to be using page/data alignment that makes `rep movsb` happy while Rust's default allocator is using alignments that just happen to make `rep movsb` sad.
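The sensitivity to offsets can be probed without leaving Python, via ctypes.memmove (which calls into libc's copy routine). This is only a sketch: whether the libc copy actually takes the `rep movsb` path depends on the CPU and the libc build, but on the affected AMD parts certain offsets fall off the performance cliff while others stay flat:

```python
import ctypes
import time

def time_copy(offset: int, size: int = 64 * 1024, iters: int = 500) -> float:
    """Time a bulk copy with source and destination at a chosen byte offset
    inside over-sized buffers, so only the alignment varies between runs."""
    src = ctypes.create_string_buffer(size + 64)
    dst = ctypes.create_string_buffer(size + 64)
    s = ctypes.addressof(src) + offset
    d = ctypes.addressof(dst) + offset
    start = time.perf_counter()
    for _ in range(iters):
        ctypes.memmove(d, s, size)
    return time.perf_counter() - start

for off in (0, 8, 16, 32):
    print(off, round(time_copy(off), 4))
```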
This has nothing to do with Python or Rust.
>...
>Python features three memory domains, each representing different allocation strategies and optimized for various purposes.
>...
>Rust is slower than Python only on my machine.
If one library performs wildly better than the other in the same test, on the same hardware, how can that not be a software-related problem? Sounds like a contradiction.
Maybe it should be considered a coding issue and/or a missing feature? IMHO it would be expected that Rust's std library performs well without making all users circumvent the issue manually.
The article is well investigated, so I assume the author just wanted to show that the problem exists without creating controversy, because otherwise I can't understand it.
But since the Python runtime is written in C, the issue can't be Python vs C.
Not too long ago I read in Intel's optimization guidelines that rep was now faster again and should be used.
Seems most of these things needs to be benchmarked on the CPU, as they change "all the time". I've sped up plenty of code by just replacing hand crafted assembly with high-level functional equivalent code.
Of course so-slow-it's-bad is different, however a runtime-determined implementation choice would avoid that as well.
Whenever you're writing performance-critical software, you need to consider the relevant combinations of hardware + software + workload + configuration.
Sometimes a problem can be created or fixed by adjusting any one / some subset of those details.
Maybe using an alternative allocator only solves the problem by accident and there's another way to solve it intentionally; I don't yet fully understand the problem. My point is that using a different allocator by default was already tried.
I've honestly never worked in a domain where binary size ever really mattered beyond maybe invoking `strip` on a binary before deploying it, so I try to keep an open mind. That said, this has always been a topic of discussion around Rust[0], and while I obviously don't have anything against binary sizes being smaller, bugs like this do make me wonder about huge changes like switching the default allocator where we can't really test all of the potential side effects; next time, the unintended consequences might not be worth the tradeoff.
[0]: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...