And yet here we are again. Shouldn't this be part of some timing test suite at CPU vendors by now?
During dynamic linking, glibc picks a memcpy implementation which seems most appropriate for the current machine. We have about 13 different implementations just for x86-64. We could add another one for current(ish) AMD CPUs, select a different existing implementation for them, or change the default for a configurable cutover point in a parameterized implementation.
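You can observe the result of that dispatch from Python with ctypes. This is only a sketch: the IFUNC selection itself is glibc-internal, but the `memcpy` symbol you load from libc already points at whichever of those implementations was chosen for the current CPU.

```python
import ctypes
import ctypes.util

# Load libc; the memcpy you get here has already been resolved (via IFUNC
# on glibc) to the variant selected for this machine at dynamic-link time.
libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
libc.memcpy.restype = ctypes.c_void_p

src = ctypes.create_string_buffer(b"dispatched!", 12)
dst = ctypes.create_string_buffer(12)
libc.memcpy(dst, src, ctypes.c_size_t(12))
print(dst.value)  # b'dispatched!'
```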
More broadly compatible routines will still work on newer CPUs; they just won't yield the best performance.
It still would be nice if such central routines could just be compiled to the REP-prefixed instructions and would deliver (near-)optimal performance so we could stop worrying about that particular part.
I'm not surprised the conclusion had something to do with the way that native code works. Admittedly I was surprised at the specific answer - still a very interesting article despite the confusing start.
Edit: The conclusion also took me a couple of attempts to parse. There's a heading "C is slower than Python with specified offset". To me, as a native English speaker, this reads as "C is slower (than Python) with specified offset" i.e. it sounds like they took the C code, specified the same offset as Python, and then it's still slower than Python. But it's the opposite: once the offset from Python was also specified in the C code, the C code was then faster. Still very interesting once I got what they were saying though.
It's surprising that something as simple as reading a file is slower in the Rust standard library than in the Python standard library. Even knowing that a Python standard library call like this is written in C, you'd still expect the Rust standard library call to be of a similar speed; so you'd expect either that you're using it wrong, or that the Rust standard library has some weird behavior.
In this case, it turns out that neither were the case; there's just a weird hardware performance cliff based on the exact alignment of an allocation on particular hardware.
So, yeah, I'd expect a filesystem read to be pretty well optimized in Python, but I'd expect the same in Rust, so it's surprising that the latter was so much slower, and especially surprising that it turned out to be hardware and allocator dependent.
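A minimal sketch of the comparison under discussion, timing a whole-file read in Python. The article's benchmark is more careful than this; it just shows the shape of the test (the scratch file and 32 MiB size are arbitrary choices here):

```python
import os
import tempfile
import time

# Create a scratch file to read back.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * (32 << 20))  # 32 MiB
    path = f.name

# Time the whole-file read, the operation compared across languages.
start = time.perf_counter()
with open(path, "rb") as fh:
    data = fh.read()
elapsed = time.perf_counter() - start

os.unlink(path)
print(f"read {len(data)} bytes in {elapsed:.4f}s")
```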
If I write Python and my code is fast, to me that sounds like Python is fast, I couldn't care less whether it's because the implementation is in another language or for some other reason.
When you see an interpreted language faster than a compiled one, it's worth looking at why, because most of the time it's because there's some hidden issue causing the other to be slow (which could just be a different and much worse implementation).
Put another way, you can do a lot to make a Honda Civic very fast, but when you hear one goes up against a Ferrari and wins your first thoughts should be about what the test was, how the Civic was modified, and if the Ferrari had problems or the test wasn't to its strengths at all. If you just think "yeah, I love Civics, that's awesome" then you're not thinking critically enough about it.
For me, coding is almost exclusively using Python libraries like numpy to call out to other languages like C or Fortran. To me, it feels silly to say I'm not coding in Python.
On the other hand, if you're writing those libraries, coding to you is mostly writing Fortran and C optimizations. It probably feels silly to say you're coding in Python just because that's where your code is called from.
It's completely fair to say that's not python because it isn't - any language out there can FFI to C and it has the same problems mentioned above.
Pretty much any language can wrap C/Rust code.
Why does it matter?
1. Having to split your code across 2 languages via FFI is a huge pain.
2. You are still writing some Python. There's plenty of code that is pure Python. That code is slow.
Also, when we talk about "faster" and "slower," the order of magnitude isn't clear.
Maybe an analysis of actual code execution would shed more light than a simplistic explanation that the Python interpreter is written in C. I don't think the BASIC interpreter in my first computer was written in BASIC.
What's there to understand? When it's fast it's not really Python, it's C. C is fast. Python can call out to C. You don't have to care that the implementation is in another language, but it is.
99% of my use cases are easily, maintainably solved with good, modern Python. The Python execution is almost never the bottleneck in my workflows. It’s disk or network I/O.
I’m not against building better languages and ecosystems, and compiled languages are clearly appropriate/required in many workflows, but the language parochialism gets old. I just want to build shit that works and get stuff done.
Now why would you expect that?
What happened to OP is pure chance. CPython's C code doesn't even care about const-consistency. It's full of dynamic memory allocations and a bunch of helper / convenience calls... Even stuff like arithmetic does dynamic memory allocation...
Normally, you don't expect CPython to perform well, not if you have any experience working with it. Whenever you want to improve performance you want to sidestep all the functionality available there.
Also, while Python doesn't have a standard library, since it doesn't have a standard... the library that's distributed with it is mostly written in Python. Of course, some of it comes written in C, but there's also a sizable fraction of that C code that's essentially Python code translated mechanically into C (a good example of this is Python's binary search implementation which was originally written in Python, and later translated into C using Python's C API).
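The bisect module is a concrete example of that pattern: the pure-Python and C versions are interchangeable, and (if I remember the stdlib layout right) `Lib/bisect.py` attempts `from _bisect import *` and silently keeps the Python definitions when the C extension is unavailable.

```python
import bisect

# Whichever implementation is active (C extension or pure Python),
# the behavior is identical: insert 4 into a sorted list.
a = [1, 3, 5, 7]
bisect.insort(a, 4)
print(a)  # [1, 3, 4, 5, 7]
```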
What one would expect is that functionality that is simple to map to operating system functionality has a relatively thin wrapper. I.e. reading files wouldn't require much in terms of binding code because, essentially, it goes straight into the system interface.
I have, several, and it's far from trivial.
The basics are seriously optimized for typical use cases, take a look at the source code for the dict type.
On the other hand… so what? It’s kind of fun.
However, I am more interested in / concerned about another part: how the issue is reported and recorded, and how the communications are handled.
Reporting is done over Discord, a proprietary environment that is not indexed or searchable, and that will not be archived.
Communications and deliberations are done over Discord and Telegram, the latter of which is probably worse than Discord in this context.
This blog post and the GitHub repository are the lingering remains of them. If Xuanwo had not blogged this, it would be lost to the timeline.
Isn't this fascinating?
You can provide a public log of them not because they are not proprietary, but because they have an API that allows logging. Telegram also has such an API, and FWIW our discussion group does have a searchable log that you can access here: https://luoxu-web.vercel.app/#g=1264662201 It is not publicly indexable more out of privacy concerns, again not because the platform is proprietary.
The only thing that makes this bug, and the process of debugging it, visible is this blog post.
Another point is that I don't think IRC or any instant-messaging app is the correct place for these kinds of discussions. Unless important points are logged to some bug-reporting tool, or perhaps a mailing list, or to a blog post like this one, they are useless for historical purposes.
That's why I don't accept the response "but there's Discord now" whenever I moan about USENET's demise. Back in the days before it, every post was nicely searchable by DejaNews (later Google).
We need to get back to open standards for important communications (e.g. all open source projects that are important to the Internet/WWW stack and core programming and libraries).
The accepted fix would not be trivial for anyone not already experienced with the kernel. More importantly, it isn't obvious what the right way to enable the workaround is. The best way is probably to measure at boot time; otherwise, how do you know which models and steppings are affected?
If the vendor won't patch it, then a workaround is the next best thing. There shouldn't be many places to patch - that's why all the copying code is in just a handful of functions.
https://internals.rust-lang.org/t/jemalloc-was-just-removed-...
I am curious if this is something that everyone can do to get free performance, or if there are caveats. Can C codebases benefit from this too? Is this performance that is simply being left on the table currently?
* https://github.com/jemalloc/jemalloc/issues/387#issuecomment...
* https://gitlab.haskell.org/ghc/ghc/-/issues/17411
Apparently now `jemalloc` will call `MADV_DONTNEED` 10 seconds after `MADV_FREE`: https://github.com/JuliaLang/julia/issues/51086#issuecomment...
So while this "fixes" the issue, it'll introduce a confusing time delay between you freeing the memory and you observing that in `htop`.
But according to https://jemalloc.net/jemalloc.3.html you can set `opt.muzzy_decay_ms = 0` to remove the delay.
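In practice that `opt.muzzy_decay_ms` knob is set through the MALLOC_CONF environment variable that jemalloc reads at startup. A sketch of wiring it up for a child process; `./my_program` is a placeholder, not a real binary:

```python
import os

# jemalloc reads tunables from MALLOC_CONF; "muzzy_decay_ms:0" corresponds
# to the opt.muzzy_decay_ms = 0 setting mentioned above, making MADV_FREE'd
# pages get purged immediately instead of after a delay.
env = dict(os.environ, MALLOC_CONF="muzzy_decay_ms:0")
# import subprocess; subprocess.run(["./my_program"], env=env)  # placeholder
print(env["MALLOC_CONF"])
```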
Still, the musl author has some reservations against making `jemalloc` the default:
https://www.openwall.com/lists/musl/2018/04/23/2
> It's got serious bloat problems, problems with undermining ASLR, and is optimized pretty much only for being as fast as possible without caring how much memory you use.
With the above-mentioned tunables, this should be mitigated to some extent, but the general "theme" (focusing on e.g. performance vs memory usage) will likely still mean "it's a tradeoff" or "it's no tradeoff, but only if you set tunables to what you need".
Example of this: https://github.com/prestodb/presto/issues/8993
And this is not a one-off: https://hackernoon.com/reducing-rails-memory-use-on-amazon-l... https://engineering.linkedin.com/blog/2021/taming-memory-fra...
jemalloc also has extensive observability / debugging capabilities, which can provide a useful global view of the system, it's been used to debug memleaks in JNI-bridge code: https://www.evanjones.ca/java-native-leak-bug.html https://technology.blog.gov.uk/2015/12/11/using-jemalloc-to-...
If you want to gauge whether your system is memory-limited look at the PSI metrics instead.
Rust used to use jemalloc by default but switched as people found this surprising as the default.
It turns out jemalloc isn't always best for every workload and use case. While the system allocator is often far from perfect, it at least has been widely tested as a general-purpose allocator.
does tend to use more ram tho
> With the new Zen3 CPUs, Fast Short REP MOV (FSRM) is finally added to AMD’s CPU functions analog to Intel’s X86_FEATURE_FSRM. Intel had already introduced this in 2017 with the Ice Lake Client microarchitecture. But now AMD is obviously using this feature to increase the performance of REP MOVSB for short and very short operations. This improvement applies to Intel for string lengths between 1 and 128 bytes and one can assume that AMD’s implementation will look the same for compatibility reasons.
https://www.igorslab.de/en/cracks-on-the-core-3-yet-the-5-gh...
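Whether a given machine advertises the feature can be checked from userspace; on Linux the kernel exposes the CPUID feature bits, including "fsrm", in /proc/cpuinfo. A Linux-only sketch:

```python
def cpu_flags():
    """Return the x86 feature flags the kernel reports, e.g. {"fsrm", ...}.
    Returns an empty set on non-Linux or non-x86 systems."""
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

print("has FSRM:", "fsrm" in cpu_flags())
```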
Note that for rep stores to be better, they must overcome the cost of the initial latency and then catch up to the 32-byte vector copies, which, yes, generally have not-as-good perf vs DRAM speed, but they aren't that bad either. Thus for small copies... just don't use string stores.
All this is not even considering non-temporal loads/stores; many larger copies would see better perf by not trashing the L2 cache, since the destination or source is often not inspected right after. String stores don't have a non-temporal option, so this has to be done with vectors.
    with open('myfile') as f:
        data = f.read()

I'm not much of a C programmer myself, but I at least reported part of the issue to Python: https://bugs.python.org/issue45944

This is the fastest way to read a file in Python that I've found, using only 3-4 syscalls (though os.fstat() doesn't work for some special kernel files like those in /proc/ and /dev/):

    def read_file(path: str, size=-1) -> bytes:
        fd = os.open(path, os.O_RDONLY)
        try:
            if size == -1:
                size = os.fstat(fd).st_size
            return os.read(fd, size)
        finally:
            os.close(fd)

Maybe I don't need to query the file size at all?
Having a hook to get people to want to read the article is reasonable in my opinion; after all, if you could fit every detail in the size of a headline, you wouldn't need an article at all! Clickbait inverts this by _only_ having enough substance that you could get all the info in the headline, but instead it leaves out the one detail that's interesting and then pads it with fluff that you're forced to click and read through if you want the answer.
> In conclusion, the issue isn't software-related. Python outperforms C/Rust due to an AMD CPU bug.
Slack is allocating 1132 GB of virtual memory on my laptop right now. I don't know if they are using mmap but that's 1100 GB more than the physical memory.
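The distinction shows up directly in /proc on Linux: VmSize is reserved virtual address space (the 1132 GB number), while VmRSS is what actually sits in physical memory. A Linux-only sketch for inspecting your own process:

```python
import os

def vm_stats():
    # Parse VmSize (virtual address space) and VmRSS (resident memory),
    # both in kB, from the kernel's per-process status file.
    stats = {}
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(("VmSize:", "VmRSS:")):
                key, value = line.split()[:2]
                stats[key.rstrip(":")] = int(value)
    return stats

if os.path.exists("/proc/self/status"):
    print(vm_stats())
```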
Seems it's not without perils on Windows:
"In an ideal world, that would be all we have to say about the new solution. But for Windows users, there's a special quirk. On most operating systems, we can use a special flag to signal that we don't really care if the system has 32 GiB of real memory. Unfortunately, Windows has no convenient way to do this. Dolphin still works fine on Windows computers that have less than 32 GiB of RAM, but if Windows is set to automatically manage the size of the page file, which is the case by default, starting any game in Dolphin will cause the page file to balloon in size. Dolphin isn't actually writing to all this newly allocated space in the page file, so there are no concerns about performance or disk lifetime. Also, Windows won't try to grow the page file beyond the amount of available disk space, and the page file shrinks back to its previous size when you close Dolphin, so for the most part there are no real consequences... "
I'm impressed by your perseverance, how you follow through with your investigation to the lowest (hardware) level.
If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll follow the rules in the future. They're here: https://news.ycombinator.com/newsguidelines.html.
Rather, the performance issue only occurs when using `rep movsb` on AMD CPUs with certain page/data alignment.
Pymalloc just happens to be using page/data alignment that makes `rep movsb` happy while Rust's default allocator is using alignments that just happen to make `rep movsb` sad.
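The sensitivity to offsets can be probed without leaving Python, via ctypes.memmove (which calls into libc's copy routine). This is only a sketch: whether the libc copy actually takes the `rep movsb` path depends on the CPU and the libc build, but on the affected AMD parts certain offsets fall off the performance cliff while others stay flat:

```python
import ctypes
import time

def time_copy(offset: int, size: int = 64 * 1024, iters: int = 500) -> float:
    """Time a bulk copy with source and destination at a chosen byte offset
    inside over-sized buffers, so only the alignment varies between runs."""
    src = ctypes.create_string_buffer(size + 64)
    dst = ctypes.create_string_buffer(size + 64)
    s = ctypes.addressof(src) + offset
    d = ctypes.addressof(dst) + offset
    start = time.perf_counter()
    for _ in range(iters):
        ctypes.memmove(d, s, size)
    return time.perf_counter() - start

for off in (0, 8, 16, 32):
    print(off, round(time_copy(off), 4))
```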
This has nothing to do with Python or Rust.
>...
>Python features three memory domains, each representing different allocation strategies and optimized for various purposes.
>...
>Rust is slower than Python only on my machine.
If one library performs wildly better than the other in the same test, on the same hardware, how can that not be a software-related problem? Sounds like a contradiction.
Maybe it should be considered a coding issue and/or a missing feature? IMHO it would be expected that Rust's std library performs well without making all users circumvent the issue manually.
The article is well investigated, so I assume the author just wanted to show that the problem exists without creating controversy, because otherwise I can't understand it.
But since the Python runtime is written in C, the issue can't be Python vs C.
Not too long ago I read in Intel's optimization guidelines that rep was now faster again and should be used.
Seems most of these things needs to be benchmarked on the CPU, as they change "all the time". I've sped up plenty of code by just replacing hand crafted assembly with high-level functional equivalent code.
Of course so-slow-it's-bad is different, however a runtime-determined implementation choice would avoid that as well.
Whenever you're writing performance-critical software, you need to consider the relevant combinations of hardware + software + workload + configuration.
Sometimes a problem can be created or fixed by adjusting any one / some subset of those details.
Maybe using an alternative allocator only solves the problem by accident and there's another way to solve it intentionally; I don't yet fully understand the problem. My point is that using a different allocator by default was already tried.
I've honestly never worked in a domain where binary size ever really mattered beyond maybe invoking `strip` on a binary before deploying it, so I try to keep an open mind. That said, this has always been a topic of discussion around Rust[0], and while I obviously don't have anything against binary sizes being smaller, bugs like this do make me wonder about huge changes like switching the default allocator where we can't really test all of the potential side effects; next time, the unintended consequences might not be worth the tradeoff.
[0]: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...