I had something like a set<string> or a set<set<string>> (or map... I can't remember which) somewhere in my program a few years ago, and I was trying to improve the program's performance. I tried breaking into it several times and found it quite perplexing that the bottleneck appeared to be the set<> container. I mean, I get that cache locality and all has an effect, but it seemed to be having a little too much of an effect. Why did I find this odd? Because other languages (like C#) have tree-based sets and maps too, but I'd never felt they were quite as slow as I was seeing them in C++. So I felt something weird must be going on.
I tried to step through the program for a while and observe what's going on, and at some point, I realized (perhaps I hit the same breakpoint twice? I don't recall) that these functions were being called more often than I intuitively thought would be necessary. Which was obvious in hindsight, as less() needs to be called multiple times on the same objects (4 times at level 2). Now, this hadn't resulted in quadratic behavior, but that was only because my data structures weren't arbitrarily deep—the slowdown was nevertheless a significant constant factor at the core of my program, only linear because the data structure's depth was bounded.
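Concretely, the doubling looks like this. Here's a minimal sketch (not the paper's exact code) of a 3-way comparison derived from a 2-way "<" primitive, with a counter on the leaf comparisons:

```python
# Minimal sketch: derive a 3-way comparison from only "<" and count
# how many leaf "<" calls happen on (possibly nested) lists.

calls = 0

def lt(a, b):
    """2-way "<" on possibly-nested lists, counting leaf comparisons."""
    global calls
    if not isinstance(a, list):
        calls += 1
        return a < b
    for x, y in zip(a, b):
        if lt(x, y):          # is x < y?
            return True
        if lt(y, x):          # if not, is y < x? (else keep scanning)
            return False
    return len(a) < len(b)

def cmp_from_lt(a, b):
    """3-way comparison built from "<" alone: two lt() calls per level."""
    if lt(a, b):
        return -1
    if lt(b, a):
        return 1
    return 0

def count(a, b):
    global calls
    calls = 0
    cmp_from_lt(a, b)
    return calls

# Each level of nesting doubles the work on equal inputs:
# count(1, 1) == 2, count([1], [1]) == 4, count([[1]], [[1]]) == 8
```

Each wrapper list doubles the number of "<" calls, which is exactly the bounded-depth constant factor I was seeing.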
So once I had realized the implications of this, including that constant-factor differences can actually turn into polynomial ones, I eventually decided to post an article about it on arXiv... hence this article. Aside from the complexity issue illustrated in the article, one of my (unexpected) higher-level takeaways was that you can't just take any utility function in a program and blindly put it into a library: even if it's correct, you probably have hidden assumptions about its use cases that won't hold true in general. You really need to think about how it could be used in ways you didn't initially expect, and this may well require a different kind of code review with an entirely different mindset than before. It really drove home the point for me that writing a library is quite a bit different (and in some ways more difficult) than writing an application.
It's possible there is more theory lying underneath this that I haven't touched on—it would be interesting if someone can take this and find more interesting consequences of it. For example, maybe when analyzing algorithms we need to consider something akin to a "recursive time complexity", too? Is there another useful notion of complexity that we can generalize here?
Anyway, hope people enjoy reading it and take something useful away from it. :)
The PartialOrd trait also uses 3-way comparisons so I think the other issue is mitigated too, but it'd be interesting to check: https://doc.rust-lang.org/std/cmp/trait.PartialOrd.html#tyme...
The first one is the lt2 and lt3 implementations. You are implementing cmp2 through lt2, but lt3 through cmp3 (is this an omission?). Both of them are stack-sensitive. Without being too harsh, I'm getting the impression that the intention was to write the most horrible comparison possible, which is different from worst-case time complexity.
In the paper, lt2 (actually cmp2 in the paper) will always be at least two passes, and lt3 at least one pass. I would not call them two-pass/single-pass algorithms, because the complexity grows with list depth when nested lists are involved.
Maybe I'm wrong, but both Python and C++ comparison operators are designed to be general-purpose comparison functions (and I'm more sure about C++, because this was touted through hundreds of books). As such, they should be good enough for most average cases. If you want speed, you go with balanced trees or something funkier.
Also, for the C++ Tree implementation, you are again using probably the worst approach: appending to a vector recursively. Use a list for this. Python's list implements all sorts of tricks compared to C++'s vector.
And the last thing, but not the least: C++ containers depend on implementation (gcc libstdc++, stlport, msvc whatever), and I've seen substantial speed differences in standard operations. Hell, my old (almost conforming) list implementation was much faster than libstdc++ implementation because it wasn't trying to be too clever with slices and other magic.
I'm sad you haven't used a more scientific approach with much more rigor here: what C++ compiler was used, what version, what assembly output was produced, on what processor, after how many runs, etc... Claiming "Python is faster than C++" sounds like a clickbait title.
The context to keep in mind when reading the paper is: When designing a programming language & its standard library (or any API), we need to define an interface we can use as a building block, and we're analyzing the consequences of our choice of building blocks. In particular, we first examine the case of comparison-based data structures, which requires defining ordering primitives. In C++, the primitive is the < operator. In Python 2, it's cmp(). (In Python 3, it's a mix of < and ==, whose implications I discuss as well.) We assume user-provided types implement that basic interface, and we implement everything else we need on top of that.
So the question I'm analyzing in that example is: What happens if my primitive comparison operation is a 2-way comparison (lt(), like in C++) and then I implement 3-way comparison in terms of that (such as when I need it for a binary search tree)? Now, what if we do the opposite: what happens if instead my primitive comparison operation is a 3-way comparison (cmp(), like in Python 2) and I only need to perform a 2-way comparison later? What are the trade-offs?
To do this, I take both approaches, implementing each in terms of the other, and compare how they behave complexity-wise. The conclusion I come to is that the choice of the primitive (which is often made by the language designers) isn't merely a question of aesthetics, but rather, it actually affects the time complexity of common operations (like traversing search trees). Similarly, the decision to cache a hash code doesn't just result in a constant-factor improvement, but it can actually change the time complexity of a program. And so on.
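For the hash-caching point, a rough sketch (the `Node` types here are hypothetical, not from the paper): caching makes `__hash__` O(1) per call, while recomputing makes each call proportional to the subtree size.

```python
# Hypothetical sketch of hash caching vs. recomputation for a recursive type.

class CachedNode:
    def __init__(self, children=()):
        self.children = tuple(children)
        # Hash computed once at construction; children already cached theirs.
        self._hash = hash(("node", tuple(map(hash, self.children))))

    def __hash__(self):
        return self._hash  # O(1) per call

    def __eq__(self, other):
        return self is other or self.children == other.children

class UncachedNode:
    def __init__(self, children=()):
        self.children = tuple(children)

    def __hash__(self):
        # Recomputed on every call: walks the whole subtree each time.
        return hash(("node", tuple(map(hash, self.children))))

    def __eq__(self, other):
        return self is other or self.children == other.children
```

Hashing every subtree of a depth-n chain costs O(n) total with caching but O(n^2) without, even though each individual call looks like a constant-factor detail.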
I think if you re-read the paper with these in mind, it should hopefully make more sense. The rest of what you said doesn't enter the picture at all... these are already balanced binary trees, the decision to use less<> is fundamentally independent of what C++ stdlib implementation you use, and the time of the vector concatenation isn't even being measured. Those things are unrelated to the point of the paper entirely. I was just trying to minimize the extraneous lines of code so we can focus on the heart of the problem instead of getting distracted by boilerplate.
2. The C++ standard library's maps and sets are known to be rather slow. See, for example:
https://stackoverflow.com/q/42588264/1593077
when you have string values, it's even worse, as you describe. But it's not clear that an overly-clever implementation, which caches numeric ranks of strings etc., is a good idea to have.
Aren't unordered_set and unordered_map quite new (IIRC, they came only with C++0x)? For most of C++'s history, if you preferred to use the standard library, what you had was only ordered sets and maps.
1) when you're using native libraries via python that were written by better c/c++ programmers than you are and you're spending most of your time within them
2) when you're using native libraries in python that are better implementations than the c/c++ libraries you're comparing against
3) when you don't know the libraries you're using in c/c++ (what they're talking about here)
...otherwise, if you're just doing basic control flow, c/c++ with an optimizing compiler will almost always be faster (unless you're using numba or pypy or something).
point stands about the constants though. yes, asymptotic analysis will tell you how an algorithm's performance scales, but if you're just looking for raw performance, especially if you have "little data", you really have to measure, as implementation details of the software and hardware start to have a greater effect. (oftentimes the growth of execution time isn't even smooth in n for small n)
This is frustrating for programmers because everyone wants to focus on the cool part of a program and forgets how much the rest takes to write, debug, etc. There are many reasons why I prefer Rust but one of the biggest is simply that having a standard package manager means that you can get high-quality code as easily as in languages like Python or JavaScript and are more likely to avoid that pitfall of reinventing wheels because it initially seems easier than finding a library.
moreover, batteries included scripting languages like perl, python, matlab, etc all tend to have the benefit of having their core bits be best of breed. perl has/had one of the best re engines out there, matlab has a great blas all compiled with the best optimizing compiler and ready to go, python was more generic i suppose, with fairly strong libraries for most things (strings, io, network io, etc).
other than the microsoft nuget stuff, the c/c++ ecosystem never really had the benefit of anything like that other than boost which was pretty tough to pull into a given project and didn't really have the community of people writing high level libraries like the scripting languages did. that said, i often used to think it would have been interesting to build a language agnostic platform for language centered library communities. (a sort of generic cpan/pip/npm in a box for pulling down libraries and running tests for any language- a combination of build system, online community platform and search engine)
but the real moral of the story: use the libraries, luke/lucia! also, know them!
In other words, is “faster than C” even a good metric if all it means is “an efficient implementation in not-C beats an inefficient implementation in C”?
Sure, in any language that provides more semantic information than C does. For example, D enables a "pointer to immutable data" type, while C does not. This can improve optimization.
On a pragmatic note, C makes it easy to use 0-terminated strings, and clumsy to use length-terminated strings. The natural result is people use 0-terminated strings.
0-terminated strings are inefficient because of the constant need to strlen them. When I look to optimize C, I find plenty of paydirt in all the strlen instances. D makes it easy to use length-terminated strings, and so people naturally prefer them.
Wait, isn’t that what
const int *x;
does? I.e. a pointer to a constant.

The compiler might not even provide access to these capabilities to a programmer who knows about them and wants to explicitly use them. In such cases, the programmer might have to use assembly, or Fortran, C++ or even something much higher level than C with a compiler that provides access or knows how to "intuit" when those capabilities are useful.
Also, there's an aspect that I feel is constantly overlooked, and this is the way that objects are laid out in memory. In Python, JavaScript, Ruby, you have a lot of pointers to objects. You get a lot of pointer-chasing as a result. This is BAD for caches. Each object you touch, each pointer you dereference means pulling in a new cache line. In C, you can design very compact and flat data structures. You can use 32-bit floats, 8-bit, or even one-bit integers if you want. You can have a struct within a struct, with an array inside of it, all without any pointer-chasing.
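The boxing overhead is visible even from within Python itself. This sketch compares the memory footprint of a boxed list of floats against a flat `array` (the flat layout is roughly what C gives you for free):

```python
import sys
from array import array

n = 1000
boxed = [float(i) for i in range(n)]  # list of pointers to float objects
flat = array("d", range(n))           # contiguous raw doubles, C-style

# Total bytes: the list object itself plus every boxed float it points to.
boxed_bytes = sys.getsizeof(boxed) + sum(map(sys.getsizeof, boxed))
flat_bytes = sys.getsizeof(flat)      # includes the underlying buffer

# The flat layout is several times smaller, and iterating it touches
# consecutive cache lines instead of chasing a pointer per element.
```

(Exact byte counts vary by interpreter version, but the ratio is consistently large.)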
Modern CPUs are constrained by memory bandwidth, and it's very hard for any programming language to beat C on achievable memory efficiency. What's worse is that we have little to no academic compiler literature on automatically optimizing for memory efficient data layouts. It's an overlooked and understudied problem. You would have to prove that integer values lie within certain ranges (hard) and also do object inlining (collapsing objects into parents) which AFAIK is also hard and not done by any mainstream compiler.
So, yeah, keep thinking that a sufficiently-smart compiler will do everything for you. You will assuredly be very disappointed. Until we have strong AI, the sufficiently smart compiler is basically unobtainium. If you want efficiency, the best route is generally to have fewer layers of abstraction, or to rely only on compiler optimizations you know for certain will happen.
Also tasks that can be moved to the GPU go a lot faster. You can interact with those programs in C, but not natively. But some languages, like Julia, can easily move calculations to/from the GPU. And also can transparently take advantage of parallelism.
Julia is in the process of growing rapidly for high performance computing. I don't know if it has officially passed C there. But if not yet, it will.
The reason why C programs often are fast or run faster is because the language forces you to approach almost all abstraction head on - this goes both ways however, in that linked lists are much faster to write in C than a safe array type, so programs can end up with bad data structures for too long.
Since the lingua franca of the operating system is still C or C-like, there are actual optimization opportunities that go missing because of the C interface: It's hard to devise many alternatives, but if you are calling into libc for mathematics in a hot loop it may be worth avoiding it and rolling your own since the compiler can't really inline libc calls.
Well it might not be, but the fact that it "withholds a lot of information from the compiler" is an argument in favor of it being (a portable assembler), not the opposite.
This isn't true though. Just off the top of my head: C doesn't expose the carry flag directly, C doesn't let you write signed arithmetic using twos-complement overflow, C doesn't let you share a value between separate compilation units without allowing it to be mutated...
"faster" or "slower" is a pretty terrible metric anyway, given that most projects do not have infinite developer time available. But it's very hard to benchmark "equivalent developer effort" between two different languages.
Languages without arbitrary pointers don't have this issue and can safely assume they know all the assignments done to values in memory, allowing for optimizations.
For example, Rust can sometimes beat C because it's (1) often more friendly to auto-vectorization and (2) it has additional aliasing information.
Similarly in languages like Java or C#, having first-class exceptions means fewer branches & error checking on the hot path over something like C. They are on the whole slower than C for other reasons, but it's not because C is "the best" or "the fastest" at everything. And of course you can't really do de-virtualization optimizations in C.
It is true that C cannot do full template meta-programming or certain state save/restore optimizations faster than setjmp/longjmp. Hard/painful/trivial are all a bit more subjective and relative to exposure/practice/training/etc.
Personally, I think Nim [2] is a much friendlier way to go than either C++ or Python without the pitfalls of either, but the ecosystem is admittedly much smaller. Also, I've yet to hit something where re-writing the C++/Rust in Nim did not speed things up from "at least a little" to "quite a bit". { This is more an artifact of "performance incautious programming" in the C++/Rust. Too many people fall into the trap that, since a language has a reputation for speed, they don't need to think. This is probably largely why @mehrdadn's original article had the title it did. ;-) }
How would you write this (admittedly contrived) Rust function in C without invoking UB:
    pub fn foo(a: &mut i32, b: &mut i32) {
        let (new_a, new_b) = a.checked_mul(*b)
            .map(|new_a| (new_a, b.saturating_sub(*a)))
            .unwrap_or((10, 20));
        *a = new_a;
        *b = new_b;
    }
For those unfamiliar with Rust, it multiplies a by b, and if it didn't overflow:

* a = a * b
* b = b - a, saturating at the minimum value (as in, it won't wrap, it just stops there)

If it did overflow:

* a = 10
* b = 20
And finally, does it compile better: https://godbolt.org/z/66n8W9
The equivalent to checked_mul is __builtin_mul_overflow, which is a compiler builtin: https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.... Similarly, saturating_sub seems like it can be implemented with __builtin_sub_overflow.
That is, until things like C++'s stackless coroutines came about, which are a construct intrinsic to the compiler and not functionality directly exposed by C.
Further, any machine code language is going to allow you specific instruction access that a compiler might not otherwise utilize (rare, but it happens). In such cases you can gain 'manual' speedups over what C could allow you to do. I would hope that is the obvious exception, however.
But you are asking a very good question not a lot of developers are willing to think much about.
In my mind, different languages have different "optimization sound barriers", where further manual optimization becomes impossible, or at least runs into diminishing returns. For instance Javascript can be made nearly as fast as "vanilla" natively compiled C code, but that means writing asm.js by hand, which is entirely different from "idiomatic Javascript". It's not worth it writing asm.js by hand, when C compiled to asm.js (or WASM nowadays) is much more maintainable.
Same with garbage-collected languages in general. You can write fast code with very little overhead from the garbage collector, but this means you need to know exactly how the garbage collector of that language works, and the resulting code will usually look very different and will be harder to maintain than the idiomatic style of that language.
In standard C the optimization sound barrier is a bit further away than in many other high-level languages but it's not all that special, on the other hand C with compiler- and CPU-architecture-specific extensions is pretty good (e.g. extensions can push the sound barrier quite a bit further away).
Of course, C++'s eigen does loop fusion for you with template magic so to go really fast you probably want that.
Sometimes. I know 2 reasons.
1. In some higher-level languages, some problems can be efficiently solved with code generation. For instance, you can take some of the input data, generate code from that, then run the generated code processing the rest of the input data. Examples of the mechanism include Lisp macros, or .NET shenanigans like System.Reflection.Emit or System.Linq.Expressions.
It’s hard to generate good machine code. It's possible to do in C, but you're going to need to embed a C compiler for that. C compilers are extremely complex, and they were not designed for this use case; e.g. they're relatively slow (designed to run offline on fast developers’ computers, or on even faster build servers).
If your problem is a good fit for that approach, C# can actually be faster than C.
Parsing JSON is a good example. When your goal is not just parsing the syntax, but also populating structures with the data from the JSON, good C# JSON parsers will runtime-generate serialization code by reflecting over the structure types being populated. For an isolated program, technically you can generate equivalent C code offline. But if you’re making a general-purpose JSON serialization library, that’s borderline impossible to achieve in C; you need reflection and code generation.
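The same trick is available in Python via `exec`. Here's a toy sketch (all names invented for illustration) of generating a field-specific serializer once at runtime instead of reflecting on every call:

```python
def make_serializer(cls):
    # Generate a function specialized to cls's annotated fields, once.
    fields = list(cls.__annotations__)
    body = ", ".join(f'"{f}": obj.{f}' for f in fields)
    src = f"def serialize(obj):\n    return {{{body}}}"
    namespace = {}
    exec(src, namespace)  # runtime code generation, a la Reflection.Emit
    return namespace["serialize"]

class Point:
    x: int
    y: int
    def __init__(self, x, y):
        self.x, self.y = x, y

serialize_point = make_serializer(Point)
# serialize_point(Point(1, 2)) -> {"x": 1, "y": 2}
```

The generated function has no per-call reflection cost: it's straight-line attribute access, specialized to the type.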
2. Modern computers are heterogeneous: they have 2 chips computing stuff, the CPU and the GPU. Even when these pieces are on the same chip and using the same RAM, like Intel’s UHD graphics or AMD’s APUs, the GPU can still be faster for some problems. Not only because of more GFLOPS (that’s not necessarily true for integrated graphics), but also because GPUs have a better strategy for dealing with RAM latency: they switch threads instead of waiting for the data to arrive. CPU cores only run 1-2 hardware threads each, i.e. they're limited in that regard.
That’s how for some problems HLSL or OpenCL can actually be faster than C.
in 2021 bundling clang along with your program is actually reasonable - if you are compiling small functions without two tons of headers it's measured in milliseconds.
I'm not sure why you think that, especially with Rust: one of the reasons it can be theoretically faster is that it allows for even more undefined behavior and has a variety of datatypes that exist simply to inform the compiler of potential optimizations.
For instance, C function calls always do some things (like preserving the stack) that may not always be necessary.
I sure wouldn't want to TRY to beat a C compiler, but it seems obvious to me that it is possible.
Turing-completeness tells you that it is in fact necessarily true that you can convert any bit of C into a corresponding bit of Python, Lisp, or Haskell. One obvious approach would be to emit code that implements a C runtime.
For goto in specific, you don't even need to do that. You don't need a goto keyword to implement goto functionality.
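One standard way to get goto-like control flow without the keyword is a state-machine loop. A tiny sketch (example function chosen arbitrarily):

```python
def gcd_with_goto(a, b):
    # Each "label" becomes a state; "goto" is just assigning the next state.
    state = "loop"
    while True:
        if state == "loop":
            if b == 0:
                state = "done"       # goto done
            else:
                a, b = b, a % b
                state = "loop"       # goto loop
        elif state == "done":
            return a
```

This is essentially what compilers targeting goto-less languages emit: basic blocks plus a dispatch loop.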
how could you negate the python interpreter startup time though?
computed goto, non-aliasing guarantees, actual "const" - just a few off the top of my head.
Generic code is generally faster in C++ unless you manually reimplement it in C.
Rust can be faster than C in a few cases because of stronger aliasing guarantees, but a few compiler bugs prevent it.
Doing some good-performance stuff is also more difficult in C. Strings are null-terminated, etc.
What's much more important for good performance is memory layout and good use of CPU caches. And in this area managed languages struggle a lot. For instance, every object in Java has 16 bytes of overhead for an object header (on the 64-bit OpenJDK JVM). Or you can't organize objects in a flat array. Or there are some guarantees about memory zeroing which often lead to needless memory writes. Or you have to live with GC, which often wastes a lot of additional memory and regularly brings unused but reachable memory into caches. Project Valhalla will hopefully improve some of these limitations some day, but don't expect the level of C, C++ or Rust performance.
However, if you used an alternative toolchain that could also generate bytecode from C at runtime, then I would bet that C would stay on top or be equal.
I also recently gave numba a go, and it was significantly slower than vanilla python. I was surprised because the decorated function seemed exactly like the type of code that numba would be good at (iterating through a list of coordinates doing square roots and some quirky floating-point modulo math.)
Demo: https://ideone.com/CBIEAE
This creates ~60 objects, but takes something like 2^30 operations to resolve (it times out on this online runner, and takes around 5s on my laptop with -O3).
That's much worse than claimed in this paper! An accidentally-exponential algorithm is the kind of thing that makes DoS attacks trivial...
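A Python analogue of that demo (assumed to mirror the linked C++ snippet, which isn't reproduced here): with a comparison derived from "<" alone, every extra level of nesting doubles the leaf comparisons, so ~n objects cost ~2^n operations.

```python
calls = 0

def lt(a, b):
    # 2-way "<" on nested lists, counting leaf comparisons.
    global calls
    if not isinstance(a, list):
        calls += 1
        return a < b
    for x, y in zip(a, b):
        if lt(x, y):
            return True
        if lt(y, x):
            return False
    return len(a) < len(b)

def nest(depth, value=0):
    # nest(3) -> [[[0]]]: depth objects wrapping one scalar.
    for _ in range(depth):
        value = [value]
    return value

counts = []
for d in range(1, 8):
    calls = 0
    lt(nest(d), nest(d))
    counts.append(calls)
# counts == [2, 4, 8, 16, 32, 64, 128]: exponential in depth
```

Keep that doubling going to depth ~30 and you get the 2^30 operations the demo shows, from a handful of objects.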
I know this because I already received such criticism for my examples, with people telling me that it's unclear how wide-reaching the ramifications are in real-world applications (which I suppose is fair enough). People really want to see examples of real-world software being improved with such small tweaks in the algorithms, whereas I didn't have the bandwidth to try to investigate that, and just tried to settle for plausible examples. That criticism would be magnified many times further for DAGs and (especially) degenerate linked lists. (I'm not claiming these are the only relevant scenarios by the way—just saying this is how it is likely to be seen by many readers.) I moved on and didn't spend more time on this (it was kind of a random paper idea I had and not related to anything else), but I think it would be awesome to flesh this out further into something more interesting and compelling and properly publish it.
If you find this interesting and have the time to help joint-author & actually publish a paper on this, please grab my email from arXiv and email me! It would be great to flesh out more consequences and find more interesting examples together. I feel like there's a lot more underneath the surface that I might not have touched (both theoretically and practically), but I hadn't managed to gather enough interest from others in the topic (let alone the examples) to motivate me to look further until now!
Edit: The original title "Why Python is faster than C++" is much more clickbaity than the editorialized ("When Python is faster than C++")
    # Uses 3-way cmp() for primitives
    def cmp3(a, b):
        if not isinstance(a, list):
            global c; c += 1
            return cmp(a, b)
        for x, y in zip(a, b):
            r = cmp3(x, y)
            if r != 0:
                return r
        return 0
This isn't generally the expected behaviour for comparing lists, surely?

The paper points out that the convenience of just defining "less than", and having "equals" derived from that, can be costly. Specifically, for the recursively defined data structure (tree), a three-way comparison which is derived canonically from the two-way compare seems to entail not a linear but a quadratic number of comparisons.
What I don't understand is what is happening in 'lt2'.
this is what I'd expect for __lt__: lt(a, b), also known as a.__lt__(b), returns True iff a < b, elementwise for same-length lists, and False otherwise.
I also do understand cmp2. (a __eq__ b) iff not (a __lt__ b) and not (b __lt__ a)
so looking at cmp(a, b) = lt(a, b) - lt(b, a), I get:

    a < b:  1 - 0 ==> 1
    b < a:  0 - 1 ==> -1
    a == b: 0 - 0 ==> 0
which makes sense.

Now two questions arise with respect to the presented hypothesis and the paper:
1. why does the paper call lt2 twice, recursively?
2. why does the paper compare the performance of their lt2 and lt3 instead of the performance of cmp2 and cmp3?
I intuit that, when taking the double recursion out of lt2 (which imnsho is erroneous), and when comparing cmp2 and cmp3, we'll see a performance penalty of a factor of 2 between cmp2 and cmp3, and identical run times for lt2 and lt3, as it should be.
What am I missing?
edit: updated for clarity
Note that you do need to read the entire paper to see the overall picture and its consequences; if you stop halfway then it might appear like I'm comparing things unfairly. Feel free to let me know if anything is still confusing after reading the other comments and taking another look at the paper and I'll try to clarify!
1. The lt2 definition in the paper is wrong. A lexicographical compare is linear in the size. The derived cmp2 is correct and has a run time twice that of cmp3, which matches the STL definitions of lexicographical_compare; see below.
2. The C++ behaviour in 2.4.2 is puzzling and most likely a bug, worth reporting to and discussing with the STL implementors.
https://www.cplusplus.com/reference/vector/vector/operators/
http://www.cplusplus.com/reference/algorithm/lexicographical...
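For reference, here's the single-pass lexicographic "<" this comment has in mind, sketched with both "==" and "<" as primitives (Python-3 style; not the paper's code):

```python
def lt_linear(a, b):
    # Lexicographic "<" that probes equality first, so each element is
    # visited once rather than recursed into twice.
    if not isinstance(a, list):
        return a < b
    for x, y in zip(a, b):
        if x != y:
            return lt_linear(x, y)
    return len(a) < len(b)
```

(`!=` on lists still walks elements, but it short-circuits at the first difference, so you avoid the per-level doubling of the "<"-only derivation.)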
I wonder (genuinely asking, not being snarky) what it is about C/C++ that seems to make these issues more common? It's also possible my perception of "more common" has just been inflated by seeing multiple examples in a single week
This title states "Python is Faster Than C++", which neither signals that it's just an opinion nor that it isn't meant as an absolute statement. You have to figure out yourself that it's probably hyperbolic and just referring to special cases.
Of course this is just a preprint. If they ultimately publish it somewhere the editors/reviewers may make them give a more conservative title.
The C++ code shown is a great example: when you see very simple code in the middle of a paper talking about how a particular pattern fails badly, yes, you're primed to look for a problem, but if that showed up for you in code review, are you really confident that you'd say something other than “followed standard practice, maybe add braces around the if statement”?
The real lesson here is that nothing beats actually measuring your code to make sure you didn't miss something like this.