I had something like a set<string> or a set<set<string>> (or map... I can't remember which) somewhere in my program a few years ago, and I was trying to improve the program's performance. I tried breaking into it several times and found it quite perplexing that the bottleneck appeared to be the set<> container. I mean, I get that cache locality and all has an effect, but it seemed to be having a little too much of an effect. Why did I find this odd? Because other languages (like C#) have tree-based sets and maps too, but I'd never felt they were quite as slow as I was seeing them in C++. So I felt something weird must be going on.
I tried to step through the program for a while and observe what's going on, and at some point, I realized (perhaps I hit the same breakpoint twice? I don't recall) that these functions were being called more often than I intuitively thought would be necessary. Which was obvious in hindsight, as less() needs to be called multiple times on the same objects (4 times at level 2). Now, this hadn't resulted in quadratic behavior, but that was only because my data structures weren't arbitrarily deep—the slowdown was nevertheless a significant constant factor at the core of my program, only linear because the data structure's depth was bounded.
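Concretely, the doubling looks like this. Here's a minimal sketch (not the paper's exact code) of a 3-way comparison derived from a 2-way "<" primitive, with a counter on the leaf comparisons:

```python
# Minimal sketch: derive a 3-way comparison from only "<" and count
# how many leaf "<" calls happen on (possibly nested) lists.

calls = 0

def lt(a, b):
    """2-way "<" on possibly-nested lists, counting leaf comparisons."""
    global calls
    if not isinstance(a, list):
        calls += 1
        return a < b
    for x, y in zip(a, b):
        if lt(x, y):          # is x < y?
            return True
        if lt(y, x):          # if not, is y < x? (else keep scanning)
            return False
    return len(a) < len(b)

def cmp_from_lt(a, b):
    """3-way comparison built from "<" alone: two lt() calls per level."""
    if lt(a, b):
        return -1
    if lt(b, a):
        return 1
    return 0

def count(a, b):
    global calls
    calls = 0
    cmp_from_lt(a, b)
    return calls

# Each level of nesting doubles the work on equal inputs:
# count(1, 1) == 2, count([1], [1]) == 4, count([[1]], [[1]]) == 8
```

Each wrapper list doubles the number of "<" calls, which is exactly the bounded-depth constant factor I was seeing.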
So once I had realized the implications of this, including that constant-factor differences can actually turn into polynomial ones, I eventually decided to post an article about it on arXiv... hence this article. Aside from the complexity issue illustrated in the article, one of my (unexpected) higher-level takeaways was that you can't just take any utility function in a program and blindly put it into a library: even if it's correct, you probably have hidden assumptions about its use cases that won't hold true in general. You really need to think about how it could be used in ways you didn't initially expect, and this may well require a different kind of code review with an entirely different mindset than before. It really drove home the point for me that writing a library is quite a bit different (and in some ways more difficult) than writing an application.
It's possible there is more theory lying underneath this that I haven't touched on—it would be interesting if someone can take this and find more interesting consequences of it. For example, maybe when analyzing algorithms we need to consider something akin to a "recursive time complexity", too? Is there another useful notion of complexity that we can generalize here?
Anyway, hope people enjoy reading it and take something useful away from it. :)
The PartialOrd trait also uses 3-way comparisons so I think the other issue is mitigated too, but it'd be interesting to check: https://doc.rust-lang.org/std/cmp/trait.PartialOrd.html#tyme...
The first one is the lt2 and lt3 implementations. You are implementing cmp2 through lt2, but lt3 through cmp3 (is this an omission?). Both of them are stack-sensitive. Without being too harsh, I'm getting the impression that the intention was to write the most horrible comparison possible, which is different from worst-case time complexity.
In the paper, lt2 (actually cmp2 in the paper) will always be at least two passes, and lt3 at least one pass. I would not call them two-pass/single-pass algorithms, because the complexity grows with list depth when nested lists are involved.
Maybe I'm wrong, but both Python and C++ comparison operators are designed to be general-purpose comparison functions (and I'm more sure about C++, because this was touted through hundreds of books). As such, they should be good enough for most average cases. If you want speed, you go with balanced trees or something funkier.
Also, for the C++ Tree implementation, you are again using probably the worst approach: appending to a vector recursively. Use a list for this. Python's list implements all sorts of tricks compared to C++'s vector.
And the last thing, but not the least: C++ containers depend on implementation (gcc libstdc++, stlport, msvc whatever), and I've seen substantial speed differences in standard operations. Hell, my old (almost conforming) list implementation was much faster than libstdc++ implementation because it wasn't trying to be too clever with slices and other magic.
I'm sad you haven't used a more scientific approach with much more rigor here: what C++ compiler was used, what version, what assembly output was produced, on what processor, after how many runs, etc... Claiming "Python is faster than C++" sounds like a clickbait title.
The context to keep in mind when reading the paper is: When designing a programming language & its standard library (or any API), we need to define an interface we can use as a building block, and we're analyzing the consequences of our choice of building blocks. In particular, we first examine the case of comparison-based data structures, which requires defining ordering primitives. In C++, the primitive is the < operator. In Python 2, it's cmp(). (In Python 3, it's a mix of < and ==, whose implications I discuss as well.) We assume user-provided types implement that basic interface, and we implement everything else we need on top of that.
So the question I'm analyzing in that example is: What happens if my primitive comparison operation is a 2-way comparison (lt(), like in C++) and then I implement 3-way comparison in terms of that (such as when I need it for a binary search tree)? Now, what if we do the opposite: what happens if instead my primitive comparison operation is a 3-way comparison (cmp(), like in Python 2) and I only need to perform a 2-way comparison later? What are the trade-offs?
To do this, I take both approaches, implementing each in terms of the other, and compare how they behave complexity-wise. The conclusion I come to is that the choice of the primitive (which is often made by the language designers) isn't merely a question of aesthetics, but rather, it actually affects the time complexity of common operations (like traversing search trees). Similarly, the decision to cache a hash code doesn't just result in a constant-factor improvement, but it can actually change the time complexity of a program. And so on.
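For the hash-caching point, a rough sketch (the `Node` types here are hypothetical, not from the paper): caching makes `__hash__` O(1) per call, while recomputing makes each call proportional to the subtree size.

```python
# Hypothetical sketch of hash caching vs. recomputation for a recursive type.

class CachedNode:
    def __init__(self, children=()):
        self.children = tuple(children)
        # Hash computed once at construction; children already cached theirs.
        self._hash = hash(("node", tuple(map(hash, self.children))))

    def __hash__(self):
        return self._hash  # O(1) per call

    def __eq__(self, other):
        return self is other or self.children == other.children

class UncachedNode:
    def __init__(self, children=()):
        self.children = tuple(children)

    def __hash__(self):
        # Recomputed on every call: walks the whole subtree each time.
        return hash(("node", tuple(map(hash, self.children))))

    def __eq__(self, other):
        return self is other or self.children == other.children
```

Hashing every subtree of a depth-n chain costs O(n) total with caching but O(n^2) without, even though each individual call looks like a constant-factor detail.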
I think if you re-read the paper with these in mind, it should hopefully make more sense. The rest of what you said doesn't enter the picture at all... these are already balanced binary trees, the decision to use less<> is fundamentally independent of what C++ stdlib implementation you use, and the time of the vector concatenation isn't even being measured. Those things are unrelated to the point of the paper entirely. I was just trying to minimize the extraneous lines of code so we can focus on the heart of the problem instead of getting distracted by boilerplate.
2. The C++ standard library's maps and sets are known to be rather slow. See, for example:
https://stackoverflow.com/q/42588264/1593077
when you have string values, it's even worse, as you describe. But it's not clear that an overly-clever implementation, which caches numeric ranks of strings etc., is a good idea to have.
Aren't unordered_set and unordered_map quite new (IIRC, they came only with C++0x)? For most of C++'s history, if you preferred to use the standard library, what you had was only ordered sets and maps.
1) when you're using native libraries via python that were written by better c/c++ programmers than you are and you're spending most of your time within them
2) when you're using native libraries in python that are better implementations than the c/c++ libraries you're comparing against
3) when you don't know the libraries you're using in c/c++ (what they're talking about here)
...otherwise, if you're just doing basic control flow, c/c++ with an optimizing compiler will almost always be faster (unless you're using numba or pypy or something).
point stands about the constants though. yes, asymptotic analysis will tell you how an algorithm's performance scales, but if you're just looking for raw performance, especially if you have "little data", you really have to measure, as implementation details of the software and hardware start to have a greater effect. (oftentimes the growth of execution time isn't even smooth in n for small n)
This is frustrating for programmers because everyone wants to focus on the cool part of a program and forgets how much the rest takes to write, debug, etc. There are many reasons why I prefer Rust but one of the biggest is simply that having a standard package manager means that you can get high-quality code as easily as in languages like Python or JavaScript and are more likely to avoid that pitfall of reinventing wheels because it initially seems easier than finding a library.
moreover, batteries included scripting languages like perl, python, matlab, etc all tend to have the benefit of having their core bits be best of breed. perl has/had one of the best re engines out there, matlab has a great blas all compiled with the best optimizing compiler and ready to go, python was more generic i suppose, with fairly strong libraries for most things (strings, io, network io, etc).
other than the microsoft nuget stuff, the c/c++ ecosystem never really had the benefit of anything like that other than boost which was pretty tough to pull into a given project and didn't really have the community of people writing high level libraries like the scripting languages did. that said, i often used to think it would have been interesting to build a language agnostic platform for language centered library communities. (a sort of generic cpan/pip/npm in a box for pulling down libraries and running tests for any language- a combination of build system, online community platform and search engine)
but the real moral of the story: use the libraries, luke/lucia! also, know them!
In other words, is “faster than C” even a good metric if all it means is “an efficient implementation in not-C beats an inefficient implementation in C”?
Sure, in any language that provides more semantic information than C does. For example, D enables a "pointer to immutable data" type, while C does not. This can improve optimization.
On a pragmatic note, C makes it easy to use 0-terminated strings, and clumsy to use length-terminated strings. The natural result is people use 0-terminated strings.
0-terminated strings are inefficient because of the constant need to strlen them. When I look to optimize C, I find plenty of paydirt in all the strlen instances. D makes it easy to use length-terminated strings, and so people naturally prefer them.
Wait, isn’t that what
const int *x;
does? I.e. a pointer to a constant.

The compiler might not even provide access to these capabilities to a programmer who knows about them and wants to explicitly use them. In such cases, the programmer might have to use assembly, or Fortran, C++ or even something much higher level than C with a compiler that provides access or knows how to "intuit" when those capabilities are useful.
Also, there's an aspect that I feel is constantly overlooked, and this is the way that objects are laid out in memory. In Python, JavaScript, Ruby, you have a lot of pointers to objects. You get a lot of pointer-chasing as a result. This is BAD for caches. Each object you touch, each pointer you dereference means pulling in a new cache line. In C, you can design very compact and flat data structures. You can use 32-bit floats, 8-bit, or even one-bit integers if you want. You can have a struct within a struct, with an array inside of it, all without any pointer-chasing.
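The boxing overhead is visible even from within Python itself. This sketch compares the memory footprint of a boxed list of floats against a flat `array` (the flat layout is roughly what C gives you for free):

```python
import sys
from array import array

n = 1000
boxed = [float(i) for i in range(n)]  # list of pointers to float objects
flat = array("d", range(n))           # contiguous raw doubles, C-style

# Total bytes: the list object itself plus every boxed float it points to.
boxed_bytes = sys.getsizeof(boxed) + sum(map(sys.getsizeof, boxed))
flat_bytes = sys.getsizeof(flat)      # includes the underlying buffer

# The flat layout is several times smaller, and iterating it touches
# consecutive cache lines instead of chasing a pointer per element.
```

(Exact byte counts vary by interpreter version, but the ratio is consistently large.)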
Modern CPUs are constrained by memory bandwidth, and it's very hard for any programming language to beat C on achievable memory efficiency. What's worse is that we have little to no academic compiler literature on automatically optimizing for memory efficient data layouts. It's an overlooked and understudied problem. You would have to prove that integer values lie within certain ranges (hard) and also do object inlining (collapsing objects into parents) which AFAIK is also hard and not done by any mainstream compiler.
So, yeah, keep thinking that a sufficiently-smart compiler will do everything for you. You will assuredly be very disappointed. Until we have strong AI, the sufficiently smart compiler is basically unobtainium. If you want efficiency, the best route is generally to have fewer layers of abstraction, or to rely only on compiler optimizations you know for certain will happen.
Also tasks that can be moved to the GPU go a lot faster. You can interact with those programs in C, but not natively. But some languages, like Julia, can easily move calculations to/from the GPU. And also can transparently take advantage of parallelism.
Julia is in the process of growing rapidly for high performance computing. I don't know if it has officially passed C there. But if not yet, it will.
The reason why C programs often are fast or run faster is because the language forces you to approach almost all abstraction head on - this goes both ways however, in that linked lists are much faster to write in C than a safe array type, so programs can end up with bad data structures for too long.
Since the lingua franca of the operating system is still C or C-like, there are actual optimization opportunities that go missing because of the C interface: It's hard to devise many alternatives, but if you are calling into libc for mathematics in a hot loop it may be worth avoiding it and rolling your own since the compiler can't really inline libc calls.
Well it might not be, but the fact that it "withholds a lot of information from the compiler" is an argument in favor of it being (a portable assembler), not the opposite.
This isn't true though. Just off the top of my head: C doesn't expose the carry flag directly, C doesn't let you write signed arithmetic using twos-complement overflow, C doesn't let you share a value between separate compilation units without allowing it to be mutated...
"faster" or "slower" is a pretty terrible metric anyway, given that most projects do not have infinite developer time available. But it's very hard to benchmark "equivalent developer effort" between two different languages.
Languages without arbitrary pointers don't have this issue and can safely assume they know all the assignments done to values in memory, allowing for optimizations.
For example, Rust can sometimes beat C because it's (1) often more friendly to auto-vectorization and (2) it has additional aliasing information.
Similarly in languages like Java or C#, having first-class exceptions means fewer branches & error checking on the hot path over something like C. They are on the whole slower than C for other reasons, but it's not because C is "the best" or "the fastest" at everything. And of course you can't really do de-virtualization optimizations in C.
It is true that C cannot do full template meta-programming or certain state save/restore optimizations faster than setjmp/longjmp. Hard/painful/trivial are all a bit more subjective and relative to exposure/practice/training/etc.
Personally, I think Nim [2] is a much friendlier way to go than either C++ or Python without the pitfalls of either, but the ecosystem is admittedly much smaller. Also, I've yet to hit something where re-writing the C++/Rust in Nim did not speed things up from "at least a little" to "quite a bit". { This is more an artifact of "performance incautious programming" in the C++/Rust. Too many people fall into the trap that, since a language has a reputation for speed, they don't need to think. This is probably largely why @mehrdadn's original article had the title it did. ;-) }
How would you write this (admittedly contrived) Rust function in C without invoking UB:
    pub fn foo(a: &mut i32, b: &mut i32) {
        let (new_a, new_b) = a.checked_mul(*b)
            .map(|new_a| (new_a, b.saturating_sub(*a)))
            .unwrap_or((10, 20));
        *a = new_a;
        *b = new_b;
    }
For those unfamiliar with Rust, it multiplies a by b, and if it didn't overflow:

* a = a * b
* b = b - a, saturating at the minimum value (as in, it won't wrap, it just stops there)

If it did overflow:

* a = 10
* b = 20
And finally, does it compile better: https://godbolt.org/z/66n8W9
The equivalent to checked_mul is __builtin_mul_overflow, which is a compiler builtin: https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.... Similarly, saturating_sub seems like it can be implemented with __builtin_sub_overflow.
That is, until things like C++'s stackless coroutines came about, which are a construct intrinsic to the compiler and not functionality directly exposed by C.
Further, any machine code language is going to allow you specific instruction access that a compiler might not otherwise utilize (rare, but it happens). In such cases you can gain 'manual' speedups over what C could allow you to do. I would hope that is the obvious exception, however.
But you are asking a very good question not a lot of developers are willing to think much about.
In my mind, different languages have different "optimization sound barriers", where further manual optimization becomes impossible, or at least runs into diminishing returns. For instance Javascript can be made nearly as fast as "vanilla" natively compiled C code, but that means writing asm.js by hand, which is entirely different from "idiomatic Javascript". It's not worth it writing asm.js by hand, when C compiled to asm.js (or WASM nowadays) is much more maintainable.
Same with garbage-collected languages in general. You can write fast code with very little overhead from the garbage collector, but this means you need to know exactly how the garbage collector of that language works, and the resulting code will usually look very different and will be harder to maintain than the idiomatic style of that language.
In standard C the optimization sound barrier is a bit further away than in many other high-level languages but it's not all that special, on the other hand C with compiler- and CPU-architecture-specific extensions is pretty good (e.g. extensions can push the sound barrier quite a bit further away).
Of course, C++'s eigen does loop fusion for you with template magic so to go really fast you probably want that.
Sometimes. I know 2 reasons.
1. In some higher-level languages, some problems can be efficiently solved with code generation. For instance, you can take some of the input data, generate code from that, then run the generated code processing the rest of the input data. Examples of the mechanism include Lisp macros, or .NET shenanigans like System.Reflection.Emit or System.Linq.Expressions.
It’s hard to generate good machine code. It's possible to do in C, but you're going to need to embed a C compiler for that. C compilers are extremely complex, and they were not designed for this use case; e.g. they're relatively slow (designed to run offline on fast developers’ computers, or on even faster build servers).
If your problem is a good fit for that approach, C# can actually be faster than C.
Parsing JSON is a good example. When your goal is not just parsing the syntax, but also populating structures with the data from the JSON, good C# JSON parsers will runtime-generate serialization code by reflecting over the structure types being populated. For an isolated program, technically you can generate equivalent C code offline. But if you’re making a general-purpose JSON serialization library, that’s borderline impossible to achieve in C; you need reflection and code generation.
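The same trick is available in Python via `exec`. Here's a toy sketch (all names invented for illustration) of generating a field-specific serializer once at runtime instead of reflecting on every call:

```python
def make_serializer(cls):
    # Generate a function specialized to cls's annotated fields, once.
    fields = list(cls.__annotations__)
    body = ", ".join(f'"{f}": obj.{f}' for f in fields)
    src = f"def serialize(obj):\n    return {{{body}}}"
    namespace = {}
    exec(src, namespace)  # runtime code generation, a la Reflection.Emit
    return namespace["serialize"]

class Point:
    x: int
    y: int
    def __init__(self, x, y):
        self.x, self.y = x, y

serialize_point = make_serializer(Point)
# serialize_point(Point(1, 2)) -> {"x": 1, "y": 2}
```

The generated function has no per-call reflection cost: it's straight-line attribute access, specialized to the type.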
2. Modern computers are heterogeneous: they have 2 chips computing stuff, the CPU and the GPU. Even when these pieces are on the same chip and using the same RAM, like Intel’s UHD graphics or AMD’s APUs, the GPU can still be faster for some problems. Not only because of more GFLOPS (that’s not necessarily true for integrated graphics), but also because GPUs have a better strategy for dealing with RAM latency: they switch threads instead of waiting for the data to arrive. CPU cores only run 1-2 hardware threads each, i.e. they're limited in that regard.
That’s how for some problems HLSL or OpenCL can actually be faster than C.
in 2021 bundling clang along with your program is actually reasonable - if you are compiling small functions without two tons of headers it's measured in milliseconds.
I'm not sure why you think that, especially with Rust: one of the reasons it can be theoretically faster is that it allows for even more undefined behavior and has a variety of datatypes that exist simply to inform the compiler of potential optimizations.
For instance, C function calls always do some things (like preserving the stack) that may not always be necessary.
I sure wouldn't want to TRY to beat a C compiler, but it seems obvious to me that it is possible.
Turing-completeness tells you that it is in fact necessarily true that you can convert any bit of C into a corresponding bit of Python, Lisp, or Haskell. One obvious approach would be to emit code that implements a C runtime.
For goto in specific, you don't even need to do that. You don't need a goto keyword to implement goto functionality.
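One standard way to get goto-like control flow without the keyword is a state-machine loop. A tiny sketch (example function chosen arbitrarily):

```python
def gcd_with_goto(a, b):
    # Each "label" becomes a state; "goto" is just assigning the next state.
    state = "loop"
    while True:
        if state == "loop":
            if b == 0:
                state = "done"       # goto done
            else:
                a, b = b, a % b
                state = "loop"       # goto loop
        elif state == "done":
            return a
```

This is essentially what compilers targeting goto-less languages emit: basic blocks plus a dispatch loop.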
how could you negate the python interpreter startup time though?
computed goto, non-aliasing guarantees, actual "const" - just a few off the top of my head.
Generic code is generally faster in C++ unless you manually reimplement it in C.
Rust can be faster than C in a few cases because of stronger aliasing guarantees, but a few compiler bugs prevent it.
Doing some good-performance stuff is also more difficult in C. Strings are null-terminated, etc.
What's much more important for good performance is memory layout and good use of CPU caches. And in this area managed languages struggle a lot. For instance, every object in Java has 16 bytes of overhead for an object header (on the 64-bit OpenJDK JVM). Or you can't organize objects in a flat array. Or there are some guarantees about memory zeroing which often lead to needless memory writes. Or you have to live with GC, which often wastes a lot of additional memory and regularly brings unused but reachable memory into caches. Project Valhalla will hopefully improve some of these limitations some day, but don't expect the level of C, C++ or Rust performance.
However, if you used an alternative toolchain that could also generate bytecode from C at runtime, then I would bet that C would stay on top or be equal.
I also recently gave numba a go, and it was significantly slower than vanilla python. I was surprised because the decorated function seemed exactly like the type of code that numba would be good at (iterating through a list of coordinates doing square roots and some quirky floating-point modulo math.)
Demo: https://ideone.com/CBIEAE
This creates ~60 objects, but takes something like 2^30 operations to resolve (it times out on this online runner, and takes around 5s on my laptop with -O3).
That's much worse than claimed in this paper! An accidentally-exponential algorithm is the kind of thing that makes DoS attacks trivial...
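A Python analogue of that demo (assumed to mirror the linked C++ snippet, which isn't reproduced here): with a comparison derived from "<" alone, every extra level of nesting doubles the leaf comparisons, so ~n objects cost ~2^n operations.

```python
calls = 0

def lt(a, b):
    # 2-way "<" on nested lists, counting leaf comparisons.
    global calls
    if not isinstance(a, list):
        calls += 1
        return a < b
    for x, y in zip(a, b):
        if lt(x, y):
            return True
        if lt(y, x):
            return False
    return len(a) < len(b)

def nest(depth, value=0):
    # nest(3) -> [[[0]]]: depth objects wrapping one scalar.
    for _ in range(depth):
        value = [value]
    return value

counts = []
for d in range(1, 8):
    calls = 0
    lt(nest(d), nest(d))
    counts.append(calls)
# counts == [2, 4, 8, 16, 32, 64, 128]: exponential in depth
```

Keep that doubling going to depth ~30 and you get the 2^30 operations the demo shows, from a handful of objects.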
I know this because I already received such criticism for my examples, with people telling me that it's unclear how wide-reaching the ramifications are in real-world applications (which I suppose is fair enough). People really want to see examples of real-world software being improved with such small tweaks in the algorithms, whereas I didn't have the bandwidth to try to investigate that, and just tried to settle for plausible examples. That criticism would be magnified many times further for DAGs and (especially) degenerate linked lists. (I'm not claiming these are the only relevant scenarios by the way—just saying this is how it is likely to be seen by many readers.) I moved on and didn't spend more time on this (it was kind of a random paper idea I had and not related to anything else), but I think it would be awesome to flesh this out further into something more interesting and compelling and properly publish it.
If you find this interesting and have the time to help joint-author & actually publish a paper on this, please grab my email from arXiv and email me! It would be great to flesh out more consequences and find more interesting examples together. I feel like there's a lot more underneath the surface that I might not have touched (both theoretically and practically), but I hadn't managed to gather enough interest from others in the topic (let alone the examples) to motivate me to look further until now!
Edit: The original title "Why Python is faster than C++" is much more clickbaity than the editorialized ("When Python is faster than C++")
    # Uses 3-way cmp() for primitives
    def cmp3(a, b):
        if not isinstance(a, list):
            global c; c += 1
            return cmp(a, b)
        for x, y in zip(a, b):
            r = cmp3(x, y)
            if r != 0:
                return r
        return 0
This isn't generally the expected behaviour for comparing lists, surely?

The paper points out that the convenience of just defining "less than", and having "equals" derived from that, can be costly. Specifically, for the recursively defined data structure (tree), a three-way comparison which is derived canonically from the two-way compare seems to entail not a linear but a quadratic number of comparisons.
What I don't understand is what is happening in 'lt2'.
this is what I'd expect for __lt__: lt(a, b), also known as a.__lt__(b), returns True iff a < b, elementwise for same-length lists, and False otherwise.
I also do understand cmp2. (a __eq__ b) iff not (a __lt__ b) and not (b __lt__ a)
so looking at cmp(a, b) = lt(a, b) - lt(b, a), I get:

    a < b:  1 - 0 ==> 1
    b < a:  0 - 1 ==> -1
    a == b: 0 - 0 ==> 0
which makes sense.

Now two questions arise with respect to the presented hypothesis and the paper:
1. why does the paper call lt2 twice, recursively?
2. why does the paper compare the performance of their lt2 and lt3 instead of the performance of cmp2 and cmp3?
I intuit that, when taking the double recursion out of lt2 (which imnsho is erroneous), and when comparing cmp2 and cmp3, we'll see a performance penalty of a factor of 2 between cmp2 and cmp3, and identical run times for lt2 and lt3, as it should be.
What am I missing?
edit: updated for clarity
Note that you do need to read the entire paper to see the overall picture and its consequences; if you stop halfway then it might appear like I'm comparing things unfairly. Feel free to let me know if anything is still confusing after reading the other comments and taking another look at the paper and I'll try to clarify!
1. The lt2 definition in the paper is wrong. A lexicographical compare is linear in the size. The derived cmp2 is correct and has a run time twice that of cmp3, which matches the STL definitions of lexicographical_compare; see below.
2. The C++ behaviour in 2.4.2 is puzzling and most likely a bug, worth reporting to and discussing with the STL implementors.
https://www.cplusplus.com/reference/vector/vector/operators/
http://www.cplusplus.com/reference/algorithm/lexicographical...
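For reference, here's the single-pass lexicographic "<" this comment has in mind, sketched with both "==" and "<" as primitives (Python-3 style; not the paper's code):

```python
def lt_linear(a, b):
    # Lexicographic "<" that probes equality first, so each element is
    # visited once rather than recursed into twice.
    if not isinstance(a, list):
        return a < b
    for x, y in zip(a, b):
        if x != y:
            return lt_linear(x, y)
    return len(a) < len(b)
```

(`!=` on lists still walks elements, but it short-circuits at the first difference, so you avoid the per-level doubling of the "<"-only derivation.)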
I wonder (genuinely asking, not being snarky) what it is about C/C++ that seems to make these issues more common? It's also possible my perception of "more common" has just been inflated by seeing multiple examples in a single week
This title states "Python is Faster Than C++", which neither signals that it's just an opinion nor that it isn't meant as an absolute statement. You have to figure out yourself that it's probably hyperbolic and just referring to special cases.
Of course this is just a preprint. If they ultimately publish it somewhere the editors/reviewers may make them give a more conservative title.
The C++ code shown is a great example: when you see very simple code in the middle of a paper talking about how a particular pattern fails badly, yes, you're primed to look for a problem, but if that showed up for you in code review, are you really confident that you'd say something other than “followed standard practice, maybe add braces around the if statement”?
The real lesson here is that nothing beats actually measuring your code to make sure you didn't miss something like this.