undefined | Better HN

0 pointsRochus1y ago0 comments

That is why we should always use a standardized, controlled benchmark suite, which has well-defined rules to assure fair cross-language comparisons with a representative, well-balanced workload. By focusing on a core set of language features and abstractions, Are-we-fast-yet allows for a more controlled comparison of language implementation performance, isolating the effects of compiler and runtime optimizations.

This is especially important for scripting languages like Python, where a large part of the features are implemented in C or other native languages and called via FFI. That's why, for example, the benchmark implements its own collections, because we want to know how fast the interpreter is. Otherwise, as you have noticed, the result is randomly influenced by how much compute a particular application can delegate to the FFI.

0 comments

masklinn1y ago

> That's why, for example, the benchmark implements its own collections, because we want to know how fast the interpreter is. Otherwise, as you have noticed, the result is randomly influenced by how much compute a particular application can delegate to the FFI.

That sounds like the exact opposite of what I would want as a user of the language: the benchmark completely abstracts the actual behaviour of the runtime, claiming purported gains which don’t come anywhere near manifesting when trying to run actual software.

I’m not implementing my own collections when `dict` suffices, and I don’t really care that a pure python version of `re` runs faster in graal than in cpython, because I’m not using that.

So what happens is I see claims that graalpython runs 17 times faster than cpython, I try it out, it runs 6 times slower instead, and I can only conclude that graal is a worthless pile of lies and I should stop caring.

RochusOP1y ago

If you don't know exactly what you are measuring, the measurement is worthless. We must therefore isolate the measurement subject for the measurement, and avoid uncontrollable influences as far as possible. This is how engineering works, and every engineer should also be aware of measurement errors. In addition, repeatability and falsifiability of the experiment and conclusions are required for scientific claims. The mere statement "too slow to be acceptable" or "worthless pile of lies" is not enough for this.

A measurement method does not have to represent every practical application of the measured subject. In the present case, the measurement allows a statement to be made about the performance of the interpreter (CPython) in relateion to the JIT compiler (GraalPy). Whether the technology is right for your specific application or not is another question.

igouy1y ago

Their point seems to be a tautology: If what you are measuring is worthless, the measurement is worthless.

Whether the measurement is in-some-sense formally correct or not is another question.

1 more reply

j / k navigate · click thread line to collapse