undefined | Better HN

0 pointsburntsushi5y ago0 comments

ack might be an example. It's Perl, not Python, and its author is on record as saying that performance isn't his goal. So it's a bit of a strained one. But yes, it's true, I don't know any other serious grep clone in a language like Python. This is why I hedged everything initially by saying that I know that absence of evidence isn't evidence of absence. :-) And in particular, I framed this as, "I would learn something," rather than, "this is objective fact." So long as my standard is my own experience, the hand wavy aspect of this works a bit better IMO.

> I do not know of a competitor who has put the care into such a tool to even try to measure, though.

Right. Like for example, I am certain enough about my claim that I would never even attempt to do it in the first place. I would guess that others think the same. With that said, people have written grep's in Python and the like, and last time I checked, they were very slow. But yeah, the "development effort" angle of this likely makes such tools inappropriate for a serious comparison to support my claim. But then again, if I'm right, the development effort required to make a Python grep be as fast as ripgrep is insurmountable.

> it all seems pretty hot spot optimizable to me

As long as we're okay with being hand wavy, then I would say that it's unlikely. Many of the optimizations in ripgrep have to do with amortizing allocation, and that kind of optimization is just nearly completely absent in a language like Python unless you drop down into C. This amortization principle is pervasive and applies as deep as regex internals to the code the simply prints ripgrep's output (which is in and of itself a complex beast and quite performance sensitive in workloads with lots of matches), and oodles of stuff inbetween.

> { And I love rg. Many thanks and no offense at all was intended! }

:-) No offense taken. This is by far the best convo I'm having in this HN post. Lol.

Note that I've made similar claims before. In the last one, there is a lot more data: https://news.ycombinator.com/item?id=17943509

0 comments

cb3215y ago

When I used to write in Cython + NumPy I would pre-allocate numpy arrays written into by Cython. It's C-like, but because of the gradual typing I think firmly in the higher level (for some value of "er"). One can certainly do that stuff in Nim/SBCL/etc. (and one sees it done).

While allocation is pretty pervasive, I'm skeptical that everywhere or even most places you do it is an important perf bottleneck. Without a count of these 20 times it matters and these 40 it doesn't, it's just kind guesswork from an all too often frail human memory/attention that "ignores the noise" by its very nature. You might be right. Just trying to add some color. :-)

Another way to think of this is to imagine your own codebase "in reverse". "If I drop this optim, would I see it on that profile?" Or look at the biggest piles of code in your repo and ask "Is this in the critical path/really perf necessary?" and the like. Under the assumption that higher level things would be a lot shorter that kind of thought experiment can inform. Maybe an approach toward more objectivity, anyway. Little strewn about tidbits in every module don't really count { to me :-) } - that speaks more to abstraction problems.

But I don't think there is a lot of value in all the above gendankenizing. While I realize some bad "just throw money at it" kicked this off, one of my big objections to the entire framing is that I think people and their APIs really "alter the level" of a language. Indeed their experience with the language has big impact there. Every one reading this knows C's `printf(fmt, arg1, arg2,..)`. Yet, I'd bet under 1% have heard of/thought to do an allocating (or preallocated) string builder variant like `str(sub1, sub2, ..., NULL)` or using/acquiring something like glibc's `asprintf`. People will say "C is low level - It has no string concatenation!". Yet, inside of an hour or two most medium-skill devs could write my above variadic string builder or learn about vasprintf. Or learn about Boehm-Wiser for garbage collected C or 100 other things like that CTL I mentioned elsewhere in this thread.

So what "level" is C, the language? Beats me. Does it have concatenation? Well, not spelled "a+b" but maybe spelled not much worse "str(a,b,NULL)". Level all just depends so much on how you use it. Performance is similar. Much C++ (and Rust for that matter) is terribly inefficient because of reputations for being "fast languages" leading to less care (or maybe just being done by junior devs..). These "depends" carry over to almost anything..not just Rust or C, but sometimes even English. I am usually told I write in much too detailed a way and a trimmer way might have higher persuasion/communication performance! { How's that for "meta"? ;-) }

> This is by far the best convo I'm having in this HN post. Lol.

Cool, cool. There can be a lot of "Rust Rage" out there (in both directions, probably). :)

Anyway, I don't think we'll resolve anything objective here, but don't take a lack of response as indicating anything other than that. You aren't making any strong objective claims to really rebut and I'm glad that you personally undertook the challenge to do ripgrep in any language. I do think many might have done..maybe Ada, too, and probably many more, but maybe all at the same "realized levelness". You just did not know them/feel confident about getting peformance in them. Which is fine. A.Ok, even! I guess your other biggy is Go and that might actually not have worked of all the alternatives bandied about by pjmlp and myself so far.

burntsushiOP5y ago

> While allocation is pretty pervasive, I'm skeptical that everywhere or even most places you do it is an important perf bottleneck. Without a count of these 20 times it matters and these 40 it doesn't, it's just kind guesswork from an all too often frail human memory/attention that "ignores the noise" by its very nature. You might be right. Just trying to add some color. :-)

In general I agree. But I'm saying what I'm saying because of all the times I've had to change my code to amortize allocation rather than not do it. It's just pervasive because there are all sorts of little buffers everywhere in different parts of the code. And those were put there because of experimenting that said the program benefited from them.

The key here is that the loops inside of ripgrep can grow quite large pretty quickly. There's the obvious "loop over all files," and then there's "loop over all lines" and then "loop over all matches." ripgrep has to do work in each of those loops and sometimes the work requires allocation. Even allocations at the outermost loop (looping over all files) can cause noticeable degradations in speed for some workloads.

This is why I'm so certain.

The numpy example is a good one where a substantial amount of code has been written to cater to one very specific domain. And in that domain, it's true, you can write programs that are very fast.

> So what "level" is C, the language?

Oh I see, I don't think I realized you wanted to go in this direction. I think I would just say that I absolutely agree that describing languages as "levels" is problematic. There's lots of good counter examples and what not. For example, one could say that Rust is both high level and low level and still be correct.

But like, for example, I would say that "Python is high level" is correct and "Python is low level" is probably not. But they are exceptionally general statements and I'm sure counter-examples exist. They are, after all, inherently relativistic statements, so your baseline matters.

That's kind of why I've stayed in "hand wavy" territory here. If we wanted to come down to Earth, we could, for example, replace "high level languages" in the poster's original statement with something more precise but also more verbose that this discussion still largely fits.

> I am usually told I write in much too detailed a way and a trimmer way might have higher persuasion/communication performance! { How's that for "meta"? ;-) }

Yeah, it's hard to be both pithy and precise. So usually when one is pithy, it's good to take the charitable interpretation of it. But we are technical folks, and chiming in with clarifications is to be expected.

> I don't think we'll resolve anything objective here

Most definitely. At the end of the day, I have a prior about what's possible in certain languages, and if that prior is proven wrong, then invariably, my mental model gets updated. Some priors are stronger than others. :-)

> You aren't making any strong objective claims to really rebut

Right. Or rather, my claims are rooted in my own experience. If we were going to test this, we'd probably want to build a smaller model of ripgrep in Rust, then try implementing that in various languages and see how far we can get. The problem with that is that the model has to be complex enough to model some reasonable real world usage. As you remove features from ripgrep, so to do you remove the need for different kinds of optimizations. For example, if ripgrep didn't have replacements or didn't do anything other than memory map files, then that's two sources of alloc amortization that aren't needed. So ultimately, doing this test would be expensive. And that's ignoring the quibbling folks will ultimately have about whether or not it's fair.

> I guess your other biggy is Go and that might actually not have worked of all the alternatives bandied about by pjmlp and myself so far.

I would guess Go would have a much better shot than Python. But even Go might be tricky. Someone tried to write a source code line counter in Go, put quite a bit of effort into it, and couldn't get over the GC hurdle: https://boyter.org/posts/sloc-cloc-code/ (subsequent blog posts on the topic discuss GC as well).

cb3215y ago

I feel we've talked past each other about what is/is not Python a few times. There is Cython and Pythran and Pypy and ShedSkin and Numba and others that are targeting, for lack of a more precise term, "extreme compatibility with" CPython, but also trying to provide an escape hatch for performance which includes in-language low levelness including allocation tricks that are not "mainstream CPython" (well, Pypy may not have those...).

My first reply was questioning "what counts" as "Python". Cython is its own language, not just "C", nor just "Python", but can do "low level things" such as using C's alloca. Maybe the only prior update here is on the diversity of "Python" impls. There are a lot. This is another reason why language levelness is hard to pin down which was always my main point, upon which we do not disagree. Maybe this is what you meant by "exceptionally general", but I kinda feel like "there isn't just one 'Python'" got lost. { There used to be a joke.."Linux isn't" related to the variety of distros/default configs/etc. :-) }

Advice-wise, I would say that your claim can be closer to easily true if you adjust it to say "ripgrep needs 'low level tricks' to be fast and a language that allows them, such as Rust". That phrasing side-steps worrying about levelnesses in the large of programming languages, re-assigns it to techniques which is more concrete and begs the question of technique enumeration. That is the right question to beg, though, if not in this conversation then in others. You might learn how each and every technique has representation in various other programming languages. It's late for me, though. So, good night!

1 more reply

j / k navigate · click thread line to collapse

0 comments

cb3215y ago

> This is by far the best convo I'm having in this HN post. Lol.

Cool, cool. There can be a lot of "Rust Rage" out there (in both directions, probably). :)

burntsushiOP5y ago

This is why I'm so certain.

The numpy example is a good one where a substantial amount of code has been written to cater to one very specific domain. And in that domain, it's true, you can write programs that are very fast.

> So what "level" is C, the language?

> I am usually told I write in much too detailed a way and a trimmer way might have higher persuasion/communication performance! { How's that for "meta"? ;-) }

> I don't think we'll resolve anything objective here

> You aren't making any strong objective claims to really rebut

> I guess your other biggy is Go and that might actually not have worked of all the alternatives bandied about by pjmlp and myself so far.

cb3215y ago

1 more reply

j / k navigate · click thread line to collapse