I have no idea whether that matters, or whether it's even easy to measure...
It is reasonably easy to measure, and the GP is about right. I've measured a crossover point of around a few hundred items too. (Though I'm sure it'll vary depending on use case and whatnot.)
I made a rope data structure a few years ago in C. It's a fancy string data structure that supports inserts and deletes of characters at arbitrary offsets (designed for text editors). The implementation uses a skip list (which performs similarly to a B-tree). At every node we store an array of characters. To insert or delete, we traverse the structure to find the node at the requested offset, then (usually) memmove a bunch of characters at that node.
Q: How large should that per-node array be? A small number puts more burden on the skip list structure and the allocator, and incurs more cache misses. A large number is linearly slower because of all the time spent in memmove.
Benchmarking shows the ideal number is in the ballpark of 100-200, depending on CPU and some specifics of the benchmark itself. Cache misses are extremely expensive. Storing only a single character at each node (like the SGI C++ rope structure does) makes it run several times slower. (!!)
Code: https://github.com/josephg/librope
This is the constant to change if you want to experiment yourself:
https://github.com/josephg/librope/blob/81e1938e45561b0856d4...
In my opinion, hash tables, B-trees and the like in the standard library should probably swap to flat lists internally when the number of items in the collection is small. I'm surprised more libraries don't do that.
If I recall correctly, the STL provides guarantees that prevent it from taking advantage of flat lists. I think some containers (not arrays) guarantee that they won't move the address of whatever they're containing, even if you insert or remove elements. Even if they switched to flat lists, it would be a flat list of pointers, with all the incurred overhead.
Likewise, I believe I have heard of small string optimisation being impossible with std::string for similar reasons.
Not impossible: SSO is implemented to some extent in most mainstream C++ standard library implementations.
When I changed the hardcoded node size from 136 to 1024 bytes, time went down from 3.6 secs to 2.6 secs on my laptop. It kind of plateaued at 1024 bytes. I didn't do more extensive testing.
What's the rationale for the choice of 136? A cache line is usually 64, so the CPU will always end up loading a multiple of that in any case.
When I did a rope implementation myself (I don't know how it compares in terms of performance), I think I ended up with nodes of 4096 or 8192 bytes. That was based on a red-black tree. When I load ~1 GiB of data, even with a node size of 4096, there will still be 2^18 nodes, meaning each access requires something like 17 child traversals, which seems like a lot to me. So I can't see myself going down to 200 bytes or less.
https://ridiculousfish.com/blog/posts/array.html
In the sense of swapping the underlying implementation as the number of items in the collection increases.
I suspect for this reason it'd be better in Rust to use a Go-like hash map implementation [1] that keeps all the key/value information (size, Hash and Eq implementations) in a vtable-like form rather than be monomorphized, except in really hot inner loops where the specialization is worth it. There was an interesting article and discussion on reddit relating to this [2] where someone made a toy type-erased map (though not as nice as Go's) to measure the difference in compilation times. Maybe some day I'll make a more production-ready attempt...
[1] https://dave.cheney.net/2018/05/29/how-the-go-runtime-implem...
[2] https://www.reddit.com/r/rust/comments/g1skz4/an_experiment_...
Would you know if there's any source discussing how this is solved in different languages, and the performance consequences of it?
* In many languages, monomorphization is impractical. E.g., in C it requires macros or external code generators, and Java and Go don't even have macros. So it's rarely done. Instead, code uses type erasure, dynamic dispatch, more heap allocations, and tables like the ones described in that dave.cheney.net link. Maybe some JITs or even some ahead-of-time compilers do monomorphization internally in some cases, but I'm speculating rather than speaking from knowledge.
* In C++ and Rust, monomorphization is easy. You don't have to do it, but it's easy, so many people do. The containers in the standard library are monomorphized. This code can perform very well in any particular case, but in aggregate it's large enough that I have doubts about whether it's always worth it for the whole program.
L1 and L2 are per core. A cache line is 64 bytes, and per-core cache sizes are on the order of 32 KB for L1 and 256 KB for L2.
Reading data from L1 is on the order of 1 nanosecond (or less) and RAM on the order of 50 nanoseconds.
If you’re scanning an array and load a dozen cache lines, that’s almost certainly preferable to taking several cache misses chasing pointers.
Memory access is very often an application’s bottleneck. The answer is almost always more arrays and fewer pointers.
The number of people who dismiss the lowly array is way too high. Arrays are fast. Keep your data in the cache and it's 10x or more faster than going out to memory, and plain flat arrays are almost always faster than literally any OOP nonsense. By "almost always" I mean that I've never once encountered a situation where flat arrays weren't the fastest solution attempted, but I haven't seen everything, so I can't claim they're always fastest.
People really don't understand how fast their computers really are, thanks to developers not caring how fast their code is.
Linear arrays are often faster, not because they require fewer memory loads, but because the cache has intelligence built in (the prefetcher) that loads some memory in advance even before the code requests it. That intelligence works way better with linear array scans than with pointer chasing. (I'm not sure if these loads will count as cache misses or not).
Yes, for large enough N, a BST easily beats linear search.