I have no idea whether that matters, or whether it's even easy to measure...
It is reasonably easy to measure, and the GP is about right. I've measured a crossover point of around a few hundred items too. (Though I'm sure it'll vary depending on use case and whatnot.)
I made a rope data structure a few years ago in C. It's a fancy string data structure that supports inserts and deletes of characters at arbitrary offsets (designed for text editors). The implementation uses a skip list (which performs similarly to a B-tree). At every node we store an array of characters. To insert or delete, we traverse the structure to find the node at the requested offset, then (usually) memmove a bunch of characters at that node.
Q: How large should that per-node array be? A small number puts more burden on the skip list structure and the allocator, and incurs more cache misses. A large number is linearly slower because of all the time spent in memmove.
Benchmarking shows the ideal number is in the ballpark of 100-200, depending on CPU and some specifics of the benchmark itself. Cache misses are extremely expensive. Storing only a single character at each node (like the SGI C++ rope structure does) makes it run several times slower. (!!)
Code: https://github.com/josephg/librope
This is the constant to change if you want to experiment yourself:
https://github.com/josephg/librope/blob/81e1938e45561b0856d4...
In my opinion, hash tables, B-trees and the like in the standard library should probably swap to flat lists internally when the number of items in the collection is small. I'm surprised more libraries don't do that.
If I recall correctly, the STL provides guarantees that prevent it from taking advantage of flat lists. I think some containers (not arrays) guarantee that they won't move the address of whatever they're containing, even if you insert or remove elements. Even if they switched to flat lists, it would be a flat list of pointers, with all the incurred overhead.
Likewise, I believe I have heard of small string optimisation being impossible with std::string for similar reasons.
Not impossible: SSO is implemented to some extent in most mainstream C++ standard library implementations.
When I changed the hardcoded node size from 136 to 1024 bytes, time went down from 3.6 secs to 2.6 secs on my laptop. It kind of plateaued at 1024 bytes. I didn't do more extensive testing.
What's the rationale for the choice of 136? A cache line is usually 64, so the CPU will always end up loading a multiple of that in any case.
When I did a rope implementation myself (I don't know how it compares in terms of performance), I think I ended up with nodes of 4096 or 8192 bytes. That was based on a red-black tree. When I load ~1 GiB of data, even with a node size of 4096, there will still be 2^18 nodes, meaning each access requires something like 17 child traversals, which seems like a lot to me. So I can't see myself going down to 200 bytes or less.
https://ridiculousfish.com/blog/posts/array.html
In the sense of swapping the underlying implementation as the number of items in the collection increases.
I suspect for this reason it'd be better in Rust to use a Go-like hash map implementation [1] that keeps all the key/value information (size, Hash and Eq implementations) in a vtable-like form rather than be monomorphized, except in really hot inner loops where the specialization is worth it. There was an interesting article and discussion on reddit relating to this [2] where someone made a toy type-erased map (though not as nice as Go's) to measure the difference in compilation times. Maybe some day I'll make a more production-ready attempt...
[1] https://dave.cheney.net/2018/05/29/how-the-go-runtime-implem...
[2] https://www.reddit.com/r/rust/comments/g1skz4/an_experiment_...
Would you know if there's any source discussing how this is solved in different languages, and the performance consequences of it?
* In many languages, monomorphization is impractical. E.g., in C it requires macros or external code generators, and Java and Go don't even have macros. So it's rarely done. Instead, code uses type erasure, dynamic dispatch, more heap allocations, and tables like the ones described in that dave.cheney.net link. Maybe some JITs or even some ahead-of-time compilers do monomorphization internally in some cases, but I'm speculating rather than speaking from knowledge.
* In C++ and Rust, monomorphization is easy. You don't have to do it, but it's easy, so many people do. The containers in the standard library are monomorphized. This code can perform very well in any particular case, but in aggregate it's large enough that I have doubts about whether it's always worth it for the whole program.
L1 and L2 are per core. A cache line is 64 bytes, and per-core cache sizes are on the order of 32 KB for L1 and 256 KB for L2.
Reading data from L1 is on the order of 1 nanosecond (or less) and RAM on the order of 50 nanoseconds.
If you’re scanning an array and load a dozen cache lines, that’s almost certainly preferable to taking several cache misses chasing pointers.
Memory access is very often an application’s bottleneck. The answer is almost always more arrays and fewer pointers.
The number of people who dismiss the lowly array is way too high. Arrays are fast. Keep your data in the cache and it's 10x or more faster than going out to memory, and plain flat arrays are almost always faster than literally any OOP nonsense. By "almost always" I mean that I've never once encountered a situation where flat arrays weren't the fastest solution attempted, but I haven't seen everything, so I can't claim they're always fastest.
People really don't understand how fast their computers really are, thanks to developers not caring how fast their code is.
Linear arrays are often faster, not because they require fewer memory loads, but because the cache has intelligence built in (the prefetcher) that loads some memory in advance even before the code requests it. That intelligence works way better with linear array scans than with pointer chasing. (I'm not sure if these loads will count as cache misses or not).
Yes, for large enough N, a BST easily beats linear search.