But for MSVC and GCC we're told 16 characters, as each has a 16-byte buffer. However, they still need the obligatory terminating zero byte (ASCII NUL), so surely the table should show 15.
The most popular string in C++, the one which makes SSO almost obligatory as a design feature for the language's string object, is the empty string. On a modern (64-bit) machine Clang can store it inline in a local 24-byte object; MSVC and GCC need 32 bytes. Without SSO, an empty string would likely mean a 24-byte object plus a heap allocation. That's hard to swallow.
I wonder if anyone has ever tried an implementation that prioritizes even more minimal memory usage for small strings?
Of the three implementations discussed, the smallest (on a 64-bit system) is 24 bytes.
How about 8 bytes: the union of a single pointer and a char[8]?
For strings of 6 or fewer characters, it uses the first six bytes to store the characters, the seventh for the null terminator, and the last byte to store the length, but with a trick (see below).
For strings of 7 or more characters, it uses all 8 bytes to store the address of a larger block of memory holding the capacity, size, and string bytes.
The trick is to use the last byte as a way to know whether the bytes are encoding a short string or a pointer. Since pointers will never be odd, we just store an odd number in that last byte. For example, you could store `(size << 1) | 0x1`.
So if the last bit is 1, it's a short string. The size is `bytes[7] >> 1`, and the `data()` pointer is just the address of the object itself.
If the last bit is 0, treat the whole data as a pointer to a structure that encodes the capacity and size and then string bytes as usual.
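To make the trick concrete, here is a rough sketch of such a class (hypothetical, not production code: copy/move are omitted, and it assumes a little-endian 64-bit platform where heap pointers have a zero top byte, as canonical x86-64 user-space addresses do, so the tag byte of a stored pointer is always even):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Sketch of the 8-byte string proposed above. The last byte is either an
// odd tag encoding the inline size, or the (even) top byte of a pointer.
class TinyString {
    union {
        char  small[8];  // inline: up to 6 chars + NUL + tag byte
        char* big;       // heap block: [capacity][size][bytes...]
    };

    // Reading small[7] while big is active inspects the pointer's top byte;
    // char access to the object representation is tolerated in practice.
    bool is_small() const { return small[7] & 0x1; }

public:
    explicit TinyString(const char* s) {
        size_t n = std::strlen(s);
        if (n <= 6) {
            std::memcpy(small, s, n + 1);                      // chars + NUL
            small[7] = static_cast<char>((n << 1) | 0x1);      // odd tag
        } else {
            size_t cap = n + 1;
            char* block = static_cast<char*>(
                std::malloc(2 * sizeof(size_t) + cap));
            reinterpret_cast<size_t*>(block)[0] = cap;         // capacity
            reinterpret_cast<size_t*>(block)[1] = n;           // size
            std::memcpy(block + 2 * sizeof(size_t), s, n + 1); // bytes
            big = block;  // top byte assumed 0 (even) on x86-64
        }
    }

    ~TinyString() { if (!is_small()) std::free(big); }
    TinyString(const TinyString&) = delete;  // omitted for brevity

    size_t size() const {
        return is_small() ? static_cast<unsigned char>(small[7]) >> 1
                          : reinterpret_cast<size_t*>(big)[1];
    }

    const char* data() const {
        return is_small() ? small : big + 2 * sizeof(size_t);
    }
};
```

The whole object is still `sizeof(char*)` == 8 bytes, so a vector of these is dramatically denser than a vector of 24- or 32-byte strings.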
Generally, when you compact non-serialized structures in memory it’s for cache-related performance improvements, so unless the common case is overwhelmingly small strings there’s probably no benefit. It’s worth testing to see where this is an improvement, but my guess is that it pays off only in extremely limited regimes. That said, bit manipulation instead of a traditional byte scan could yield improvements, and likewise it may be worth going to larger SSO sizes to enable SIMD and cover more strings inline.
If you’re interested in data structure optimizations like this, there’s a CppCon talk about Facebook’s string implementation that has some clever tricks.
So for-loop conditions like
for (size_t i = 0; i < str.size(); i++)
are cheaper to calculate?

As much as these snippets make clang look heavier, I wonder what it compiles to in practice when the compiler can make better inferences. If you can prove the state of the `is_small` bit, those branches disappear. Even at runtime, which implementation is more performant? Real-world profiling may favor clang, with branch prediction and speculative execution. Then again, speculation has become a dirty word lately.[1]
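Roughly, the contrast looks like this (a sketch, not the real library layouts; the tag convention here, low bit set meaning "long", is an assumption for illustration):

```cpp
#include <cstddef>

// libstdc++/MSVC-style: the size lives in a dedicated member whether the
// string is short or long, so size() is a single unconditional load.
struct EagerSizeString {
    char*  ptr;
    size_t len;      // always valid
    char   buf[16];  // inline buffer for short strings
    size_t size() const { return len; }  // branchless
};

// libc++-style: the short form packs its size into a flag byte, so size()
// must branch (or use bit tricks) on the short/long flag.
struct PackedSizeString {
    union {
        struct { char flag_and_size; char data[23]; } s;  // short form
        struct { size_t cap; size_t len; char* ptr; } l;  // long form
    };
    // Assumed tag: low bit of the first byte set => long form.
    bool is_long() const { return s.flag_and_size & 0x1; }
    size_t size() const {
        return is_long() ? l.len
                         : static_cast<unsigned char>(s.flag_and_size) >> 1;
    }
};
```

In a loop, an optimizer can usually hoist either `size()` out of the condition, but the packed form gives it more work to prove.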
[1] Get it? "Dirty" because of the cache. I'm sorry, that pun was entirely unintentional.
In the Microsoft world, BSTR from COM/OLE does this. Though I think the length prefix is 4 bytes, and the payload is 16-bit wchars.
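For those who haven't met it: the real API is SysAllocString/SysStringLen/SysFreeString from `<oleauto.h>`, and the BSTR pointer hands you the payload while the 4-byte length prefix (a byte count) sits just before it. A portable imitation of that layout, with hypothetical `fake_*` names since the real functions are Windows-only:

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

using bstr_char = char16_t;  // stand-in for Windows' 16-bit wchar_t

// Layout: [4-byte length prefix in BYTES][payload chars][NUL terminator],
// with the returned pointer aimed just past the prefix, BSTR-style.
bstr_char* fake_SysAllocString(const bstr_char* src, uint32_t n_chars) {
    uint32_t n_bytes = n_chars * sizeof(bstr_char);
    char* block = static_cast<char*>(
        std::malloc(sizeof(uint32_t) + n_bytes + sizeof(bstr_char)));
    std::memcpy(block, &n_bytes, sizeof(uint32_t));  // length prefix
    bstr_char* payload =
        reinterpret_cast<bstr_char*>(block + sizeof(uint32_t));
    std::memcpy(payload, src, n_bytes);
    payload[n_chars] = u'\0';
    return payload;  // caller sees only the payload pointer
}

uint32_t fake_SysStringLen(const bstr_char* b) {  // length in characters
    uint32_t n_bytes;
    std::memcpy(&n_bytes,
                reinterpret_cast<const char*>(b) - sizeof(uint32_t),
                sizeof(uint32_t));
    return n_bytes / sizeof(bstr_char);
}

void fake_SysFreeString(bstr_char* b) {
    std::free(reinterpret_cast<char*>(b) - sizeof(uint32_t));
}
```

The payoff is the same as SSO's size member: length queries are O(1) loads, while the trailing NUL keeps the payload usable where a plain wide C string is expected.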
When placing a bet on real workload performance, I’d take those attributes every day of the week and twice on Sundays.
Not sure about "smaller": the cache line is 32 bytes or larger on all modern 64-bit CPUs.
Though really, the performance bottlenecks in string-heavy code tend to be in the heap, not the string library itself.
Clang's developers are not stupid. If they did that instead of copying what GCC or MSVC did (both predate Clang), there is a good reason, and my guess is that it is better aligned with how modern architectures and optimizers work. Anyway, it's a tradeoff, but on a modern high-performance computer I would intuitively side with Clang.
Modern 64-bit CPUs have cache lines no smaller than 32 bytes.