Downvotes coming from my fellow countrymen :) love you! I know… never say anything bad about the Vaterland
Which is why they're "everywhere"... in databases, especially columnar storage.
[edit]
Lecture slide with the term is also linked from TFA: https://15721.courses.cs.cmu.edu/spring2024/slides/05-execut...
I do agree with the immutable string argument. Mutable and immutable strings have different use cases and design tradeoffs. They perhaps shouldn't be the same type at all.
The transient string is particularly brilliant. I've worked with some low-level networking code in C, and being able to create a string containing the "payload" by pointing directly to an offset in the raw circular packet buffer is very clean. (The alternative is juggling offsets, or doing excessive memcpy.)
So beyond the database use case it's a clever string format.
It would be nice to have an ISO or equivalent specification on it though.
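To make the packet buffer case concrete, here is a minimal sketch using C++17 `std::string_view` as the non-owning "transient" string. The helper names `payload_view` and `to_owned` are made up for illustration, and the sketch assumes the payload does not wrap around the end of the ring buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: a non-owning view of `len` bytes starting at `offset`
// inside a raw packet buffer. No allocation and no memcpy happen here.
// Assumption: the payload does not wrap around the end of the ring buffer;
// a wrapped payload would have to be handled in two pieces or copied.
inline std::string_view payload_view(const char* buffer, std::size_t buffer_size,
                                     std::size_t offset, std::size_t len) {
    assert(offset <= buffer_size && len <= buffer_size - offset);
    return std::string_view(buffer + offset, len);
}

// If the payload has to outlive the buffer slot, copy it into memory we own.
inline std::string to_owned(std::string_view transient) {
    return std::string(transient);
}
```

The view is only valid until the buffer slot is reused; anything that needs to live longer goes through `to_owned` first.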
It's not anything special? That's just `string_view` (C++17). Java also used to do that as an optimisation, but because it was implicit and not trivial to notice, it caused difficult-to-diagnose memory leaks (IIRC it was introduced in Java 1.4 and removed in 1.7).
Just because something already exists in some language doesn't make it less clever. It's not very widespread, and it's very powerful when applicable.
This format can handle "string views" with the same logic as "normal strings" without relying on interfaces or inheritance overhead.
It's clever.
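A rough sketch of what "same logic, no interfaces" could look like, assuming the layout described in TFA: 4 bytes of length, then either 12 inline bytes or a 4-byte prefix plus an 8-byte pointer word. The class name, the method names, and packing the storage class into the top two bits of the pointer word are assumptions made for this sketch, not the actual Umbra code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string_view>

// Hypothetical 16-byte string: [len:4][inline content:12] when len <= 12,
// otherwise [len:4][prefix:4][pointer word:8].
class GermanString {
public:
    enum class Class : std::uint64_t { Persistent = 0, Transient = 1, Temporary = 2 };

    GermanString(const char* data, std::uint32_t len, Class cls) {
        std::memset(bytes_, 0, sizeof bytes_);
        std::memcpy(bytes_, &len, 4);
        if (len <= 12) {
            std::memcpy(bytes_ + 4, data, len);  // short: whole content inline
        } else {
            std::memcpy(bytes_ + 4, data, 4);    // long: keep a 4-byte prefix
            auto word = static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(data));
            assert((word >> 62) == 0);           // assumption: top two bits are free
            word |= static_cast<std::uint64_t>(cls) << 62;
            std::memcpy(bytes_ + 8, &word, 8);
        }
    }

    std::uint32_t size() const {
        std::uint32_t n;
        std::memcpy(&n, bytes_, 4);
        return n;
    }

    std::string_view prefix() const {
        std::uint32_t n = size();
        return std::string_view(bytes_ + 4, n < 4 ? n : 4);
    }

    // One code path for owned and borrowed data: no virtual call involved.
    std::string_view view() const {
        std::uint32_t n = size();
        if (n <= 12) return std::string_view(bytes_ + 4, n);
        auto addr = static_cast<std::uintptr_t>(pointer_word() & ~(3ULL << 62));
        return std::string_view(reinterpret_cast<const char*>(addr), n);
    }

    // Only meaningful for long strings (size() > 12).
    Class storage_class() const { return static_cast<Class>(pointer_word() >> 62); }

    friend bool operator==(const GermanString& a, const GermanString& b) {
        // Length and prefix are compared together as the first 8 bytes.
        if (std::memcmp(a.bytes_, b.bytes_, 8) != 0) return false;
        return a.view() == b.view();
    }

private:
    std::uint64_t pointer_word() const {
        std::uint64_t word;
        std::memcpy(&word, bytes_ + 8, 8);
        return word;
    }

    alignas(8) char bytes_[16];
};

static_assert(sizeof(GermanString) == 16, "the whole string fits in 16 bytes");
```

A transient string built over someone else's buffer and a string over memory you own compare equal through the same `operator==`; the storage class only matters when deciding whether the data must be copied before it goes away.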
> This is where transient strings come in. They point to data that is currently valid, but may become invalid later, e.g., when we swap out the page on which the payload is stored to disk after we’ve released the lock on the page.
> Creating them has virtually no overhead: They simply point to an externally managed memory location. No memory allocation or data copying is required during construction! When you access a transient string, the string itself won’t know whether the data it points to is still valid, so you as a programmer need to ensure that every transient string you use is actually still valid. So if you need to access it later, you need to copy it to memory that you control.
Hm. What if I don't bother with that and I just read from the transient string? It's probably still good.
> In C, strings are just a sequence of bytes with the vague promise that a \0 byte will terminate the string at some point.
> This is a very simple model conceptually, but very cumbersome in practice:
> What if your string is not terminated? If you’re not careful, you can read beyond the intended end of the string, a huge security problem!
This sounds like a problem that transient strings were designed to exemplify. How do they improve on the C model?
-----
I was interested that the short strings use a full 32-bit length field. That's a lot of potential length for a string of at most 12 characters.
If we shaved that down to the four bits necessary to represent a number from 0-12, we'd save 28 bits, which is 3.5 characters. Adding three characters to the content would bring the potential length of a short string up to 15, requiring 0 additional length bits. And we'd have four bits left over.
I assume we aren't worried about this because strings of length 13-15 are already rare and it adds a huge amount of complexity to parsing the string, but it was fun to think about.
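For anyone who wants to check the arithmetic, here it is as compile-time constants. This only restates the numbers above; the constant names are made up, and nothing here suggests the real format works this way.

```cpp
// Current layout of a short string: 128 bits total, 32 of them for the length.
constexpr int kTotalBits = 128;
constexpr int kLengthBitsNow = 32;
constexpr int kInlineBytesNow = (kTotalBits - kLengthBitsNow) / 8;

// Hypothetical packed layout: a 4-bit length field.
constexpr int kLengthBitsPacked = 4;
constexpr int kFreedBits = kLengthBitsNow - kLengthBitsPacked;
constexpr int kExtraBytes = kFreedBits / 8;                    // whole bytes only
constexpr int kInlineBytesPacked = kInlineBytesNow + kExtraBytes;
constexpr int kSpareBits = kFreedBits - 8 * kExtraBytes;

static_assert(kInlineBytesNow == 12, "12 inline bytes today");
static_assert(kFreedBits == 28, "28 bits freed, i.e. 3.5 bytes");
static_assert(kInlineBytesPacked == 15, "15 inline bytes with a 4-bit length");
static_assert(kInlineBytesPacked < (1 << kLengthBitsPacked), "4 bits can count 0..15");
static_assert(kSpareBits == 4, "four bits left over");
```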
I wonder if they also have the concept of a reverse string which stores the (reversed) suffix instead and stores the short strings backward.
Niche, but would be fast for heavy ends-with filters.
These are different because the inline segment is fixed-size, and always exposes a 4-byte prefix inline even when the buffer is stored out of line.
The main difference is that you don't know how many code points you have in the prefix: UTF-8 is a variable-length encoding, so the prefix can hold up to four code points but as few as one. I imagine the choice of four bytes for the prefix was actually made specifically for this reason. That's the maximum length of a UTF-8 code point.
The length is not the number of characters anymore but just the size of the string in bytes.
Apart from that, it should work exactly the same.
Also, for UTF-8 specifically, cutting code points in half is fine as long as all strings are valid UTF-8. The UTF-8 encoding is prefix-free, i.e., no valid code point is a prefix of another valid code point, so for prefix matching we can usually just compare bytes.
It only gets more complicated if you add collations or want to match case-insensitively. But at that point you need to take into account all edge cases of the Unicode spec anyway.
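A small sketch of the byte-wise approach, assuming both inputs are valid UTF-8 and no collation or case folding is involved. The function names are made up for illustration.

```cpp
#include <string_view>

// Byte-wise prefix test. For valid UTF-8 inputs this gives the same answer as
// comparing code point by code point, because no encoded code point is a
// prefix of another.
inline bool starts_with_bytes(std::string_view text, std::string_view prefix) {
    return text.size() >= prefix.size() &&
           text.compare(0, prefix.size(), prefix) == 0;
}

// The first (up to) four bytes of a string. This may stop in the middle of a
// multi-byte code point, which is harmless as long as it is only compared
// against another four-byte prefix taken the same way.
inline std::string_view four_byte_prefix(std::string_view s) {
    return s.substr(0, 4);
}
```

For example, "€€x" and "€€" share the same four-byte prefix even though the fourth byte is only the first byte of the second "€", while "€£" already differs in that fourth byte.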
This is how Borland Turbo Pascal stored strings as far back as the first version in the mid-80s.
Length followed by the string.
This is different from Pascal strings.
Pascal strings are: { length, pointer }
In these strings:
For short strings it's storing:
{ length, string value}
for longer strings, it's storing {length, prefix, class, pointer}
The historical P-strings are just a pointer, with the length at the head of the buffer. Hence length-prefixed strings, and their limitation to 255 bytes (only one byte was reserved for the length; you can still see this in the most basic string type of Free Pascal: https://www.freepascal.org/docs-html/ref/refsu9.html).
{length, pointer}
or {length, capacity, pointer}
are struct / record strings, which is what pretty much every modern language does (possibly with optimisations, e.g. SSO23 is basically a P-string when inline, but can move out of line into a full record string).
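To illustrate the two layouts side by side, a minimal sketch. `pstring_view`, `RecordString` and `record_view` are names invented here; real implementations differ in field order and widths.

```cpp
#include <cstddef>
#include <string_view>

// Length-prefixed P-string: the handle is a bare pointer, and the first byte
// of the buffer holds the length, hence the 255-byte limit.
inline std::string_view pstring_view(const unsigned char* p) {
    return std::string_view(reinterpret_cast<const char*>(p + 1), p[0]);
}

// Record string: length (and usually capacity) live next to the pointer in
// the handle itself rather than at the head of the buffer.
struct RecordString {
    std::size_t length;
    std::size_t capacity;
    char* data;
};

inline std::string_view record_view(const RecordString& s) {
    return std::string_view(s.data, s.length);
}
```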