Never create Ruby strings longer than 23 characters

(patshaughnessy.net)

54 points | ctaglia | 12y ago | 59 comments

nly 12y ago

This is known as the "small string optimisation" in C++, so you can see a similar implementation in Clang's libc++[1].

One interesting corollary is that moving short strings in an implementation that does this could actually be ever so slightly (negligibly) slower than moving long ones (since byte copies are slower than word copies). But generally, this is a free lunch optimisation and can save you hundreds of megs of memory when writing programs dealing with millions of short strings.

[1] http://llvm.org/svn/llvm-project/libcxx/trunk/include/string - search for "union"

Someone 12y ago

http://www.slideshare.net/nirusuma/what-lies-beneath-the-bea... (from March 2012) also discusses this.

Also (pedantic):

   #define RSTRING_EMBED_LEN_MAX ((int)((sizeof(VALUE)*3)/sizeof(char)-1))

sizeof(char) is always 1, so that division is superfluous.

BudVVeezer 12y ago

sizeof(char) is implementation defined; see limits.h for more information on the probable size of char for your target. If the sizeof(char) is 1, the division will be optimized away, so there's no loss by keeping the code portable.

Sharlin 12y ago

No, the size of char (in bits) is implementation defined, but sizeof(char) is defined to be 1, no matter what its size in bits.

chollida1 12y ago

No this is incorrect, see:

http://stackoverflow.com/q/4562249/25981

sizeof(char) is always defined to be one. This can't be altered by a conforming compiler.

EpicEng 12y ago

Wrong. sizeof(char) is defined to be one. The number of bits in a byte (char) is implementation defined (this is why CHAR_BIT exists). Not the same thing.
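
To make the distinction in the replies above concrete, here is a minimal sketch in C11. The two compile-time checks state what the standard guarantees; `bits_in` is a hypothetical helper added only for illustration.

```c
#include <limits.h>
#include <stddef.h>

/* sizeof(char) is 1 by definition on every conforming compiler. */
_Static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");

/* What can vary is how wide that one byte is; the standard only
 * requires at least 8 bits, and CHAR_BIT reports the actual width. */
_Static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

/* Hypothetical helper: number of bits occupied by n_bytes bytes
 * on the platform this is compiled for. */
static size_t bits_in(size_t n_bytes)
{
    return n_bytes * CHAR_BIT;
}
```

So dividing by sizeof(char) in the macro never changes the result; a platform with wider chars would show up in CHAR_BIT, not in sizeof.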

danielweber 12y ago

More like "ruby optimizes for short strings, and chose 23 as the cut-off point for Reasons."

yapcguy 12y ago

Can't wait for someone to write a new faster better string class which handles strings of any length by internally chopping them into 23 character portions....

fat0wl 12y ago

lol yes....... how reasonable. ahah when i saw the title of the article all i could think was "click comments to tune in for the most amusing flamewar this week" but so many of these comments are like "so... its 23 characters.... why not?!"

comeon peopleeeee, a bit of an arbitrary internal standard, no?

i understand the point about "CONCLUSION: it doesn't matter for a few strings!" but.... comeonnnn it must matter on some level, otherwise why is Rails such a pain in the ass to optimize? these things must add up...

vidarh 12y ago

I hope that is meant as a joke.

pothibo 12y ago

More to the point, ruby always uses strings with more than 23 characters. It's strings that are passed to the client, and an HTML page is almost always bigger than 23 characters.

Xylakant 12y ago

Ruby is a general purpose scripting language that can be used for web development (rails, sinatra) but is often used for different purposes (puppet, chef, vagrant, shoes, ...).

And even if you'd assume web development as the only purpose, there are a lot of strings that are shorter than 23 characters: Headers for requests and responses, form fields passed by the client (usernames, passwords, ...), field names passed in hashes, table and column names, template and file names, URLs or even the occasional, totally rare string in a json structure. It's an optimization with major gain and little loss.

ben0x539 12y ago

There's some discussion at https://news.ycombinator.com/item?id=3425164 , including some interesting technical/benchmarky comments.

ra88it 12y ago

Title: "Never create Ruby strings longer than 23 characters"

Conclusion: "Don’t worry! I don’t think you should refactor all your code to be sure you have strings of length 23 or less."

spoiler 12y ago

This is MRI (C Ruby) behaviour rather than a property of the Ruby language, though. It is still interesting information.

anon4 12y ago

Wouldn't it be better to use this declaration though:

    struct RString {
      struct RBasic basic;
      union {
        struct {
          long len;
          char *ptr;
          union {
            long capa;
            VALUE shared;
          } aux;
        } heap;
        char ary[];
      } as;
    };

    /* apologies if I messed up the syntax here */
    #define RSTRING_EMBED_LEN_MAX (sizeof(((struct RString*)(0))->as) - 1)

Then you can even use the padding the compiler added, if any, plus you can add more things to heap and the embed length will grow automatically.
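
A flexible array member inside a union is not standard C, so the declaration above may not compile as posted. Below is a sketch of the same idea that does compile: the heap representation gets a name so the embedded buffer can be sized from it. `struct RStringHeap` and the simplified `struct RBasic` are stand-ins invented for this example, not MRI's actual definitions.

```c
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t VALUE;        /* stand-in for MRI's VALUE */

struct RBasic {                 /* simplified stand-in: flags plus class */
    VALUE flags;
    VALUE klass;
};

/* Hypothetical name for the heap representation, so that its size
 * (including any compiler padding) can be taken directly. */
struct RStringHeap {
    long len;
    char *ptr;
    union {
        long capa;
        VALUE shared;
    } aux;
};

struct RString {
    struct RBasic basic;
    union {
        struct RStringHeap heap;
        char ary[sizeof(struct RStringHeap)];   /* embedded buffer */
    } as;
};

/* One byte of the buffer is reserved for the terminating NUL. */
#define RSTRING_EMBED_LEN_MAX \
    ((int)(sizeof(((struct RString *)0)->as.ary) - 1))
```

With this layout, adding a field to `struct RStringHeap` grows the embedded buffer automatically, which is the property the comment is after.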

markburns 12y ago

For anyone interested, he points to an older translation of the Ruby Hacking Guide; there is a pretty much complete translation at

http://ruby-hacking-guide.github.com

alecdbrooks 12y ago

Thanks for the link! I'm not interested in Ruby per se, but it's fascinating nonetheless from the perspective of data structures and how they are implemented in C.

On a related note, I've found a much less comprehensive (but still useful) guide to Python internals: http://tech.blog.aknin.name/category/my-projects/pythons-inn....

gaius 12y ago

I suppose the thing to do is analyse your app for the average string length, and just recompile your Ruby with that. Would be even better if it was a command line parameter.

throwaway0094 12y ago

This isn't quite right. Even if your average string length is 1k+, you shouldn't change the embedded string size to 1k+. I think these objects sit on the C stack internally, which doesn't handle large objects like this well.

Also, I would guess the performance gains (from skipping malloc) would wash out the longer your average string gets -- even if the huge stack use doesn't kill your performance for some other reason (blowing the d-cache?).

ben0x539 12y ago

I don't think these strings ever sit on the C stack, except maybe if some C code/extension is being really clever. The standard representation for variables is a tagged pointer as far as I know, so I would assume that is all that goes on the stack. This optimization probably just saves another level of indirection.

pedrocr 12y ago

Why does "str2 = str" actually allocate a new RString instead of just pointing both str and str2 to the same RString?

alecdbrooks 12y ago

That's what it is doing. The additional RString structure associates the label "str2" with the characters (on the heap) allocated for the original string.

Ruby experts can correct me if I'm wrong, but when Ruby sees a name like "str2" it looks it up in a table, which points it to the RString structure. From there, it can follow the pointer to the actual array of characters, which in this case is only stored once.

pedrocr 12y ago

According to the article both str and str2 will point to the same char[] on the heap, but they are represented by two different RString objects. As you said, when you want to access str and str2 you need to look them up in a table. So why not have both entries in the table point to the same RString, instead of pointing to two different RStrings that point to the same char[]?

pothibo 12y ago

I haven't checked the code so I may be wrong but it's possible it's for multi-threading reasons.

microtonal 12y ago

MRI has a global interpreter lock, so that does not make much sense.

In fact, the diagram is simply wrong. This was rectified by the author in an article two weeks later:

http://patshaughnessy.net/2012/1/18/seeing-double-how-ruby-s...

grosbisou 12y ago

Extremely interesting. But I cannot quite understand why RSTRING_EMBED_LEN_MAX is calculated that way.

VALUE seems to be unsigned int defined via "typedef uintptr_t VALUE;" and "typedef unsigned __int64 uintptr_t;"

But I don't get why it is calculated like that. Can anyone explain?

Sharlin 12y ago

The small string buffer should be the same size as the "heap" struct so as not to waste memory -- remember, they share the memory as they're members of a union. The heap struct contains three members which, taking into account alignment restrictions, usually add up to three times the machine word size (which is basically what sizeof(uintptr_t) is). The "-1" is because C strings are null-terminated, so the maximum length is one less than the size of the buffer.

What I don't know is why they don't simply use sizeof(heap) as the buffer size.

grosbisou 12y ago

Ah that was obvious. Thanks, very clear answer.

al2o3cr 12y ago

It's using the storage in an RString struct that isn't otherwise occupied by the RBasic info:

https://github.com/ruby/ruby/blob/8f77cfb308061ff49de0a47e82...

Note the `as` union. The `heap` version has three VALUE-sized entries, so RSTRING_EMBED_LEN_MAX is calculated accordingly, with the -1 to account for the null terminator.

Dylan16807 12y ago

Good question. In a really roundabout way it manages to be the same size as the alternative struct.

Edit: I missed that part of that was another union, removed what I said about it being off on 32 bit.

I still don't understand why they go so roundabout by dividing by one and casting to int...

Sharlin 12y ago

Actually in C and C++, longs are 32 bit on most 32-bit platforms. If you need a 64-bit integer type, you need either "long long" or some implementation-specific equivalent.

gesman 12y ago

I wonder why they didn't make cut-off optimization points at 33?

When programmers don't know in advance how long name/email/input/whatever field is going to be - they just use the magic "power of two" length :)

So 32 (or 33) in this case would be more reasonable.

gliese1337 12y ago

Because of this line:

    #define RSTRING_EMBED_LEN_MAX ((int)((sizeof(VALUE)*3)/sizeof(char)-1))

23 wasn't chosen, it was calculated to be the size that would be required for a struct describing a heap string, and will actually be a different number for different architectures. Choosing to make it bigger would add unnecessary overhead to the RString struct.
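
The arithmetic can be checked directly. In this sketch the macro is copied from the line quoted above and `VALUE` is assumed to be `uintptr_t`, as mentioned elsewhere in the thread; `embed_len_max_for` is a hypothetical helper that repeats the calculation for an arbitrary word size.

```c
#include <stdint.h>

typedef uintptr_t VALUE;   /* assumption: pointer-sized, as quoted in the thread */

/* The macro as quoted in the thread. */
#define RSTRING_EMBED_LEN_MAX ((int)((sizeof(VALUE)*3)/sizeof(char)-1))

/* Hypothetical helper: the same arithmetic for a given word size in
 * bytes. Three words, minus one byte for the terminating NUL. */
static int embed_len_max_for(int word_bytes)
{
    return word_bytes * 3 - 1;
}
```

For an 8-byte word this gives 3 * 8 - 1 = 23, and for a 4-byte word 3 * 4 - 1 = 11, which is why the cut-off differs between 64-bit and 32-bit builds.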

njharman 12y ago

> When programmers don't know

They did know. And there are many more cutoffs than powers of two, depending on storage backend.

badman_ting 12y ago

Reminds me of this Mr Show sketch :) https://www.youtube.com/watch?v=RkP_OGDCLY0

throwaway0094 12y ago

Is Ruby's internal encoding UTF-8, then?

sluukkonen 12y ago

Each String in Ruby has its own encoding. But by default, it is UTF-8 these days.

jokoon 12y ago

"never use ruby" works well for me

ctrager 12y ago

The designer of a string class in any language (C++, Java, ...) has to deal with the same issue: heap allocations are slower than stack allocations. But doing a stack allocation means reserving some fixed-length memory, which is a waste if you have a lot of small strings. It's a tradeoff. The Ruby approach is reasonable. I think in Microsoft's C++ STL library, the limit is 16 rather than 32. Even with the low-level closer-to-the-metal power of C/C++, the string class designer still has to make a decision about the tradeoff.

jokoon 12y ago

Strings are overrated, they should never be used until you really need them.

drakaal 12y ago

Who needs more than 23?

drakaal 12y ago

This comment is also 23

corresation 12y ago

This all sounds rather terrible for Ruby, doesn't it? It isn't so much that the short string is faster (I'm left unclear whether it sits on the stack or the heap, though given the GC nature of Ruby and practical considerations of the language, it must be the heap), but rather that the cost of the short string is also added to the long string in the (assumed) heap allocation of the RString (which becomes larger and thus more difficult to malloc).

If this is intended to sit on the stack, then maybe. But I find that highly unlikely, especially given the timings, which look like the delta between one malloc and two and would be much more significant for a stack allocation versus a heap allocation. This is not comparable to small string optimizations for the stack in C++. Otherwise it seems like a poorly considered hack.

The string type could as easily have been dynamically allocated based upon the length of the string, where the ptr by default points inside that same allocated block. If the string is expanded it can then be realloced and the string alloced somewhere else. No waste, a single allocation, etc.
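
A minimal sketch of the alternative described above, assuming a header and the characters share one malloc'd block via a C99 flexible array member. `struct InlineString` and both functions are hypothetical names for illustration; this is not how MRI lays out strings.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: header and characters live in one allocation. */
struct InlineString {
    size_t len;
    char *ptr;           /* points at inline_buf until the string outgrows it */
    char inline_buf[];   /* flexible array member, sized at allocation time */
};

/* Allocate header plus len bytes plus a terminating NUL in one malloc. */
static struct InlineString *inline_string_new(const char *src, size_t len)
{
    struct InlineString *s = malloc(sizeof *s + len + 1);
    if (s == NULL)
        return NULL;
    s->len = len;
    s->ptr = s->inline_buf;
    memcpy(s->ptr, src, len);
    s->ptr[len] = '\0';
    return s;
}

/* Free the string, including a separately allocated buffer if the
 * characters were moved out of the inline area at some point. */
static void inline_string_free(struct InlineString *s)
{
    if (s == NULL)
        return;
    if (s->ptr != s->inline_buf)
        free(s->ptr);
    free(s);
}
```

The cost of this design is that string objects are no longer all the same size, which matters for any allocator or garbage collector that hands out fixed-size slots.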

gliese1337 12y ago

> allocation of the RString (which becomes larger and thus more difficult to malloc), and the 23 string bytes that will sit unused for longer strings.

I got the distinct impression (backed up by an actual code snippet defining the max embedded string size) that the 23 byte limit was calculated to exactly match the size of the data that would otherwise have to be stored for a heap string anyway. Thus, it doesn't actually take any extra space in the struct, and those 23 bytes do not go unused in other strings.

corresation 12y ago

You're exactly right on the union: I tried to edit out my error on that before someone noticed (my principal point is about the mallocs), but your comment appeared right as I saved. Shame be upon me.

So this holds (on a 64-bit machine) 8 bytes for the pointer, 8 bytes for the length (string not null terminated), and 8 bytes for the capacity. Alternatively, via a union, it stores 24 bytes of string (null terminated). It knows which of the two it is dealing with via a flag held separately in RBasic.

I retract my jab about memory loss, but it still sounds rather terrible. Every bit of code dealing with strings needs to validate flags on every use to determine what it is dealing with, alternate between length specified or null terminated, etc. Ugh.

ori_b 12y ago

> but rather that the cost of the short string is also added to the long string

The key point is that for short strings, you're not adding it to the cost of the long string, but instead overwriting the data that you would use to track the long string.

In other words, the memory layout you get is this:

     short:
       [RBasic]['s', 't', 'r', 'i', 'n', 'g', '\0']
     long:
       [RBasic][ length ][ dataptr ][capa or value]

Which leads to another interesting question: How does the optimization interact with null in the middle of short strings, since only long strings have a length stored? Does Ruby check for embedded null when creating a string, and disable this optimization?
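
One way an implementation can sidestep the embedded-null problem is to keep the short string's length in spare flag bits instead of relying on the terminator. The sketch below illustrates that approach only; every name in it (`struct Str`, `FLAG_NOEMBED`, `EMBED_LEN_MASK`, the accessors) is hypothetical, and `EMBED_MAX` assumes the 64-bit value of 23.

```c
#include <stddef.h>
#include <string.h>

#define EMBED_MAX       23      /* assumption: 3 * 8 - 1 on a 64-bit build */
#define FLAG_NOEMBED    0x1u    /* hypothetical: set when data lives on the heap */
#define EMBED_LEN_SHIFT 1       /* hypothetical: length bits sit above the flag */
#define EMBED_LEN_MASK  (0x1Fu << EMBED_LEN_SHIFT)   /* 5 bits hold 0..31 */

struct Str {
    unsigned flags;
    union {
        struct { size_t len; char *ptr; size_t capa; } heap;
        char ary[EMBED_MAX + 1];
    } as;
};

/* Length comes from the heap fields or from the flag bits, never from
 * scanning for a NUL, so embedded NUL bytes are harmless. */
static size_t str_len(const struct Str *s)
{
    if (s->flags & FLAG_NOEMBED)
        return s->as.heap.len;
    return (s->flags & EMBED_LEN_MASK) >> EMBED_LEN_SHIFT;
}

/* Every access has to branch on the flag to find the characters. */
static const char *str_ptr(const struct Str *s)
{
    return (s->flags & FLAG_NOEMBED) ? s->as.heap.ptr : s->as.ary;
}

/* Store up to EMBED_MAX bytes inline; returns 0 on success, -1 if too long. */
static int str_set_embedded(struct Str *s, const char *bytes, size_t len)
{
    if (len > EMBED_MAX)
        return -1;
    memcpy(s->as.ary, bytes, len);
    s->as.ary[len] = '\0';                        /* terminator kept for C callers */
    s->flags = (unsigned)len << EMBED_LEN_SHIFT;  /* NOEMBED bit left clear */
    return 0;
}
```

The accessors also show the cost raised earlier in the thread: each read of the length or the pointer goes through a flag check first.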
