Almost a perfect standard, but the prepended one-byte header is a mistake IMHO. It makes it impossible to encode when the input size is unknown. Better to encode whether the last chunk is one byte or two at the end of the stream.
Please whoever is involved with this, revise the standard to not have a header and call this existing spec a beta. Otherwise, good work.
Edit: I have opened an issue for this: https://github.com/kevinAlbs/Base122/issues/3#issue-19188159...
This was true on Vax 8200 hardware back in the day. In the same software, with Huffman decoding of JPEGs it was also fastest to create a finite state machine with an 8 bit symbol size. I suspect that is no longer true since it would kill your L1 cache and be well into your L2 cache on modern x86 machines. It is probably better to take the instruction count hit and process as bits or nibbles.
Still useful for Javascript, as the bit shift operators work on 32 bit "registers".
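A quick illustration of that coercion (runnable in Node or a browser console):

```javascript
// Bitwise operators coerce JS numbers to 32-bit integers, so bit-packing
// code effectively works with 32-bit "registers" even though JS numbers
// are 64-bit floats.
console.log(2 ** 31 | 0);        // -2147483648: wraps to signed 32-bit
console.log(2 ** 32 | 0);        // 0: bits above the 32nd are discarded
console.log((0xff << 24) >>> 0); // 4278190080: >>> reads the result as unsigned
```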
Stick with the characters that nearly everyone assumes could legitimately come up in a document, and you lower your chances of running afoul of some "creative genius" who decided, "Hey, it's unprintable so no one will try to print it, but when I do print it I want this thing to happen..."
[1] http://grepcode.com/file/repository.grepcode.com/java/root/j...
http://www.emergencevector.com/
It's pretty easy to write the decode for the 0-91 integer in Javascript.
    // Map a base 92 character back to its 0-91 integer value.
    // '!' stands in for digit 57, whose natural character would be '\'.
    function decodeBase92Digit(ch) {
      if (ch === "!") {
        return 57;
      }
      return ch.charCodeAt(0) - 35;
    }
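The inverse is equally short. This encoder is my reconstruction from the decode logic above, not the parent's code: the special case exists because digit 57 would otherwise map to backslash (char code 92), which is painful inside string literals.

```javascript
// Hypothetical inverse of the decode snippet: digit + 35 covers '#'..'~',
// except digit 57 (which would be '\', char code 92) is remapped to '!'.
function encodeBase92Digit(d) {
  return d === 57 ? "!" : String.fromCharCode(d + 35);
}

console.log(encodeBase92Digit(0));  // "#"
console.log(encodeBase92Digit(57)); // "!"
console.log(encodeBase92Digit(91)); // "~"
```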
It doesn't give you that much usable compactness over base 64, though you can easily encode a 360 degree angle with only two bits of precision lost. Also, 5 base 92 characters can fully encode 32 bits of binary data (though that's no better than base 85, which can also do it in 5 characters). I'm probably going to move to typed arrays of 32 bit values. Currently, I can encode an entire ship's data in 18 bytes, of which 4 characters are a hash id.
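The "5 characters for 32 bits" claim checks out for both alphabets; five digits suffice whenever base^5 >= 2^32:

```javascript
// Both 92^5 and 85^5 exceed 2^32, so 5 characters cover a 32-bit word.
console.log(92 ** 5);            // 6590815232
console.log(85 ** 5);            // 4437053125
console.log(2 ** 32);            // 4294967296
console.log(92 ** 5 >= 2 ** 32); // true
console.log(85 ** 5 >= 2 ** 32); // true
```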
The most efficient one is yEnc[1]. Still, the simplest ones such as base64 or good old hex may actually work better once compression comes into the picture.
Besides that, the Z85 encoding is the next runner up as a compact "string safe" encoding: https://rfc.zeromq.org/spec:32/Z85/
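For the curious, here is a sketch of Z85 encoding per the linked ZeroMQ spec: every 4 bytes become 5 characters drawn from an 85-character alphabet chosen to avoid quotes and backslash. The "HelloWorld" test vector is from the spec itself.

```javascript
// Z85 alphabet from the spec: digits, lowercase, uppercase, then punctuation
// that is safe inside string literals (no quotes, no backslash).
const Z85 =
  "0123456789abcdefghijklmnopqrstuvwxyz" +
  "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#";

function z85Encode(bytes) {
  if (bytes.length % 4 !== 0) {
    throw new Error("Z85 input must be a multiple of 4 bytes");
  }
  let out = "";
  for (let i = 0; i < bytes.length; i += 4) {
    // Accumulate 4 bytes as an unsigned 32-bit value (exact in a JS double).
    let value = 0;
    for (let j = 0; j < 4; j++) value = value * 256 + bytes[i + j];
    // Emit 5 base-85 digits, most significant first.
    let divisor = 85 ** 4;
    for (let j = 0; j < 5; j++) {
      out += Z85[Math.floor(value / divisor) % 85];
      divisor /= 85;
    }
  }
  return out;
}

// Test vector from the spec:
console.log(z85Encode([0x86, 0x4f, 0xd2, 0x6f, 0xb5, 0x59, 0xf7, 0x5b])); // "HelloWorld"
```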
Using an alternative to base64 encoded data for web pages and node is a horrible idea. If I came across a code base using that I would scream. A lot. None of my tools work with it, and now my browser has to run a bunch of js for something that's normally native and very fast. It's a 1-2kb savings per page that's going to make somebody jump off a bridge one day.
More terrifying is that the JavaScript community is so fascinated by shiny objects that thousands of people are going to use this. I'm not sure if it's more funny or terrifying.
Anyway, this encoding is still useful if you need to pass data between legacy systems. I've used HTML escapes and ASCII85 many times to get around annoying old stuff that doesn't support Unicode.
It's not that I don't agree with the general idea of your post, but I disagree with this part. If nobody ever tries something new, we'd still be in the stone age. I think base122 is a bit silly, but if it had its use and adoption became widespread, then browsers and tooling etc. would follow, just as they now support base64.
> Base-122 was created with the web in mind.

and

> As §3 shows, base-122 is not recommended to be used on gzip compressed pages, which is the majority of served web pages.

occur just a few lines from each other? I get that there are more use cases like email and such, but if you're going to create something for the web and it can't be used on the majority of web pages, that seems like a fairly large oversight/caveat.
> Base-122 encoded strings contain characters which did not seem to play well with copy-pasting.
A very important part of web development is being able to manipulate text documents. It seems that using UTF-8 in more places can reveal cracks in implementations in browsers/DEs/editors/terminals/etc.
It's very easy to write a cache-timing-safe version of base{16,32,64} encoding for use in encoding/decoding cryptographic keys in configuration files. To wit: https://github.com/paragonie/constant_time_encoding
Base-122? Not sure if it's even possible.
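The linked library is PHP, but the branchless idea behind it translates directly; here is a JS sketch of the technique (my own translation, not the library's code): each nibble is mapped to '0'-'9' or 'a'-'f' with arithmetic instead of a secret-indexed table lookup that could leak via cache timing. (Caveat: a JS engine gives no hard constant-time guarantees; this only illustrates the approach.)

```javascript
// Branchless hex encoding: no table lookups or branches indexed by secret data.
function hexEncode(bytes) {
  let out = "";
  for (const b of bytes) {
    for (const n of [b >> 4, b & 0x0f]) {
      // 87 + n gives 'a'..'f' for n >= 10; for n < 10, (n - 10) >> 8 is -1,
      // so we subtract 39 and land on '0'..'9' instead.
      out += String.fromCharCode(87 + n + (((n - 10) >> 8) & -39));
    }
  }
  return out;
}

console.log(hexEncode([0xde, 0xad, 0xbe, 0xef])); // "deadbeef"
```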
https://github.com/deckar01/digit-array/blob/master/README.m...
Each byte of base64 produces 6 bits of data, so the byte boundary aligns at 24 bits. LCM(6,8) = (6•8)/2 = 24
Each byte of base122 produces 7 bits of data, so the byte boundary aligns at 56 bits. LCM(7,8) = (7•8)/1 = 56
Edit: Due to the variable length encoding, there is no guarantee of byte alignment.
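The alignment arithmetic is easy to check directly: LCM(bits-per-char, 8) gives the smallest span that lands on both a character boundary and a byte boundary.

```javascript
// LCM(a, b) = a*b / GCD(a, b); GCD via Euclid's algorithm.
const gcd = (a, b) => (b === 0 ? a : gcd(b, a % b));
const lcm = (a, b) => (a * b) / gcd(a, b);

console.log(lcm(6, 8)); // 24 -> 4 base64 chars encode exactly 3 bytes
console.log(lcm(7, 8)); // 56 -> 8 seven-bit chars encode exactly 7 bytes
```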
From the posted article: "This leaves us with 122 legal one-byte UTF-8 characters to use"
Seems legit to me.
> Base-122? Not sure if it's even possible.
Parent is clearly referring to his previous statement, not that Base-122 itself is not possible.
If the goal is to reduce latency for small images, wouldn't it make it more sense to extend data URIs so the same base64 string can be referenced in multiple places?
Actually, as HTTP2 can effectively return multiple resources in the answer of one request, do we still need embedded images for latency reduction at all?
Also, there are still cases where compression cannot be applied, e.g. if a script naively queries innerHTML. (This wouldn't affect loading time but it could inflate the page's RAM unnecessarily)
Base122 does not have that property - as far as I could tell.
Whilst we are all, basically, living in an 8-bit world, I suspect it will be some time before people feel comfortable assuming that an 8-bit transport is viable over email.
[edit: spelling ]
Raw size: 37409
gzip(raw): 6170
gzip(base16): 7482
gzip(base64): 10675
gzip(base85): 10549
So it works better than base64, and it has the advantage of working 4 bytes at a time rather than 3. Also note that when you're feeding in binary data, the gzip sizes for raw and base-xx data get a lot closer together.
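Numbers like the ones above can be approximated for any file with a pipeline like this (the sample file and `xxd` for base16 are placeholders; exact sizes depend on the input and gzip settings):

```shell
# Generate a sample file, then compare gzip sizes of raw vs. re-encoded data.
head -c 16384 /dev/urandom > sample.bin
wc -c < sample.bin                                 # raw size
gzip -c sample.bin | wc -c                         # gzip(raw)
xxd -p sample.bin | tr -d '\n' | gzip -c | wc -c   # gzip(base16)
base64 sample.bin | gzip -c | wc -c                # gzip(base64)
```

For incompressible (random) input, expect gzip(raw) to sit just above the raw size, with the base-xx variants a little higher still.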
https://github.com/diafygi/Offset248
Pros: simple, no ratio or ending calculation, easy copy/paste, same character length as original bytes
Cons: has to be percent encoded in urls, some fonts might not render all characters
Base122 requires UTF-8, and while that's pretty common, it's not universal, so base64 can't ever go away in favor of base122.
Compressed base64 is more efficient than base122 (with or without compression).
Conclusion: Big nope.
There is an explicit call out for "double quote == bad", but single quotes are also valid property delimiters in HTML.
Still, single quotes are somewhat asking for trouble.