Base64, on the other hand, was carefully designed to survive everything from whitespace corruption to being passed through non-ASCII character sets. And then it became widely used as part of MIME.
Still more robust than uuencode though.
.-_ would have been a better choice than +/=
There was also an extended period when people used uux much as they used shar: both invite somebody else's hands into your execution state and filestore.
We were also obsessed with efficiency. base64 was "sold" as a denser encoding. I can't say if it was true overall, but just as we discussed Lempel-Ziv and gzip tuning on Usenet news, we discussed uuencode/base64 and other text wrappings.
Ned Freed, Nathaniel Borenstein, Patrik Falstrom and Robert Elz amongst others come to mind as people who worked on the baseXX encoding and discussed this on the lists at the time. Other alphabets were discussed.
uu* was the product of Mike Lesk a decade before, who was a lot quieter on the lists: He'd moved into different circles, was doing other things and not really that interested in the chatter around line encoding issues.
1) https://www.usenetarchives.com/view.php?id=comp.mail.mime&mi...
> Some of the characters used by uuencode cannot be represented in some of the mail systems used to carry rfc 822 (and therefore MIME) mail messages. Using uuencode in these environments causes corruption of encoded data. The working group that developed MIME felt that reliability of the encoding scheme was more important than compatibility with uuencode.
In a followup (same link):
> "The only character translation problem I have encountered is that the back-quote (`) does not make it through all mailers and becomes a space ( )."
A followup from that at https://www.usenetarchives.com/view.php?id=comp.mail.mime&mi... says:
> The back-quote problem is only one of many. Several of the characters used by uuencode are not present in (for example) the EBCDIC character set. So a message transmitted over BITNET could get mangled -- especially for traffic between two different countries where they use different versions of EBCDIC, and therefore different translate tables between EBCDIC and ASCII. There are other character sets used by 822-based mail systems that impose similar restrictions, but EBCDIC is the most obvious one.
> We didn't use uuencode because several members of our working group had experience with cases where uuencoded files were garbaged in transit. It works fine for some people, but not for "everybody" (or even "nearly everybody").
> The "no standards for uuencode" wasn't really a problem. If we had wanted to use uuencode, we would have documented the format in the MIME RFC.
That last comment was from Keith Moore, "the author and co-author of several IETF RFCs related to the MIME and SMTP protocols for electronic mail, among others", per https://en.wikipedia.org/wiki/Keith_Moore .
uuencode has file headers/footers, like MIME. But the actual content encoding is basically base64 with a different alphabet; both add precisely 1/3 overhead (plus up to 2 padding bytes at the end).
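For illustration, the shared 3-bytes-to-4-characters structure can be seen by encoding the same input with Python's standard library (a sketch; note that `binascii.b2a_uu` also adds uuencode's per-line length byte and trailing newline):

```python
import base64
import binascii

data = b"Man"  # the classic 3-byte example

# base64: four 6-bit groups mapped through the alphabet A-Z a-z 0-9 + /
print(base64.b64encode(data))   # b'TWFu'

# uuencode: the same four 6-bit groups, but each mapped to chr(32 + n),
# prefixed with a length character and terminated with a newline
print(binascii.b2a_uu(data))    # b'#36%N\n'
```

Same 6-bit grouping in both cases; only the output alphabet (and the per-line framing) differs.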
Can anyone explain why BinHex remained "popular" in online Mac communities through to the early 2000s? Why couldn't Macs download "real" binary files back then?
So a common hack was to binhex the .sit file. Binhex was originally designed to make files 7-bit clean, but had the side effect that it bundled the resource fork and the data fork together.
Later versions of StuffIt could open .sit files which lacked the resource fork just fine, but by then .zip was starting to become more common.
I don't really understand why macOS users like this "simple" installation, because when you "uninstall" the app, it leaves all the trash in your system without a chance to clean up. And implying that macOS application somehow will not do "who-knows-what" to your system is just wrong. Docker Desktop is "simple", yet the first thing it does after launch is installing "who-knows-what".
Whereas on macOS, installation is trivial, but then the application sets up stuff upon first run, and that is really opaque, with no way of properly uninstalling the app unless there is a dedicated uninstaller.
But yeah, the simple case is quite nice.
It's redundant since this info can be fully inferred from the length of the stream.
Even for concatenations it is not necessary to require it, since you must still know the length of each substream (and "=" does not always appear, so it is not usable as a separator).
There's no way that using "=" instead of per-byte length checking gains any speed: to prevent reading out of bounds you must check the length for every byte anyway, since you can't trust the input to be a multiple of 4 in length.
It could only make sense if you were somehow required to read 4 bytes at once and couldn't possibly read less, but what platform is like that?
The padding character is not essential for decoding, since the number of missing bytes can be inferred from the length of the encoded text. In some implementations the padding character is mandatory, while in others it is not used. An exception in which padding characters are required is when multiple Base64-encoded files have been concatenated.

This shows the binary, base64 without padding, and base64 with padding:
NULL --> AA --> AA==
NULL NULL --> AAA --> AAA=
NULL NULL NULL --> AAAA --> AAAA
As you can see, all the padding does is make the base64 length a multiple of 4. You already get uniquely distinguishable encodings for the 3 cases (one, two or three NULL bytes) without the ='s, so they are unnecessary.
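That length-based inference is easy to sketch (helper name is hypothetical): re-pad the input from its length before handing it to a stock decoder.

```python
import base64

def b64decode_unpadded(s: str) -> bytes:
    # The number of missing padding chars is implied by len(s) % 4:
    # 0 -> none, 2 -> "==", 3 -> "="; a remainder of 1 is never valid.
    if len(s) % 4 == 1:
        raise ValueError("invalid base64 length")
    return base64.b64decode(s + "=" * (-len(s) % 4))

print(b64decode_unpadded("AA"))    # b'\x00'
print(b64decode_unpadded("AAA"))   # b'\x00\x00'
print(b64decode_unpadded("AAAA"))  # b'\x00\x00\x00'
```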
Refer to the "Examples" section of the Wikipedia page.
But I think it's likely just poor design taste.
I'm not sure I understand this part. You can decode aGVsbG8=IHdvcmxk, what do you need to know?
I only mentioned the concatenation because Wikipedia claims this use case requires padding while in reality it doesn't.
Now, 25+ years later, I have some answers - thanks!
If you escape any disallowed character in the usual way for a string ("\0", "\r", "\n", "\\", "\"", "\uD800") then there is no decoding process, all the data in the string will be correct.
If you throw data that is compressed in there, you're unlikely to get very many zeroes, so you can just hope that there aren't too many unpaired surrogates in your binary data, because those get inflated to 6 times their size.
Note that this operates on 16-bit values. In order to see a null, \r, \n, \\ and ", the most significant byte must also be zero, and in order for your data to contain a surrogate pair, you're looking at the two bytes taken together. When the data is compressed, the patterns are less likely.
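A rough sketch of that scheme (names hypothetical; for simplicity this version escapes every surrogate code unit, whereas the scheme described above would let valid surrogate pairs pass through):

```python
# Pass 16-bit code units through a string mostly verbatim, escaping
# only the handful of disallowed values. Escaped surrogates are the
# worst case: one 2-byte unit becomes 6 characters.
ESCAPES = {0x00: "\\0", 0x0D: "\\r", 0x0A: "\\n",
           0x5C: "\\\\", 0x22: "\\\""}

def escape_units(units):
    out = []
    for u in units:
        if u in ESCAPES:
            out.append(ESCAPES[u])
        elif 0xD800 <= u <= 0xDFFF:    # surrogate range
            out.append("\\u%04X" % u)  # 2 bytes -> 6 characters
        else:
            out.append(chr(u))
    return "".join(out)

# Most values cost a single character; only the escapes expand.
print(escape_units([0x0041, 0x000A, 0xD800]))
```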
Using the array-indexing method, the noncontiguity of the characters doesn’t matter, and the processing is also independent of the character encoding (e.g. works exactly the same way in EBCDIC).
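A sketch of that array-indexing approach: a 256-entry table indexed by the code unit itself, so neither the alphabet's contiguity nor the host character encoding matters (padding handling omitted).

```python
# Build a 256-entry lookup table mapping code unit -> 6-bit value.
ALPHABET = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
DECODE = [-1] * 256
for i, c in enumerate(ALPHABET):
    DECODE[c] = i

def decode_quad(q: bytes) -> bytes:
    # Decode one full 4-char group into 3 bytes
    v = 0
    for c in q:
        v = (v << 6) | DECODE[c]
    return v.to_bytes(3, "big")

print(decode_quad(b"TWFu"))  # b'Man'
```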
https://datatracker.ietf.org/doc/html/rfc2045#section-6.8 says:
This subset has the important property that it is represented
identically in all versions of ISO 646, including US-ASCII, and all
characters in the subset are also represented identically in all
versions of EBCDIC. Other popular encodings, such as the encoding
used by the uuencode utility, Macintosh binhex 4.0 [RFC-1741], and
the base85 encoding specified as part of Level 2 PostScript, do not
share these properties, and thus do not fulfill the portability
requirements a binary transport encoding for mail must meet.
If you want to learn why ASCII is the way it is, try "The Evolution of Character Codes, 1874-1968" at https://archive.org/details/enf-ascii/mode/2up by Eric Fischer (an HN'er). My reading is contiguous A-Z was meant for better compatibility with 6-bit use.

Considerably stranger in regard to contiguity was EBCDIC, but it too made sense in terms of its technological requirements, which centered around Hollerith punch cards. https://en.wikipedia.org/wiki/EBCDIC
There are numerous other examples where a lack of knowledge of the technological landscape of the past leads some people to project unwarranted assumptions of incompetence onto the engineers who lived under those constraints.
(Hmmm ... perhaps I should have read this person's profile before commenting.)
And the performance claims are absurd, e.g.,
"A simple and extremely common int->hex string conversion takes twice as many instructions as it would if ASCII was optimized for computability."
WHICH conversion, uppercase hex or lowercase hex? You can't have both. And it's ridiculous to think that the character set encoding should have been optimized for either one or that it would have made a measurable net difference if it had been. And instruction counts don't determine speed on modern hardware. And if this were such a big deal, the conversion could be microcoded. But it's not--there's no critical path with significant amounts of binary to ASCII hex conversion.
"There are also inconsistencies like front and back braces/(angle)brackets/parens not being convertible like the alphabet is."
That is not a usable conversion. Anyone who has actually written parsers knows that the encodings of these characters are not relevant ... nothing would have been saved in parsing "loops". Notably, programming language parsers consume tokens produced by the lexer, and the lexer processes each punctuation character separately. Anything that could be gained by grouping punctuation encodings can be done via the lexer's mapping from ASCII to token values. (I have actually done this to reduce the size of bit masks that determine whether any member of a set of tokens has been encountered. I've even, in my weaker moments, hacked the encodings so that <>, {}, [], and () are paired--but this is pointless premature optimization.)
Again, this fellow's profile is accurate.
Hardware has advanced, but software depends on standards and conventions formulated for far less capable hardware, and that's a problem.
The efficiency of string processing/generation is hugely important in terms of global energy consumption.
A simple and extremely common int->hex string conversion takes twice as many instructions as it would if ASCII was optimized for computability.
Bounds-checking for the English alphabet requires either an upfront normalization or twice the checking, so 50-100% more instructions for that.
There are also inconsistencies like front and back braces/(angle)brackets/parens not being convertible like the alphabet is.
[({< <-> >})] would have been just as or more useful than the alphabet being convertible and saved a few instructions in common parsing loops.
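For what it's worth, the branch in question looks something like this (uppercase variant; a 16-entry lookup table is the usual branch-free alternative):

```python
# Nibble -> hex digit. Because ASCII places '0'..'9' (0x30-0x39) and
# 'A'..'F' (0x41-0x46) in non-adjacent ranges, a branch or a lookup
# table is needed; a gap-free layout would allow a single add.
def nibble_to_hex(n: int) -> str:
    return chr(ord("0") + n) if n < 10 else chr(ord("A") + n - 10)

print("".join(nibble_to_hex(n) for n in (0xC, 0xA, 0xF, 0xE)))  # CAFE
```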
> I never questioned the competence of past engineers
False just based on your opening volley of toxic spew. Backwards compatibility is an engineering decision and it was made by very competent people to interoperate with a large number of systems. The future has never been fucked over.
You seem to not understand how ASCII is encoded. It is primarily based on bit-groups where the numeric ranges for character groupings can be easily determined using very simple (and fast) bit-wise operations. All of the basic C functions to test single-byte characters such as `isalpha()`, `isdigit()`, `islower()`, `isupper()`, etc. use this fact. You can then optimize these into grouped instructions and pipeline them. Pull up `man ascii` and pay attention to the hex encodings at the start of all the major symbol groups. This is still useful today!
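For example, a sketch of the kind of bit tests meant here (not the actual libc implementations, which typically go through a classification table):

```python
def is_upper(c: str) -> bool:
    return ord("A") <= ord(c) <= ord("Z")

def to_lower(c: str) -> str:
    # Setting bit 0x20 maps 'A'..'Z' onto 'a'..'z'
    return chr(ord(c) | 0x20) if is_upper(c) else c

def is_digit(c: str) -> bool:
    # '0'..'9' occupy 0x30..0x39: high nibble 3, low nibble <= 9
    b = ord(c)
    return (b & 0xF0) == 0x30 and (b & 0x0F) <= 9

print(to_lower("M"), is_digit("7"), is_digit(":"))  # m True False
```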
No, the biggest fuckage of the internet age has been Unicode which absolutely destroys this mapping. We no longer have any semblance of a 1:1 translation between any set of input bytes and any other set of character attributes. And this is just required to get simple language idioms correct. The best you can do is use bit-groupings to determine encoding errors (ala UTF-8) or stick with a larger translation table that includes surrogates (UTF-16, UTF-32, etc). They will all suffer the same "performance" problem called the "real world".