
simonask9mo ago

This is American imperialism at its worst. I'm serious.

Lots of people around the world learn programming from sources in their native language, especially early in their career, or when software development is not their actual job.

Enforcing ASCII is the same as enforcing English. How would you feel if all cooking recipes were written in French? If all music theory was in Italian? If all industrial specifications were in German?

It's fine to have a dominant language in a field, but ASCII is a product of technical limitations that we no longer have. UTF-8 has been an absolute godsend for human civilization, despite its flaws.


0x000xca0xfe9mo ago

Well I'm not American and I can tell you that we do not see English source code as imperialism.

In fact it's awesome that we have one common very simple character set and language that works everywhere and can do everything.

I have only encountered source code using my native language (German) in comments or variable names in highly unprofessional or awful software and it is looked down upon. You will always get an ugly mix and have to mentally stop to figure out which language a name is in. It's simply not worth it.

Please stop pushing this UTF-8 everywhere nonsense. Make it work great on interactive/UI/user facing elements but stop putting UTF-8-only restrictions in low-level software. Example: Copied a bunch of ebooks to my phone, including one with a mangled non-UTF-8 name. It was ridiculously hard to delete the file as most Android graphical and console tools either didn't recognize it or crashed.

flohofwoe9mo ago

> Please stop pushing this UTF-8 everywhere nonsense.

I was with you until this sentence. UTF-8 everywhere is great exactly because it is ASCII-compatible (e.g. all ASCII strings are automatically also valid UTF-8 strings, so UTF-8 is a natural upgrade path from ASCII) - both are just encodings for the same UNICODE codepoints. ASCII just cannot go beyond the first 128 codepoints (0..127), but that's where UTF-8 comes in, and in a way that's backward compatible with ASCII - which is the one ingenious feature of the UTF-8 encoding.

0x000xca0xfe9mo ago

I'm not advocating for ASCII-everywhere, I'm for bytes-everywhere.

And bytes can conveniently fit both ASCII and UTF-8.

If you want to restrict your programming language to ASCII for whatever reason, fine by me. I don't need "let wohnt_bei_Böckler_STRAẞE = ..." that much.

But if you allow full 8-bit bytes, please don't restrict them to UTF-8. If you need to gracefully handle non-UTF-8 sequences graphically show the appropriate character "�", otherwise let it pass through unmodified. Just don't crash, show useless error messages or in the worst case try to "fix" it by mangling the data even more.
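A minimal Python sketch of the "bytes-everywhere" behaviour described above, assuming a tool that receives file names as raw bytes. The helper names (`display_name`, `to_text`, `to_bytes`) and the sample file name are made up for illustration; the `replace` and `surrogateescape` error handlers are standard Python codec features.

```python
# Hypothetical file name as raw bytes: "Böckler" stored as Latin-1,
# so the byte 0xF6 makes it invalid UTF-8.
raw_name = b"B\xf6ckler.epub"


def display_name(name: bytes) -> str:
    """Lossy, for showing to a human only: invalid bytes become U+FFFD."""
    return name.decode("utf-8", errors="replace")


def to_text(name: bytes) -> str:
    """Lossless: invalid bytes are carried through as lone surrogates."""
    return name.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Inverse of to_text: recovers the original bytes exactly."""
    return text.encode("utf-8", errors="surrogateescape")


# Show the replacement character to the user, but keep operating on the
# untouched bytes (e.g. when deleting or renaming the file).
shown = display_name(raw_name)
round_tripped = to_bytes(to_text(raw_name))
```

The point of the split is that only the display path is lossy; anything that touches the file system keeps the original byte sequence.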

flohofwoe9mo ago

> "let wohnt_bei_Böckler_STRAẞE"

This string cannot be encoded as ASCII in the first place.

> But if you allow full 8-bit bytes, please don't restrict them to UTF-8

UTF-8 has no 8-bit restrictions... You can encode any 21-bit UNICODE codepoint with UTF-8.

It sounds like you're confusing ASCII, Extended ASCII and UTF-8:

- ASCII: 7-bits per "character" (e.g. not able to encode international characters like äöü) but maps to the lower 7-bits of the 21-bits of UNICODE codepoints (e.g. all ASCII character codes are also valid UNICODE code points)

- Extended ASCII: 8-bits per "character" but the interpretation of the upper 128 values depends on a country-specific codepage (e.g. the interpretation of a byte value in the range between 128 and 255 is different between countries and this is what causes all the mess that's usually associated with "ASCII". But ASCII did nothing wrong - the problem is Extended ASCII - this allows to 'encode' äöü with the German codepage but then shows different characters when displayed with a non-German codepage)

- UTF-8: a variable-width encoding for the full range of UNICODE codepoints, uses 1..4 bytes to encode one 21-bit UNICODE codepoint, and the 1-byte encodings are identical with 7-bit ASCII (e.g. when the MSB of a byte in a UTF-8 string is not set, you can be sure that it is a character/codepoint in the ASCII range).

Out of those three, only Extended ASCII with codepages is 'deprecated' and should no longer be used, while ASCII and UTF-8 are both fine since any valid ASCII encoded string is indistinguishable from that same string encoded as UTF-8, e.g. ASCII has been 'retconned' into UTF-8.
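The ASCII/UTF-8 relationship described in this comment can be checked with a few lines of Python. This is only an illustrative sketch; the helper names `utf8_width` and `is_ascii_byte` are invented here, and the sample characters are arbitrary.

```python
def utf8_width(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single code point."""
    return len(ch.encode("utf-8"))


def is_ascii_byte(b: int) -> bool:
    """A byte with the most significant bit clear is an ASCII character."""
    return b & 0x80 == 0


# One sample per encoded length: A, ä, ẞ, and an emoji outside the BMP.
samples = ["A", "\u00e4", "\u1e9e", "\U0001F600"]
widths = [utf8_width(ch) for ch in samples]

# Pure ASCII text produces identical bytes under both encodings.
ascii_text = "let wohnt_bei = 1"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# In multi-byte sequences every byte has the MSB set, so scanning a UTF-8
# string for an ASCII delimiter never hits the middle of another character.
multibyte = "\u00e4\u1e9e".encode("utf-8")
no_false_ascii = not any(is_ascii_byte(b) for b in multibyte)

# The identifier from the thread cannot be encoded as ASCII at all.
try:
    "wohnt_bei_B\u00f6ckler_STRA\u1e9eE".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```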


numpad09mo ago

UTF-8 everywhere is not great and UTF-8 in practice is hardly ASCII-compatible. UTF-8 in source code and file paths outside the pure ASCII range breaks a lot of things, especially on non-English systems, due to legacy dependencies, ironically.

Sure, it's backward compatible, as in ASCII handling codes work on systems with UTF-8 locales, but how important is that?

flohofwoe9mo ago

> as in ASCII handling codes work on systems with UTF-8 locales, but how important is that?

It's only Windows which is stuck in the past here, and Microsoft had 3 decades to fix that problem and migrate away from codepages to locale-agnostic UTF-8 (UTF-8 was invented in 1992).

Certhas8mo ago

So source code needs to be UTF8 because it contains comments and string literals. And filenames need to be bytes. It seems that both of these are orthogonal to the question of whether non-ASCII code is desirable...

Restricting the program part to ASCII is fine for me, but as a fellow German it's also important to recognize that we don't lose much by not having ä cömplete sät of letters. Everyone can write comprehensible German using ASCII characters only. So I would listen to what people from languages that really don't fit into ASCII have to say.

BobbyTables29mo ago

I once saw an electrical schematic from a non-English speaking designer.

None of the signals were intuitive because they weren’t the typical English abbreviations!

sussmannbaka9mo ago

You say this because your native language broadly fits into ascii and you would sing a different tune if it didn’t.

jibal9mo ago

It's neither American nor imperialism -- those are both category mistakes.

Andreas Rumpf, the designer of Nim, is Austrian. All the keywords of Nim are in English, the library function names are in English, the documentation is in English, Rumpf's book Mastering Nim is in English, the other major book for the language, Nim In Action (written by Dominik Picheta, nationality unknown but not American) is in English ... this is not "American imperialism" (which is a real thing that I don't defend), it's for easily understandable pragmatic reasons. And the language parser doesn't disallow non-ASCII characters but it doesn't treat them linguistically, and it has special rules for casefolding identifiers that only recognize ASCII letters, hobbling the use of non-ASCII identifiers because case distinguishes between types and other identifiers. The reason for this lack of handling of Unicode linguistically is simply to make the lexer smaller and faster.

rurban9mo ago

> The reason for this lack of handling of Unicode linguistically is simply to make the lexer smaller and faster.

No, it is actually for security reasons. Once you allow non-ASCII identifiers, identifiers will become non identifiable. Only zig recognized that. Nim allows insecure identifiers. https://github.com/rurban/libu8ident/blob/master/doc/c11.md#...
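To make the "non identifiable identifiers" concern concrete, here is a small Python sketch of two names that render almost identically but are different strings. It is not how libu8ident, Zig or Nim handle this; `is_ascii_identifier` is a made-up helper that only shows the simplest possible policy (reject anything outside ASCII).

```python
import unicodedata

latin_a = "a"            # U+0061
cyrillic_a = "\u0430"    # U+0430, looks the same in most fonts

honest = "paypal"
spoofed = "p" + cyrillic_a + "ypal"   # visually near-identical to `honest`


def is_ascii_identifier(name: str) -> bool:
    """Strictest policy: a valid identifier made of ASCII characters only."""
    return name.isascii() and name.isidentifier()


# Both strings are legal identifiers under Unicode-aware rules...
both_legal = honest.isidentifier() and spoofed.isidentifier()

# ...and normalization does not merge them, since the two letters belong
# to different scripts and have no compatibility mapping to each other.
still_distinct = (
    unicodedata.normalize("NFKC", spoofed)
    != unicodedata.normalize("NFKC", honest)
)

names = [unicodedata.name(latin_a), unicodedata.name(cyrillic_a)]
```

An ASCII-only rule sidesteps the problem entirely; a Unicode-aware language needs extra checks (for example script restrictions) to get similar guarantees.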

jibal9mo ago

Reading is fundamental. I was referring to the Nim lexer. Obviously the reason that it "allows insecure identifiers" is not "actually for security reasons". It is, as I stated, for reasons of performance ... I know this from reading the code and the author's statements.

rurban9mo ago

Yes, you are right. Andi didn't care at all, same as PHP.

jibal9mo ago

P.S. The response is a https://en.wikipedia.org/wiki/Motte-and-bailey_fallacy

The motte: non-ASCII identifiers should be allowed

The bailey: disallowing non-ASCII identifiers is American imperialism at its worst

simonaskOP9mo ago

I mean, the keywords of a programming language have to be in some language (unless you go the cursed route of Excel). I'm arguing against the position that non-ASCII identifiers should be disallowed.

lsaferite9mo ago

> I'm arguing against the position that non-ASCII identifiers should be disallowed.

Maybe I'm tired, but I've read this multiple times and can't quite figure out your desired position.

I *think* you are in favor of non-ASCII identifiers?

Like I said, I must be tired.

jibal9mo ago

He says that disallowing non-ASCII identifiers is "American imperialism at its worst".

account429mo ago

Actually, it would be great to have a lingua franca in every field that all participants can understand. Are you also going to complain that biologists and doctors are expected to learn some rudimentary Latin? English being dominant in computing is absolutely a strength and we gain nothing by trying to combat that. Having support for writing your code in other languages is not going to change that most libraries will use English and most documentation will be in English and most people you can ask for help will understand English. If you want to participate and refuse to learn English you are only shooting yourself in the foot - and if you are going to learn English you may as well do it from the beginning. Also due to the dominance of English and ASCII in computing history, most languages already have ASCII-alternatives for their writing so even if you need to refer to non-English names you can do that using only ASCII.

simonaskOP9mo ago

Well, the problem is that what you are advocating is also that knowing Latin would be a prerequisite for studying medicine, which it isn't anywhere. That's the equivalent. Doctors learn a (very limited) Latin vocabulary as they study and work.

You severely underestimate how far you can get without any real command of the English language. I agree that you can't become really good without it, just like you can't do haute cuisine without some French, but the English language is a huge and unnecessary barrier to entry that you would put in front of everyone in the world who isn't submerged in the language from an early age.

Imagine learning programming using only your high school Spanish. Good luck.

nitwit0059mo ago

You don't need to become fluent in Greek and Latin, but if you want to be able to read your patient's diagnosis, you're absolutely going to need to know the terms used. The standard names are in those languages.

And frequently, there is no other name. There are a lot of diseases, and no language has names for all of them.

simonaskOP9mo ago

Sure, you can also look them up though, because it is a limited vocabulary.

Identifiers in code are not a limited vocabulary, and understanding the structure of your code is important, especially so when you are in the early stages of learning.

numpad09mo ago

> Imagine learning programming using only your high school Spanish. Good luck.

This + translated materials + locally written books is how STEM fields work in East Asia, the odds of success shouldn't be low. There just needs to be enough population using your language.

flohofwoe9mo ago

Calm down, ASCII is a UNICODE compatible encoding for the first 128 UNICODE code points (which map directly to the entire ASCII range). If you need to go beyond that, just 'upgrade' to UTF-8 encoding.

UNICODE is essentially a superset of ASCII, and the UTF-8 encoding also contains ASCII as a compatible subset (e.g. for the first 128 UNICODE code points, a UTF-8 encoded string is byte-by-byte compatible with the same string encoded in ASCII).

Just don't use any of the Extended ASCII flavours (e.g. "8-bit ASCII with codepages") - or any of the legacy 'national' multibyte encodings (Shift-JIS etc...) because that's how you get the infamous `?????` or `♥♥♥♥♥` mismatches which are commonly associated with 'ASCII' (but this is not ASCII, but some flavour of Extended ASCII decoded with the wrong codepage).
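A short Python sketch of the mismatch described above: the same bytes read through two different codepages give different text, while UTF-8 round-trips cleanly. The choice of `cp1252` (Western European Windows) and `cp437` (original IBM PC) is just an example pair, not something taken from the thread.

```python
umlauts = "\u00e4\u00f6\u00fc"   # äöü

# Written by a program that assumes the Windows Western European codepage:
stored = umlauts.encode("cp1252")          # three bytes: E4 F6 FC

# Read back by a program that assumes a different 8-bit codepage:
wrong_codepage = stored.decode("cp437")    # same bytes, different characters

# The classic mojibake in the other direction: UTF-8 bytes shown as cp1252.
mojibake = umlauts.encode("utf-8").decode("cp1252")

# UTF-8 needs no out-of-band codepage information to round-trip.
round_trip = umlauts.encode("utf-8").decode("utf-8")
```

Nothing in the stored bytes says which codepage produced them, which is why the reader can only guess.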

ksenzee9mo ago

I don’t see much difference between the amount of Italian you need for music and the amount of English you need for programming. You can have a conversation about it in your native language, but you’ll be using a bunch of domain-specific terms that may not be in your native language.

simonaskOP9mo ago

I agree, but we're talking about identifiers in code you write yourself here. Not the limited vocabulary of keywords, which are easy to memorize in any language. Standard libraries may trip you up, but documentation for those may be available in your native language.

nkrisc9mo ago

There was a time when most scientific literature was written in French. People learned French. Before that it was Latin. People learned Latin.

tehjoker9mo ago

This is true, but it’s important to recognize that this was because of the French (Napoleon) and Roman empires and Christianity, just as the brutal American and UK empires created these circumstances today.

wredcoll9mo ago

The napoleonic empire lasted about 15 years, so that's a bit of a stretch.

More relevantly though, good things can come from people who also did bad things; this isn't to justify doing bad things in hopes something good also happens, but it doesn't mean we need to ideologically purge good things based on their creators.

schrodinger9mo ago

American Imperialism has absolutely resulted in some horrible things, but I hardly think that ASCII is one of them.

ASCII wasn't "imperialism," it was pragmatism. Yes, it privileged English -- but that's because the engineers designing it _spoke_ English and the US was funding + exporting most of the early computer and networking gear. The US Military essentially gave the world TCP/IP (via DARPA) for free!

Maybe "cultural dominance", but "imperialism at its worst" is a ridiculous take.
