Those aren't "ASCII code pages". "Code page" is a term for a character encoding. These days it is mostly used by Microsoft in its Windows operating systems, but the name goes back to IBM, whose manuals would dedicate a whole page to each such encoding. Many code pages reserve the first 128 codes for the same things ASCII put there, but that doesn't make them "ASCII" code pages.
"The upper 128 bits from the ASCII table" is presumably a mistake; maybe it means the upper 128 code values?
It's called "us-ascii" because that's the name IANA assigned to the ASCII encoding. IANA keeps registries of a lot of stuff... here's the one with character sets in it: https://www.iana.org/assignments/character-sets/character-se...
Probably, the bullets at the start say "upper 128 positions".
Apologies for the minor nitpick: и is in Cyrillic, not Russian. Cyrillic is the script, Russian is the language. There are other languages that use Cyrillic besides Russian (and the script itself was developed around Greece/Bulgaria before Russia even existed).
For Arabic it's the language and the script so you're OK there!
Even with the caveat in parentheses, this is quite misleading. For example, the following line is some text, with no specified encoding:
> hello world
Now, while it's true this could be some exotic encoding, or maybe just random binary data, I wouldn't call it impossible to decipher. More accurate would be "impossible to decipher with 100% certainty". The same issue exists with Protocol Buffers, or any format that is not self-describing. The data is not a black box, it's just annoying to deal with.
Sorry, but I don't agree with this either. You can, as a human being (or smart enough AI), look at the result in both encodings, and make an educated guess as to which is correct. If they are wholly different as you say, then one should be gibberish, and one should map to some dictionary.
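To make the "educated guess" idea concrete, here is a minimal sketch in Python. It decodes the same bytes under each candidate encoding and keeps whichever result looks most like real words. The `KNOWN_WORDS` set and the helper names `plausibility` and `best_guess` are hypothetical, made up for this illustration; a real detector would use a proper dictionary or character statistics.

```python
# Tiny stand-in for a real dictionary (hypothetical, for illustration only).
KNOWN_WORDS = {"hello", "world"}

def plausibility(text: str) -> float:
    """Fraction of whitespace-separated tokens found in KNOWN_WORDS."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in KNOWN_WORDS for token in tokens) / len(tokens)

def best_guess(raw: bytes, candidates: list[str]) -> str:
    """Return the candidate codec whose decoding looks most like real words."""
    scores = {}
    for codec in candidates:
        try:
            scores[codec] = plausibility(raw.decode(codec))
        except UnicodeDecodeError:
            scores[codec] = -1.0  # these bytes are not valid in this codec
    return max(scores, key=scores.get)

# "hello world" stored as EBCDIC (cp037), with the encoding "forgotten".
mystery = "hello world".encode("cp037")
guess = best_guess(mystery, ["latin_1", "cp037"])
print(guess)
```

This only works when one decoding is clearly gibberish; for closely related encodings the scores can tie, which is the "not 100% certain" part.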
> ASCII code pages map the upper 128 positions (0x7F:0xFF) of the ASCII byte. Each page holds a different character set. This is one way internationalisation can be achieved.
This is at best a poor explanation, and at worst outright wrong. The actual key thing is charset--there's a wide variety of charsets. Because ASCII is an inherently 7-bit charset, a lot of 8-bit charsets were created by keeping the first 128 values as ASCII and mapping different characters into the remaining 128 values. IBM (I believe) came up with the term 'code page' to refer to the different character sets they came up with.
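A small Python sketch of that point, using codecs that ship with the standard library: the two low bytes decode the same way everywhere, while the single high byte 0xE8 means something different in each code page. The choice of byte and of these three code pages is mine, purely for illustration.

```python
# Three bytes: "h", "i", then one byte from the upper half (0xE8).
raw = bytes([0x68, 0x69, 0xE8])

# The same bytes decoded under three different code pages.
decoded = {codec: raw.decode(codec) for codec in ("cp1252", "cp1251", "cp437")}

for codec, text in decoded.items():
    print(f"{codec}: {text}")

# Plain ASCII is 7-bit, so it assigns no meaning to 0xE8 at all.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

Every decoding starts with "hi"; only the last character changes (Latin è, Cyrillic и, Greek Φ).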
> Unicode provides a unique code for every character, regardless of the language.
That's not really true. Unicode keeps track of "code points". Several code points may together make up what we think of as a character--consider that something like à can consist of either a precomposed "à" code point or an "a" + "` diacritic" sequence. Thus there's an entire concern about Unicode normalization that a lot of people prefer to sweep under the rug.
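For anyone who wants to see the normalization issue directly, here is a short Python sketch using the standard `unicodedata` module. The variable names are just for illustration.

```python
import unicodedata

precomposed = "\u00e0"  # LATIN SMALL LETTER A WITH GRAVE, one code point
decomposed = "a\u0300"  # "a" followed by COMBINING GRAVE ACCENT, two code points

# The two strings render the same but do not compare equal as-is.
equal_raw = precomposed == decomposed

# After normalizing both to the same form, they do compare equal.
equal_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
equal_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

print(len(precomposed), len(decomposed), equal_raw, equal_nfc, equal_nfd)
```

Comparing, hashing, or searching text without picking a normalization form first is where this tends to bite.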
> When creating a new file using touch, your computer will interpret that file as binary file.
Okay, what's happening here is you've got a command, the file command, whose entire job is to look at a file and guess what the contents of that file are. For text files, part of that guessing process often involves guessing what the character encoding of the file is. That guessing is not always correct--there's the infamous "the printer can't print on Tuesdays" bug that was caused by the date string in the printer file, on Tuesdays, causing the file command to think it was an entirely different type of file [1]. There's another famous bug where starting a text file with a 4-letter word, two 3-letter words, and a 5-letter word would cause Notepad to think the text file was in UTF-16 instead of ASCII [2].
With regard to guessing charsets, this is not always a particularly feasible process. Some charsets are more reliable to guess than others. UTF-8, for example, tends to stick out--continuation bytes form a pattern that most charsets are unlikely to keep up with for long. Guessing ASCII for text that contains no bytes with the high bit set is pretty safe, since almost every charset is designed with ASCII-subset-safety in mind, and those that aren't (EBCDIC, UTF-7, UTF-16/UTF-32) are found in relatively constrained environments [3].
[1] https://beza1e1.tuxen.de/lore/print_on_tuesday.html
[2] https://en.wikipedia.org/wiki/Bush_hid_the_facts
[3] ISO-2022-* charsets are mode-switching, relying on the ESC character as part of the sequence to switch to different encodings. So for reliable ASCII detection you also have to treat the ESC character as a sign that the text may not be plain ASCII.
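The heuristics above can be sketched in a few lines of Python. This is a rough illustration, not how the file command actually works: the function name `guess_encoding` and its three possible answers are my own, and it deliberately ignores UTF-7, UTF-16/UTF-32 and EBCDIC.

```python
def guess_encoding(raw: bytes) -> str:
    """Very rough guess: 'ascii', 'utf-8', or 'unknown'."""
    if b"\x1b" in raw:
        # ESC may introduce an ISO-2022 mode switch, so don't call it ASCII.
        return "unknown"
    if all(byte < 0x80 for byte in raw):
        # No high bit set anywhere: ASCII is a fairly safe guess.
        return "ascii"
    try:
        # Random 8-bit text rarely forms valid UTF-8 sequences for long.
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"

print(guess_encoding(b"hello world"))
print(guess_encoding("\u00e0 la carte".encode("utf-8")))
print(guess_encoding("\u00e0 la carte".encode("latin_1")))
print(guess_encoding("\u65e5\u672c".encode("iso2022_jp")))
```

The "unknown" answer is the honest one for most legacy 8-bit charsets: telling them apart needs language-level statistics, not just byte patterns.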
Moreover, this is not strictly true even after the generous reinterpretation (assuming "unique-under-normalization", "code point sequence", "abstract character" and "script") because Unicode still doesn't encode some scripts [1].
*In my particular example, you can say Unicode doesn't support Japanese, /or/ doesn't support Chinese. The answer depends on what font you're using. "Han Unification" affects more than just those two languages, but that's what I have experience with.
> consider that something like à can consist of either a precomposed "à" code point or an "a" + "` diacritic" sequence
If Unicode provides a precomposed combination, doesn't that mean it does in fact have a code point for every character, regardless of also offering combining diacritic codes?
A simple example is emoji, where there isn't a precomposed code point for every combination.
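As a quick illustration in Python (my choice of emoji, nothing special about it): a thumbs-up with a skin tone is one visible symbol but two code points, and normalization has no single precomposed code point to collapse it into.

```python
import unicodedata

# THUMBS UP SIGN followed by a skin tone modifier: one visible emoji.
thumbs_up_medium = "\U0001F44D\U0001F3FD"

names = [unicodedata.name(ch) for ch in thumbs_up_medium]
nfc = unicodedata.normalize("NFC", thumbs_up_medium)

print(len(thumbs_up_medium), names)
print(len(nfc))  # still two code points after NFC
```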