That's like "knowing" the truth. How?
I have received some very interesting files that made Python spew Unicode errors, again and again. Why? Not only did I not "know" what encoding they were in -- the encoding changed at different points in the stream of bytes. I call this "slamming bytes together" because somewhere along the line, someone's program did exactly that.
Everything is simple -- until it isn't.
If you start guessing the encoding, at best it won't work in some cases, at worst you are introducing security vulnerabilities. You can try, but there is just no way to do it right.
http://michaelthelin.se/security/2014/06/08/web-security-cro...
There are lots of pleasing aspects to this choice. It's ASCII compatible of course, so anything that was actually ASCII is still ASCII, anything that was almost ASCII is just ASCII with U+FFFD where it deviated.
The replacement character resolutely isn't any of the specific things, nor any of the generic classes of thing you might be expected to treat differently for security reasons. It isn't a number, or a letter (of either "case"), it isn't white space, and it certainly isn't any of the separators, escapes or quote markers like ? or \ or + or . or _ or...
... yet it is still inside the BMP so it won't trigger weird (perhaps less well tested) behaviour for other planes.
It's self-synchronising. If something goes wrong somehow, the decoder will resynchronise within a few bytes as long as the input is UTF-8 or an ASCII-compatible encoding; you never end up "out of phase" as can happen with some other encodings.
Most usefully, whatever you're now butted up against works with UTF-8. Maybe some day that'll get formally documented, maybe it won't. As the years drag on, the chances of specifying _anything else_ shrink further, and the de facto popularity of UTF-8 means that even if it's never formalised anywhere, everybody will just assume UTF-8 anyway and you won't have to lift a finger.
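To make that "almost ASCII" property concrete, here's a small Python sketch (the input bytes are invented for illustration; Python's errors="replace" decode handler substitutes U+FFFD):

    # Bytes that are almost ASCII: one stray Latin-1 pound sign in the middle.
    data = b"price: 100 \xa3 only"

    text = data.decode("utf-8", errors="replace")
    print(text)               # 'price: 100 <U+FFFD> only' - the ASCII survives untouched
    print("\ufffd" in text)   # True: only the bad byte became the replacement character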
What I am pointing out is that "know" is just doing a lot of magical work in that sentence.
Not simple to solve.
It's like telling someone who wants to beat Usain Bolt in a race: it's simple, just run faster!
Fortunately, Python is well equipped for that. If you open a file with Python that you know might contain text in mixed encodings, you can use try/except to inform the user, or open it in binary mode and just store the bytes.
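A minimal sketch of that try/except approach (the helper name and the fallback behaviour are just illustrative):

    # Try to read as UTF-8 text; if the file is a mix of encodings,
    # fall back to binary mode and hand back the raw bytes.
    def read_text_or_bytes(path):
        try:
            with open(path, encoding="utf-8") as f:
                return f.read()          # clean text: return str
        except UnicodeDecodeError:
            with open(path, "rb") as f:
                return f.read()          # mixed/unknown encodings: return bytes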
But my favorite way of doing it is:
open('file', errors=strategy)
Strategy can be:
- "ignore": undecodable bytes are skipped
- "replace": undecodable bytes are replaced with the replacement character (U+FFFD)
- "surrogateescape": undecodable bytes are decoded to a special representation which makes no human sense, but can be re-encoded back to the original bytes (as long as you re-encode with the same error handler)
It's kinda ironic, because people bashed Python for separating bytes and text, forcing them to deal with encodings correctly in Python 3. After all, this problem of "slamming bytes together" comes from languages that treat text as a byte array, allowing this stupid mistake in the first place.
Unicode is great, as long as everyone upstream follows all of the rules and nothing goes wrong.
One of the reasons there is a lot of confusion about encodings vs Unicode is that Unicode was initially an encoding. It was thought that 65K characters would be enough to represent all the characters in actual use across languages, and thus you just needed to change from an 8-bit char to a 16-bit char and all would be well (apart from the issue of endianness). Thus Unicode initially specified what each symbol would look like encoded in 16 bits. (See http://unicode.org/history/unicode88.pdf, particularly section 2.) Windows NT, Java, ICU all embraced this.
Then it turned out that you needed a lot more characters than 65K, and instead of each character being 16 bits you would need 32-bit characters (or else have weird 3-byte data types). Whereas people could justify going from 8 bits to 16 bits as the cost of not having to worry about charsets, most developers balked at 32 bits for every character. In addition, you now had a bunch of early adopters (Java and Windows NT) that had already embraced 16-bit characters. So encodings such as UTF-16 (surrogate pairs of 16-bit code units for some Unicode code points) were hacked on.
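For the curious, the surrogate-pair hack is easy to see from Python (U+1F600 is just an arbitrary example of a code point beyond the 16-bit range):

    ch = "\U0001F600"                 # one code point above U+FFFF
    utf16 = ch.encode("utf-16-be")
    print(utf16.hex())                # 'd83dde00' - a high/low surrogate pair
    print(len(utf16) // 2)            # 2: one character costs two 16-bit code units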
I think, if the problem had been understood better at the start, that you have a lot more characters than will fit in 16 bits, then something like UTF-8 would likely have been chosen as the canonical encoding and we could have avoided a lot of these issues. Alas, such is the benefit of 20/20 hindsight.
It's fascinating to see how people can arrive at the answer "No" and conclude that the answer is "Yes".