Now: exercise 23 about “Strings, Bytes, and Character Encodings”.
Detour to read Vaidehi Joshi’s post “Bits, Bytes, Building With Binary”. I get that computers only know 1s and 0s, and I know the basics of what binary is. I remember math projects around age 11 where we did a lot of counting and converting. I’m not 100% sure I could convert from base 10 into binary efficiently today, even if my life depended on it, but I’m okay with that at the moment.
A single digit in binary is known as a bit (short for binary digit).
8 bits (8 binary digits) make a byte.
A byte is so common in the way that computers interpret binary that it is considered a unit of computer memory.
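As a quick refresher on the converting part, here is a minimal sketch using Python's built-ins. The number 202 is just an example value I picked.

```python
# Convert between base 10 and binary using Python built-ins.
n = 202

as_binary = format(n, "08b")     # zero-padded to 8 digits, i.e. one byte
back_to_int = int(as_binary, 2)  # parse the string as a base-2 number

print(as_binary)    # 11001010
print(back_to_int)  # 202

# One byte (8 bits) can hold 2**8 different values: 0 through 255.
values_per_byte = 2 ** 8
print(values_per_byte)  # 256
```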
(This basecs series is amazing, looking forward to going through the other topics she explains.)
Back to the python exercise:
Unicode is not new for me, and I’ve dealt with encoding issues plenty over the years. (Hello Norwegian letters æ, ø, å!) But getting a better understanding of bits was cool.
- first there was ASCII, but its 128 characters (256 with the 8-bit extensions) are not enough beyond English
- Unicode was created to solve the problem, with 32 bits available
A 32-bit number can hold 4,294,967,296 different values (2^32), which is enough space for every possible human language and probably a lot of alien ones too. (Unicode itself currently stops at 1,114,112 code points, which is still plenty.) Right now we use the extra space for important things like poop and smile emojis.
- the most common characters only need 8 bits (1 byte)
- but we can then escape out into 16, 24 or 32 bits (2 to 4 bytes) as needed
- UTF-8 (Unicode Transformation Format, 8 bits) 👉
- …is a convention for encoding Unicode characters into bytes
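To see the variable width in practice, here is a minimal sketch that counts how many bytes UTF-8 uses per character. The sample characters are my own pick, not from the exercise.

```python
# How many bytes does UTF-8 need for different characters?
samples = ["a", "å", "中", "💩"]

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n_bytes in utf8_lengths.items():
    # ord() gives the Unicode code point of a single character
    print(f"U+{ord(ch):04X} -> {n_bytes} byte(s)")

# 32 bits can represent this many distinct values
print(2 ** 32)  # 4294967296
```

Plain ASCII letters stay at 1 byte, Norwegian letters take 2, the Chinese character takes 3, and the emoji takes 4.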
b'' is a byte string
Here are the raw bytes on the left and the cooked (decoded) strings on the right, in UTF-8, UTF-16 and UTF-32.
(Right now I don’t understand why UTF-8 looks longer than UTF-16 for Simplified Chinese.)
b'Espa\xc3\xb1ol' <===> Español
b'Norsk bokm\xc3\xa5l' <===> Norsk bokmål
b'\xe4\xb8\xad\xe6\x96\x87' <===> 中文

b'\xff\xfeE\x00s\x00p\x00a\x00\xf1\x00o\x00l\x00' <===> Español
b'\xff\xfeN\x00o\x00r\x00s\x00k\x00 \x00b\x00o\x00k\x00m\x00\xe5\x00l\x00' <===> Norsk bokmål
b'\xff\xfe-N\x87e' <===> 中文

b'\xff\xfe\x00\x00E\x00\x00\x00s\x00\x00\x00p\x00\x00\x00a\x00\x00\x00\xf1\x00\x00\x00o\x00\x00\x00l\x00\x00\x00' <===> Español
b'\xff\xfe\x00\x00N\x00\x00\x00o\x00\x00\x00r\x00\x00\x00s\x00\x00\x00k\x00\x00\x00 \x00\x00\x00b\x00\x00\x00o\x00\x00\x00k\x00\x00\x00m\x00\x00\x00\xe5\x00\x00\x00l\x00\x00\x00' <===> Norsk bokmål
b'\xff\xfe\x00\x00-N\x00\x00\x87e\x00\x00' <===> 中文
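One way to check the question about the Chinese text is to count the bytes instead of looking at how long the printed form is. A minimal sketch, with variable names of my own:

```python
text = "中文"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16")  # Python prepends a 2-byte byte order mark (BOM)
utf32 = text.encode("utf-32")  # ... and a 4-byte BOM here

# Count actual bytes, not characters in the printed representation
print(len(utf8), len(utf16), len(utf32))  # 6 6 12

# The hex view shows every byte the same way, printable or not
print(utf8.hex())
print(utf16.hex())
```

So the UTF-8 and UTF-16 versions are both 6 bytes. The UTF-16 one only looks shorter when printed because some of its bytes happen to be printable ASCII (0x2d is `-`, 0x4e is `N`, 0x65 is `e`), so Python shows them as single characters instead of `\x..` escapes.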
🤖 got bytes, need string? decode!
💁♀️ got string, need bytes? encode!
This makes sense. It is sort of like how encrypting makes a text less readable for me, while decrypting turns it back into something humans can understand.
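The two rules of thumb can be sketched as a round trip. This is my own small example rather than the exercise's script, using one of the language names from the output above:

```python
text = "Norsk bokmål"

raw = text.encode("utf-8")    # got string, need bytes? encode!
cooked = raw.decode("utf-8")  # got bytes, need string? decode!

print(raw)             # b'Norsk bokm\xc3\xa5l'
print(cooked == text)  # True

# Decoding with the wrong encoding does not raise an error here,
# it just produces garbled text: 'Norsk bokmÃ¥l'
wrong = raw.decode("latin-1")
```

That last line is the classic encoding bug with æ, ø and å: the bytes are fine, but they are read back with a different convention than the one used to write them.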