Decode Bytes and Encode Strings
Notes from Learn Python The Hard Way
(continued after ex 1-19 & ex 20-22)
Now: exercise 23 about "Strings, Bytes, and Character Encodings".
Detour to read Vaidehi Joshi's post on Bits, Bytes, Building With Binary. I get that computers only know 1s and 0s, and I know the basics of what binary is. I remember math projects around age 11 where we did a lot of counting and converting. I'm not 100% sure I could convert from base 10 into binary efficiently today if my life depended on it, but I'm okay with that at the moment.
Just as each place in the decimal system stands for a power of 10, each place in binary stands for a power of two.
A single digit in binary is known as a bit (short for binary digit).
8 bits (8 digits) make a byte.
A byte is so common in the way that computers interpret binary that it is considered a unit of computer memory.
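To make the counting and converting concrete, here is a minimal sketch using only Python built-ins. The number 23 is an arbitrary example, not something from the post or the book.

```python
# Convert between base 10 and binary with built-ins.
n = 23
binary = bin(n)          # '0b10111'
back = int("10111", 2)   # 23

# The same conversion by hand: each place is a power of two.
digits = "10111"
total = sum(int(d) * 2 ** i for i, d in enumerate(reversed(digits)))

# A byte is 8 bits, so it can hold 2 ** 8 = 256 different values (0 to 255).
byte_values = 2 ** 8
padded = format(n, "08b")  # the same number written as a full byte: '00010111'

print(binary, back, total, byte_values, padded)
```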
(This basecs series is amazing, looking forward to going through the other topics she explains.)
Back to the Python exercise:
Character Encoding
Unicode is not new for me, and I've dealt with encoding issues plenty over the years. (Hello Norwegian letters æ, ø, å!) But getting a better understanding of bits was cool.
- first there was ASCII, but its 128 characters (or even the 256 values that fit in one byte) are not enough past English
- Unicode was created to solve the problem, with up to 32 bits available per character (the standard currently stops at code point U+10FFFF, about 1.1 million)
A 32-bit number means we can store 4,294,967,296 different values (2^32), which is enough space for every possible human language and probably a lot of alien ones too. Right now we use the extra space for important things like poop and smile emojis.
- most common characters only need 8 bits
- but we can then escape out into 16, 24 or 32 bits as needed
- Unicode Transformation Format 8 Bits
utf-8
- …is a convention for encoding Unicode characters into bytes
b''
is a byte string
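The "8 bits for common characters, more as needed" idea can be checked by counting the bytes utf-8 produces per character. This is a small sketch; the four sample characters are my own picks for illustration.

```python
# How many bytes does utf-8 need for different characters?
samples = ["a", "å", "中", "💩"]
widths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(widths)  # {'a': 1, 'å': 2, '中': 3, '💩': 4}

# A b'' literal is a sequence of raw bytes, not of characters.
raw = b"Espa\xc3\xb1ol"
print(len(raw))  # 8 bytes, although the text has only 7 characters
print(raw[4])    # 195, indexing a bytes object gives an integer (0xc3)
```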
Here, with raw bytes on the left and cooked strings on the right, are utf-8, utf-16 and utf-32.
(I don't understand yet why utf-8 is longer than utf-16 for Simplified Chinese.)
b'Espa\xc3\xb1ol' <===> Español
b'Norsk bokm\xc3\xa5l' <===> Norsk bokmål
b'\xe4\xb8\xad\xe6\x96\x87' <===> 中文
b'\xff\xfeE\x00s\x00p\x00a\x00\xf1\x00o\x00l\x00' <===> Español
b'\xff\xfeN\x00o\x00r\x00s\x00k\x00 \x00b\x00o\x00k\x00m\x00\xe5\x00l\x00' <===> Norsk bokmål
b'\xff\xfe-N\x87e' <===> 中文
b'\xff\xfe\x00\x00E\x00\x00\x00s\x00\x00\x00p\x00\x00\x00a\x00\x00\x00\xf1\x00\x00\x00o\x00\x00\x00l\x00\x00\x00' <===> Español
b'\xff\xfe\x00\x00N\x00\x00\x00o\x00\x00\x00r\x00\x00\x00s\x00\x00\x00k\x00\x00\x00 \x00\x00\x00b\x00\x00\x00o\x00\x00\x00k\x00\x00\x00m\x00\x00\x00\xe5\x00\x00\x00l\x00\x00\x00' <===> Norsk bokmål
b'\xff\xfe\x00\x00-N\x00\x00\x87e\x00\x00' <===> 中文
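One way to look into the Simplified Chinese case is to count bytes instead of comparing the printed forms. This is a minimal sketch, assuming the output above came from Python's plain "utf-16" and "utf-32" codecs, which prepend a byte order mark (BOM).

```python
text = "中文"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16")  # starts with a 2-byte BOM
utf32 = text.encode("utf-32")  # starts with a 4-byte BOM

print(len(utf8), len(utf16), len(utf32))  # 6 6 12

# Without the BOM, utf-16 needs 2 bytes per character here; utf-8 needs 3.
utf16_le = text.encode("utf-16-le")
print(len(utf16_le))  # 4
print(utf16_le)       # b'-N\x87e'

# 0x2d, 0x4e and 0x65 are also the ASCII codes for '-', 'N' and 'e',
# so Python prints those bytes as single characters instead of \x escapes.
print(bytes([0x2D, 0x4E, 0x65]))  # b'-Ne'
```

So for these two characters the utf-8 and utf-16 results are both 6 bytes long. The utf-8 line only looks longer because all of its bytes are printed as four-character \x escapes, while some of the utf-16 bytes happen to be printable ASCII.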
got bytes, need string? decode!
got string, need bytes? encode!
This makes sense. It is sort of like encryption: encrypting can make a text less readable for me, but decrypting turns it back into something humans can understand.
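The decode/encode rule can be tried out directly. Here is a small sketch using the Norwegian example from above. The latin-1 and ascii cases are my own additions, to show what happens when the codec does not match.

```python
# Got bytes, need string? decode!
raw = b"Norsk bokm\xc3\xa5l"
text = raw.decode("utf-8")
print(text)  # Norsk bokmål

# Got string, need bytes? encode!
again = text.encode("utf-8")
print(again == raw)  # True

# Decoding with the wrong codec does not fail here, it just gives garbage.
garbled = raw.decode("latin-1")
print(garbled)  # Norsk bokmÃ¥l

# Encoding to a codec that lacks the character raises an error by default.
failed = False
try:
    text.encode("ascii")
except UnicodeEncodeError:
    failed = True
print(failed)  # True
```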