Notes from Learn Python The Hard Way 🐍
(continued after ex 1–19 & ex 20-22)

Now: exercise 23 about “Strings, Bytes, and Character Encodings”.


Detour to read Vaidehi Joshi’s post on Bits, Bytes, Building With Binary. I get how computers only know 1s and 0s, also the basics of what binary is. I remember math projects around age 11 where we did a lot of counting and converting. I’m not 100% sure if I’m able to convert from base 10 into binary efficiently if my life depended on it today, but I’m okay with that at the moment.

Just like a power of 10 in the decimal system — we’ve got a power of two in binary.

A single digit in binary is known as a binary digit.

8 bits (8 digits) is a byte

A byte is so common in the way that computers interpret binary that it is considered a unit of computer memory.

(This basecs series is amazing, looking foreward to go through the other topics she explains.)


Back to the python exercise:

Character Encoding

Unicode is not new for me, and I’ve dealt with encoding issues plenty over the years. (Hello Norwegian letters æ, ø, å!) But getting a better understanding of bits was cool.

  • first there was ASCII, but 256 characters is not enough past English
  • Unicode was created to solve the problem with 32 bits available

A 32-bit number means we can store 4,294,967,295 characters (2^32), which is enough space for every possible human language and probably a lot of alien ones too. Right now we use the extra space for important things like poop and smile emojis.

  • most common characters only need 8 bits
  • but we can then escape out into 16 or 32 as needed
  • Unicode Transformation Format 8 Bits 👉 utf-8
  • …is a convention for encoding Unicode characters into bytes

b'' is a byte string

Here with raw bytes on the left — and cooked on the right — this is 8, 16 and 32.
(I don’t understand now why 8 is longer than 16 for Simplified Chinese.)

b'Espa\xc3\xb1ol' <===> Español
b'Norsk bokm\xc3\xa5l' <===> Norsk bokmål
b'\xe4\xb8\xad\xe6\x96\x87' <===> 中文
b'\xff\xfeE\x00s\x00p\x00a\x00\xf1\x00o\x00l\x00' <===> Español
b'\xff\xfeN\x00o\x00r\x00s\x00k\x00 \x00b\x00o\x00k\x00m\x00\xe5\x00l\x00' <===> Norsk bokmål
b'\xff\xfe-N\x87e' <===> 中文
b'\xff\xfe\x00\x00E\x00\x00\x00s\x00\x00\x00p\x00\x00\x00a\x00\x00\x00\xf1\x00\x00\x00o\x00\x00\x00l\x00\x00\x00' <===> Español
b'\xff\xfe\x00\x00N\x00\x00\x00o\x00\x00\x00r\x00\x00\x00s\x00\x00\x00k\x00\x00\x00 \x00\x00\x00b\x00\x00\x00o\x00\x00\x00k\x00\x00\x00m\x00\x00\x00\xe5\x00\x00\x00l\x00\x00\x00' <===> Norsk bokmål
b'\xff\xfe\x00\x00-N\x00\x00\x87e\x00\x00' <===> 中文

🤖 got bytes, need string? decode!
💁‍♀️ got string, need bytes? encode!

This makes sense, sort like encrypting can make a text less readable for me — but decrypting makes it something humans can understand.