Notes from Learn Python The Hard Way šŸ
(continued after ex 1ā€“19 & ex 20-22)

Now: exercise 23 about ā€œStrings, Bytes, and Character Encodingsā€.


Detour to read Vaidehi Joshiā€™s post on Bits, Bytes, Building With Binary. I get how computers only know 1s and 0s, also the basics of what binary is. IĀ remember math projects around age 11 where we did a lot of counting and converting. Iā€™m not 100% sure if Iā€™m able to convert from base 10 into binary efficiently if my life depended on it today, but Iā€™m okay with that at the moment.

Just like a power of 10 in the decimal system ā€” weā€™ve got a power of two in binary.

A single digit in binary is known as a binary digit.

8 bits (8 digits) is a byte

A byte is so common in the way that computers interpret binary that it is considered a unit of computer memory.

(This basecs series is amazing, looking foreward to go through the other topics she explains.)


Back to the python exercise:

Character Encoding

Unicode is not new for me, and Iā€™ve dealt with encoding issues plenty over the years. (Hello Norwegian letters Ʀ, Ćø, Ć„!) But getting a better understanding of bits was cool.

  • first there was ASCII, but 256 characters is not enough past English
  • Unicode was created to solve the problem with 32 bits available

A 32-bit number means we can store 4,294,967,295 characters (2^32), which is enough space for every possible human language and probably a lot of alien ones too. Right now we use the extra space for important things like poop and smile emojis.

  • most common characters only need 8 bits
  • but we can then escape out into 16 or 32 as needed
  • Unicode Transformation Format 8 Bits šŸ‘‰ utf-8
  • ā€¦is a convention for encoding Unicode characters into bytes

b'' is a byte string

Here with raw bytes on the left ā€” and cooked on the right ā€” this is 8, 16 and 32.
(I don't understand now why 8 is longer than 16 for Simplified Chinese.)

b'Espa\xc3\xb1ol' <===> EspaƱol
b'Norsk bokm\xc3\xa5l' <===> Norsk bokmƄl
b'\xe4\xb8\xad\xe6\x96\x87' <===> äø­ę–‡
b'\xff\xfeE\x00s\x00p\x00a\x00\xf1\x00o\x00l\x00' <===> EspaƱol
b'\xff\xfeN\x00o\x00r\x00s\x00k\x00 \x00b\x00o\x00k\x00m\x00\xe5\x00l\x00' <===> Norsk bokmƄl
b'\xff\xfe-N\x87e' <===> äø­ę–‡
b'\xff\xfe\x00\x00E\x00\x00\x00s\x00\x00\x00p\x00\x00\x00a\x00\x00\x00\xf1\x00\x00\x00o\x00\x00\x00l\x00\x00\x00' <===> EspaƱol
b'\xff\xfe\x00\x00N\x00\x00\x00o\x00\x00\x00r\x00\x00\x00s\x00\x00\x00k\x00\x00\x00 \x00\x00\x00b\x00\x00\x00o\x00\x00\x00k\x00\x00\x00m\x00\x00\x00\xe5\x00\x00\x00l\x00\x00\x00' <===> Norsk bokmƄl
b'\xff\xfe\x00\x00-N\x00\x00\x87e\x00\x00' <===> äø­ę–‡

šŸ¤–ā€‚got bytes, need string? decode!
šŸ’ā€ā™€ļøā€‚got string, need bytes? encode!

This makes sense, sort like encrypting can make a text less readable for me ā€” but decrypting makes it something humans can understand.