My spell-checker for CS50 needs a hash function. The problem set specifications describe that students can “search for (good) hash functions online” as long as I cite the origin of what I integrate in my code. So… what exactly makes a hash function any good?!
a hash table should have an easy-to-compute function because anything that’s too hard to compute will take up too much time an space! An expensive function defeats the purpose of finding an efficient data structure
if the hash function didn’t return the same key every time? Well, that would be very bad, because we’d never be able to retrieve the data after we stored it, because we could never be sure where things are
we have to understand how to handle collisions, because they are almost certainly going to happen. This is probably the most important thing to understand about hash functions, because they each need to account for collisions.
Vaidehi’s article helped me understand the difference between these two:
- linear probing (look for the next empty bucket)
- chaining (using linked lists in each bucket).
Searching further, I found another text I really enjoyed: the part on hash tables in Robert Nystrom’s Crafting Interpreters. 🥰 TIL about the 'birthday problem' and 'pigeonhole principle' and there is a fantastic illustration of the two combined. But also this description:
a simple, well-worn hash function called FNV-1a that’s served me fine over the years.
Sounds great! But gotta say, the FNV-1a page is kinda overwhelming. Wikipedia’s Fowler–Noll–Vo is a nicer read. All right, so should/could I use this one?! 🤔 I didn’t find examples of implementations in C that I easily understood. But when I came across this example of djb2, it was more straight forward for me to see how to use this hash function. So I did.
wait, what did I need the hash function to do?
Backtrack! I started digging further into hash functions today because: I initially (weeks ago) thought my program worked, but then I realised it didn't and suspected that the hash function was buggy. Why? Because as long as I had only a single hash bucket, the program found the correct number of misspelled words. And I kinda thought… oh well maybe I didn't find a good hash function? I was impatient and wanted to move on to the next part of the problem. What I now understand — is that the hash function I found is probably fine enough (no version control in the CS50 IDE, so not entirely sure which one it was) — but that my implementation of it in no way set out to handle capitalized words.
strcasecmp() ignores case in my
check function, but my yolo hash function copy-pasta puts
Caterpillar into different buckets. My program worked with 1 hash bucket, because then there was just a single linked list for the
check function to traverse. But when I increase the number of buckets, the
check function will have no way of looking in the correct linked list with capitalized words. 🤯😅😁😎🥳
Doh. The hash function isn’t only used when I load words from the dictionary into the hash table — I also use it to check words from the text. And those can be capitalized. Haha. So I need to make sure the input to the hash function can be alphabetical characters and apostrophes — and that values
cat receive the same key, an identical numerical index for the hash table.
If I had read the problem set description properly, I would have seen this explicitly hinted. But I guess tracking this down as a bug was a better learning experience.
Other stuff I learnt about when CS50 sent me searching “for (good) hash functions online”.
Cryptograpic or nah?
Hash functions are used to solve a whole bunch of different problems, and it was helpful to understand that I did not need one that is a cyclic redundancy check, a checksum function or cryptographic. Wikipedia has a list with the categories.
Checksum functions are related to hash functions, fingerprints, randomization functions, and cryptographic hash functions. However, each of those concepts has different applications and therefore different design goals. For instance, a function returning the start of a string can provide a hash appropriate for some applications but will never be a suitable checksum. — wikipedia.org/wiki/Checksum
More on collisions
Apparently not just a theoretical construct. This stackexchange answer writes that the djb2 algorithm I am using can have 7 collisions in the test they did, for example:
My spell-checker can live with that.
Now I just need to fix that memory leak. 🚨
But everything else works and this time I am certain!