A Live Developer Journal

Hash Tables Under The Cover

What is a hash table?

A hash table is a list of paired values (key and value). In the example below, each kitten (key) has a colour (value) associated with it.

1
2
3
4
5
6
7
kittens = {
  "Haribo" => "Champagne",
  "Berry" => "White",
  "Granger" => "White and fluffy",
  "Diamond" => "Tabby",
  "Mistique" => "Faded dark gray and ginger"
}

We can look up a kitten's colour using this syntax:

1
kittens["haribo"]

Hashing and Hash Functions

Hashing

Hashing is where you take a series of one or more characters and convert them into numbers.

1
2
3
4
5
chars = {
  "a" => 1,
  "b" => 2,
  "c" => 3
}

Hash function

A hash function is the process for converting each of the characters into numbers. Usually, a hash function returns the same length number no matter how many characters it started with.

1
2
3
4
5
function sum(char1, char2, char3) {
  return chars[char] + chars[char2] + chars[char3];
}

sum("a", "b", "c"); //6

How hash tables store their values using hashing

First, lets create a thesaurus and add some items to it.

1
thesaurus = {}

Adding first entry to the hash:

1
2
thesaurus["cab" = "taxi"]
thesaurus["ace" = "star"]

Thesaurus now looks like this:

1
2
3
4
thesaurus = {
  "cab" => "taxi",
  "ace" => "star"
}

Each time we added a new key value pair to our table, the computer applied a hash function to the key to convert it into a number. Then it stored the value in an index in the computers memory with that same number.

To access the value, the computer hashed the key again and looked at the resulting index to find the value that belongs to this key.

Hashmaps are super efficient in terms of Big O Notation, O(1). O(1) means that no matter how many key value pairs are in the hash table, it always takes the same number of steps to find the value it's looking for. Unlike an array where the number of steps it takes to find an item depends on how long the array is.

The hash function the computer uses is pretty complicated. To make the process easier to understand, our example below uses a much simpler hash function that we made up.

  1. The computer applies a hash function to the key: cab = 3 + 2 + 1 = 7.
  2. The computer stores the value "taxi" into cell 7.
  3. The computer applies a hash function to the key: star = 19 + 20 + 1 + 19 = 59.
  4. The computer stores the value "star" in cell 59.

We look up a value associated with the key of "cab":

1
thesaurus["cab"]
  1. The computer hashes the key we're looking up: cab = 3 + 2 + 1 = 7.
  2. The result is 7, so the computer looks inside the cell 7 and returns the value that is stored there: "taxi".

How to deal with collisions?

If the hash function produces the same hash value (cell number) as another key that has been hashed, it'll store the value in a sub-array inside of that cell. It'll look at all the 0 indexs of the sub-arrays until it finds the key it's looking for, then will return the value stored at index 1 of the correct sub-array.

What is the great balancing act?

A good hash table strikes a balance between avoiding collisions while not consuming memory. E.g. Storing 5 bits of data across 100 cells wastes memory, whilst storing 100 bits of data in 5 cells causes too much collision.

A good rule of thumb: For every 7 data elements stored in a hash table, it should have 10 cells, which is known as the ideal load factor (7 elements / 10 cells).

If you initially stored 7 pieces of data in a hash, the computer might allocate a hash with 10 cells. When you begin to add more data, the computer will expand the hash by adding more cells and changing the hash function so that the new data will be distributed evenly across the new cells.

Most languages handle this for you.