July 10, 2015

Introduction to Hashing

technology

Hello everyone! 🙂 I hope you are doing well. My collage holidays are still going on, and I have way more than free time that I should have in my hands. Anyway, day before yesterday I took part in hashing monk competition in hackerearth. I learnt about hashing just before that. I could answer 1 question perfectly well, one partially and other two I could only think of ways to do it, but extremely time costly. I have still ways to go. Fortunately this is a weekly contest, so I can take part again! I have written some lines of python code which should be self explanatory even if you don’t know python. If you have any doubts in that, you can ask me in comments.

Hashing is a very smart way to store any object in an array. Hashing in simplest terms would mean indexing. What would be the advantage of hashing any object? Lets say you are given a sorted array and you have to find an element. The best way would be a binary search with O(logn), which is nice, seeing we can search from a million array list in just 6 passes, but only IF it is sorted. We can do better by hashing the values and searching in just O(1) time, unsorted. Sounds nice?

Sounds really cool! But what exactly is hashing and how do we do it? Lets say you are given an array or list of numbers [3, 5, 1, 4, 6, 13, 7] which is of size 7. Initialize a list with None of say size 10. You can do this by

table = [None for x in xrange(7)]

We have to map each entry of the given array into our new formed table, using some hash function. Any value that would be returned by the hash function would be the index for our new table. Therefore a hash function is a function that maps elements of the array to our new table. For the above array, consider a hash function say

def hashh(number, size_of_table):

     #'hash' is a keyword in python

     return number % size_of_table

It simply returns the mod of a number to the size_of_table(In this case 10). This value would be our index for the new table. So for value 3, our hash function would return 3, for 5 it would 5 and so on. If we are searching for some value, say x, we can pass it to the hash function and see if the given value is present in our table. Its running time will always be O(1).

Sounds so good to be true? Yes indeed. There is always something that obstructs from getting the perfect method, and if you are paying attention closely you would notice this hash function that we wrote, is very naive. What about the number 13? It will return 3, and in our table index 3 is already occupied. This is called collision. No matter how good hash function you write, there is bound to be some collision. If we are already given a list to store, it is possible to create a perfect hash function, but in real world values are constantly being added.

There are several ways to tackle this problem and still keep the running time O(1). This is called collision resolution. One of them is linear probing. If we find an index and it is already occupied in the table, we go through the next indexes, till we find an empty one and store there. Note that for this we will have to consider the table as a circular array. One more way is quadratic probing. Suppose a hash function gives an index h and it is filled, so we will hash it again which will give h+1, h+4, h+9 and so on. This increases the hash vale by perfect square.

This was an introduction to hashing and we covered few basic concepts of hashing. If you want to know more there are tons of resources you can refer to. If you are a python enthusiast I suggest you go through this. It is also necessary to learn more formal terms which I have skipped here. As always if you have any doubt or suggestion, please comment it below! Thank you! 😀