Designing Good Hash Functions

Distribute Education like Computer Geek

Properties of a Good Hash Function

A hash function is not just any formula. It must satisfy certain properties.

1 Uniform Distribution

The hash function should distribute keys evenly across the table. If most keys go to only 2 or 3 indices, then performance becomes poor.

Non Uniformed Hashing --- Designing Good Hash Functions

Our goal is to spread keys like salt on food, not like a pile of sugar in one corner. Uniform distribution reduces collisions and improves performance.

.

2 Fast Computation

Hashing is used to make search fast. If the hash function itself takes too much time, then what is the point?

A good hash function must be

  • Simple
  • Quick to compute
  • Efficient in real systems

Operations like modulo, multiplication is commonly used because they are fast.

.

3 Deterministic Nature

A hash function must always give the same output for the same key.

If today h(52) = 2 and tomorrow h(52) = 7, then the system will break. Because when we insert the key 52, we store it at index 2. Later, when we try to search for 52, we again apply the hash function. If it now gives 7, the system will go to index 7 and will not find the key. It will wrongly conclude that the key does not exist. So, deterministic behaviour is mandatory.

.

Types of Hash Functions

Let us now study common hash function techniques used.

1. Division Method

Formula is h(k) = k mod m

This is the most common and simple method.

If m = 10 and k = 52

52 mod 10 = 2, this is stored at index 2.

If m is badly chosen (like power of 2 in some patterns), collisions increase. Modulo operation mainly depends on the lower bits of the number in binary. If the keys have similar lower bits, they will collide again and again.
2 mod 2 = 0
4 mod 2 = 0
6 mod 2 = 0
8 mod 2 = 0

3 mod 2 = 1
5 mod 2 = 1
7 mod 2 = 1
9 mod 2 = 1

Even numbers and odd numbers pile up and creates collision every time. This is pattern based collision. That is why prime numbers are preferred. Prime numbers break these patterns.

.

Choosing Table Size

This is where many students make mistakes.

Why Prime Numbers?

If we use h(k) = k mod m and m is a prime number, distribution improves.

If m = 10 and 18 mod 10 = 8. Also, 58 mod 10 = 8 then it will make a Collision.

If m = 7 and 52 mod 7 = 3. Also, 32 mod 7 = 4 then it will be no collision. Prime numbers reduce clustering patterns.

Prime Number Hash table - Designing Hash table functions

.

2. Multiplication Method

Formula is h(k) = floor(m × (kA mod 1)) where A is a constant between 0 and 1.

This method reduces dependency on table size and sometimes gives better distribution. It is slightly more complex but useful in practice.

.

3. Mid-square Method

Steps –

  1. Square the key
  2. Extract middle digits
  3. Use them as index

Example

k = 44
44² = 1936
Middle digits → 93

So, index becomes 93 (depending on table size). This method works well when keys have similar patterns.

.

4. Folding Method

Divide the key into parts and add them.

k = 123456
Split – 123 and 456
Add = 123 + 456 = 579

Then take mod with table size. It is useful for large numeric keys like phone numbers.

.

5. Digit Extraction Method

Select specific digits from the key.

k = 987654
Take 2nd and 4th digits → 8 and 6 so the Index = 86

This method works only if selected digits vary uniformly.

.

Load Factor

Load Factor = n / m                                                                     

Where n = number of elements and m = table size

Load Factor of Hashing

Load factor tells how full the table is. If load factor increases

  • Collision increases
  • Performance degrades

In ideal hashing, Search, Insert, Delete → O(1)

But if load factor becomes too high, operations may degrade to O(n). That is why resizing and rehashing are important in real systems.

.

Effect of Load Factor on Performance

Low load factor → Less collision → Faster performance

High load factor → More collision → Slower performance

So, balance is necessary. Not too empty. Not too full. Just like a classroom. If 5 students sit in 50 seats, space is wasted.
If 100 students sit in 50 seats, collision happens. Hash table is also like that.

Test Yourself

Q1- If a hash function distributes 90% of keys into only 2 indices and the remaining 10% across other indices, which property is violated?
  1. Deterministic nature
  2. Fast computation
  3. Uniform distribution
  4. Load factor

Ans – (3)

Explanation – Uniform distribution means keys must spread evenly across the table. If most keys cluster in 2 indices, performance degrades.

Q2- Suppose m = 16 and keys are multiples of 4. Which issue is most likely?
  1. Deterministic failure
  2. Reduced load factor
  3. Pattern-based collision
  4. Hash function becomes random

Ans – (3)

Explanation – 16 is power of 2. Modulo operation depends on lower bits. Multiples of 4 share similar lower bits, causing repeated collisions.

Q3- Which statement is true?
  1. Prime table size eliminates collision
  2. Prime table size reduces clustering
  3. Prime numbers increase load factor
  4. Prime numbers make hashing O(1) always

Ans – (2)

Explanation – Prime size does not remove collision, but reduces pattern-based clustering.

Q4- If load factor becomes 3, what does it imply?
  1. Table is half full
  2. Table is exactly full
  3. No collision possible
  4. More elements than slots

Ans – (4)

Explanation – Load factor = n / m
If n = 150 and m = 50, load factor = 3. That means table is overloaded.

Q5- If a hash function gives different outputs for the same key at different times, what fails?
  1. Uniform distribution
  2. Deterministic property
  3. Load factor
  4. Prime property

Ans – (2)

Explanation – Same input must always produce same output. Otherwise search and deletion will fail.

Q6- In division method, if m is 10 and keys end with same digit, what happens?
  1. Uniform distribution
  2. Increased collision
  3. Reduced load factor
  4. Random mapping

Ans – (2)

Explanation – Keys – 10, 40, 20, 30 and m = 10.

Modulo 10 depends on last digit. Keys with same last digit map to same index.

Q7- Which scenario guarantees collision?
  1. Load factor < 1
  2. m is prime
  3. n > m
  4. Hash function is deterministic

Ans – (3)

Explanation – By Pigeonhole Principle, if number of keys exceeds number of slots, collision must occur.

Q8- If all keys are even and m = 2, what happens?
  1. Perfect distribution
  2. All keys map to index 0
  3. All keys map to index 1
  4. No collision

Ans – (2)

Explanation – Even mod 2 = 0. So complete clustering happens.

Q9- Why does high load factor degrade performance even if hash function is good?

Ans – Even a good hash function cannot prevent collision if table becomes overcrowded. When n increases while m remains fixed, average bucket size increases, leading to longer search time.

Q10- Explain mathematically why collision is unavoidable.

Ans – If number of keys exceeds number of slots, by Pigeonhole Principle at least two keys must share one slot. Therefore collision is mathematically guaranteed.

Q11- How does uniform distribution improve average time complexity?

Ans – If keys spread evenly, expected number of elements per bucket remains small. Hence search, insert, delete stay close to O(1).

BOOKS

Algorithm T H CoremanAlgorithm by Udit AgarwalAlgorithm Ellis HorowitzData Structure & Algorithm Made Easy Narasimha