
Introduction

When we study algorithms or data structures, one question always comes up –

How fast can we search?

Because in real-life applications, searching happens again and again – finding a student record, checking whether a product ID exists, or verifying a username during login.

If searching is slow, the entire system becomes slow.

.

From Slow Searching to Fast Searching
  1. Linear Search
    In linear search, we check elements one by one. If there are 10,000 elements, in the worst case we may check all 10,000.
    Time complexity is O(n), so in the worst case we make up to 10,000 comparisons.

[Figure: Hashing Introduction – Linear Search vs Binary Search]

  2. Binary Search
    In binary search, we improve performance, but the condition is that the data must be sorted.
    Time complexity is O(log n).

So performance improves.
But sorting itself takes extra effort – nearly O(n log n) time.
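As a rough illustration, here is a minimal sketch of both approaches in Python (the function and variable names are only illustrative):

# Linear search: check elements one by one – O(n) in the worst case.
def linear_search(arr, target):
    for i, value in enumerate(arr):
        if value == target:
            return i
    return -1

# Binary search: repeatedly halve the range – O(log n),
# but the list must already be sorted.
def binary_search(sorted_arr, target):
    low, high = 0, len(sorted_arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_arr[mid] == target:
            return mid
        elif sorted_arr[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1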

Now the goal becomes clear – can we search in constant time, O(1)?

That is where hashing comes into the picture.

.

What is Hashing?

Hashing is a technique where we do not search for the element step by step. Instead, we calculate the position where the element should be stored.

So if we know the key, we directly go to that location.

This is why hashing can achieve O(1) time in ideal cases.

Example

Suppose we are storing student roll numbers. Assume roll numbers can go up to 10. Now imagine our class has only 6 students.

Their roll numbers are:

5, 1, 7, 10, 9, 2


If we want to find the key 10, we do not search one by one. We directly go to index 10 and check whether the key is present there or not. This direct access makes the time complexity O(1). It looks like a very successful and powerful method because searching becomes extremely fast.
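A small sketch of this direct-access idea in Python, assuming an array of size 11 so that index 10 exists (the names are illustrative):

TABLE_SIZE = 11                      # indices 0..10, enough for roll numbers up to 10
table = [None] * TABLE_SIZE

# Store each roll number at the index equal to its own value.
for roll in [5, 1, 7, 10, 9, 2]:
    table[roll] = roll

# Searching for key 10: jump straight to index 10, no scanning needed.
print(table[10] == 10)               # True – found in O(1)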

But there is a small problem. There is a glitch.

.

Why is Direct Indexing Not Useful?

If we use direct indexing, assume that roll numbers can go up to 5000, and imagine that a newly admitted student has roll number 5000. What will we do?

  • Store 5 at index 5

  • Store 1 at index 1

  • Store 7 at index 7

  • Store 10 at index 10

  • Store 9 at index 9

  • Store 2 at index 2

  • Store 5000 at index 5000


Now think carefully. What size array do we need?
We need an array of size at least 5001.

But how many students do we actually have?
Only 6.

So what about the remaining positions?
All empty.

That means –

  • Thousands of spaces are unused

  • Memory is wasted

  • System becomes inefficient

Yes, searching will be very fast. But space usage becomes very poor. And in real systems, memory is important. That is why direct indexing is not a practical solution.

Instead, we use a hash function to reduce the index range. This is where hashing becomes powerful.

.

Basic Terminology

To understand hashing clearly, we should know three important terms.

Key – The value we want to store or search.

Hash Table – An array where data is stored.

Hash Function – A function that converts a key into an index of the array.

The hash function is the most important part.
A good hash function distributes data properly inside the table.

.

Hash Function

Instead of storing the key at the index equal to its actual value, we use a formula to limit the index range. A common hash function is –

H(x) = x mod m

Where:

  • x is the key
  • m is the size of the hash table.

Let us take a small example.

Assume the table size is 10.

If the key is 37, then 37 mod 10 = 7
So it will be stored at index 7.

If the key is 52, then 52 mod 10 = 2
So it will be stored at index 2.

In this way, no matter how large the key is,
the index will always be between 0 and 9.


This reduces memory usage and keeps searching fast.
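A minimal sketch of this scheme in Python – it deliberately ignores collisions, which are handled in the next part, and the names are illustrative:

M = 10                               # table size
hash_table = [None] * M

def h(x):
    return x % M                     # H(x) = x mod m

def insert(key):
    hash_table[h(key)] = key

def search(key):
    return hash_table[h(key)] == key

insert(37)                           # 37 mod 10 = 7 -> stored at index 7
insert(52)                           # 52 mod 10 = 2 -> stored at index 2
print(search(37), search(52))        # True True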

.

Time Complexity of Hashing

1. In the Ideal Condition

If the hash function distributes keys uniformly, then,

  • Search – O(1)

  • Insertion – O(1)

  • Deletion – O(1)

This constant time complexity is the main advantage of hashing – but only in the IDEAL CONDITION. Consider the keys 52 and 32 with a table of size 10. Since 52 mod 10 = 2 and 32 mod 10 = 2, both keys map to index 2. This is a collision – which key gets the slot and which does not? Collision handling is covered in the next part, but one important idea appears already: m should preferably be a prime number. With m = 7 (a prime), 52 mod 7 = 3 and 32 mod 7 = 4, so these keys no longer clash. A prime table size generally produces fewer collisions.
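The effect can be checked in a couple of lines; this sketch only compares the computed indices and is not a collision-handling scheme:

keys = [52, 32]

print([k % 10 for k in keys])        # [2, 2]  -> collision with m = 10
print([k % 7 for k in keys])         # [3, 4]  -> no collision with m = 7 (prime)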

.

Pigeonhole Principle in Hashing

In hashing, keys are like pigeons and hash table indices are like pigeonholes.

If the number of keys is greater than the number of slots in the hash table, then at least two keys must map to the same index. This situation is called collision. So collision is not a mistake. It is mathematically unavoidable.
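A quick way to see this in Python, using 11 assumed keys and only 10 slots:

keys = range(11)                     # 11 keys (pigeons)
slots = [k % 10 for k in keys]       # only 10 possible indices (pigeonholes)
print(len(slots) > len(set(slots)))  # True – at least two keys share an index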

.

Collision Handling Techniques
  1. What is a collision
  2. Open addressing
  3. Linear probing
  4. Quadratic probing
  5. Double hashing
  6. Separate chaining
  7. Clustering problem
  8. Comparison of methods

We will study these in the next part of this algorithm series.

.

Applications of Hashing in Real Life

Hashing is not just a theory topic. It is used in many real systems where we need fast searching and fast access.

1. Password Verification

When you create an account, your password is not stored directly. It is converted into a hash value and stored in the database. During login, the entered password is hashed again and compared with the stored hash.
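A minimal illustration using Python's hashlib – real systems use salted, slow password hashes such as bcrypt or Argon2, so treat this only as a sketch of the idea:

import hashlib

def hash_password(password):
    # Store only the hash, never the plain-text password.
    return hashlib.sha256(password.encode()).hexdigest()

stored_hash = hash_password("my_secret")           # saved at sign-up
print(hash_password("my_secret") == stored_hash)   # True  – login succeeds
print(hash_password("wrong_pass") == stored_hash)  # False – login fails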

2. Database Indexing

In large databases, searching a record line by line is very slow.
Hashing is used to create an index, so the system can directly jump to the required record.
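A toy sketch of the idea – a hash-based index maps a key to the record's location, so the record is fetched without scanning (the keys and offsets below are invented for illustration):

# Hypothetical index: record key -> byte offset of the record in a data file
index = {101: 0, 205: 512, 999: 1024}

record_key = 205
if record_key in index:              # average O(1) lookup
    print("jump to offset", index[record_key])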

3. Compiler Design – Symbol Table

In compilers, variables and function names are stored in a symbol table.
Hashing is used to quickly check whether a variable is already declared and what its type is. This makes compilation faster.
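A compiler's symbol table can be sketched as a hash map from name to type – a simplification, since real symbol tables store much more information:

symbol_table = {}

def declare(name, var_type):
    if name in symbol_table:         # average O(1): already declared?
        raise NameError(name + " already declared")
    symbol_table[name] = var_type

declare("count", "int")
print(symbol_table.get("count"))     # int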

4. Caching Systems

Web browsers and servers use hashing in caching. When you visit a website, data is stored in cache. Hashing helps to quickly check whether data already exists in cache.
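A minimal cache sketch using a dictionary; fetch_from_server is a placeholder standing in for the slow network call:

cache = {}

def fetch_from_server(url):
    return "<content of " + url + ">"    # placeholder for the real request

def get_page(url):
    if url in cache:                     # average O(1): already cached?
        return cache[url]
    data = fetch_from_server(url)
    cache[url] = data
    return data

print(get_page("example.com"))           # fetched from the server, then cached
print(get_page("example.com"))           # served directly from the cache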

5. Searching in Data Structures

Structures like Python's dictionary, Java's HashMap, and C++'s unordered_map all use hashing internally to provide O(1) average time for search and insertion.
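For example, in Python both dict and set rely on hashing, so membership tests take O(1) time on average (the data below is only illustrative):

ages = {"asha": 21, "ravi": 23}          # dict: keys are hashed
print("asha" in ages)                    # True, average O(1)

ids = {5, 1, 7, 10, 9, 2}                # set: elements are hashed
print(10 in ids)                         # True, average O(1)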

6. Duplicate Detection

Hashing is used to detect duplicate files, detect duplicate records, and remove repeated elements from a list, because hashing makes comparisons very fast.
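A short sketch: a set remembers what has already been seen, so duplicates are found in a single pass (the sample list is assumed):

items = [5, 1, 7, 5, 9, 1]
seen, duplicates = set(), set()
for x in items:
    if x in seen:                        # average O(1) membership test
        duplicates.add(x)
    seen.add(x)
print(sorted(duplicates))                # [1, 5]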

Test Yourself

Q1- If direct indexing is used for storing 6 students with roll numbers 5, 18, 72, 450, 999, 5000, what is the minimum required array size?
  1. 6
  2. 1000
  3. 5001
  4. 999

Ans – (3)

Explanation – In direct indexing, the index is equal to the key. Since the largest roll number is 5000, we need an array that can store index 5000. Array indexing starts from 0, so size must be at least 5001. This shows why direct indexing wastes memory.

Q2- Which of the following is the main drawback of direct indexing?
  1. Slow searching
  2. Complex implementation
  3. High memory usage
  4. High time complexity

Ans – (3)

Explanation – Direct indexing gives O(1) search, so time is good. But if the key range is large and the actual data is small, memory is heavily wasted. That is the main limitation.

Q3- Why is m often chosen as a prime number in H(x) = x mod m?
  1. To increase memory
  2. To remove collision completely
  3. To increase time complexity
  4. To reduce clustering and collision

Ans – (4)

Explanation – Prime table size helps distribute keys more uniformly. It does not remove collision completely but reduces the probability.

Q4- If the number of keys is greater than table size, collision is:
  1. Optional
  2. Rare
  3. Impossible
  4. Guaranteed

Ans – (4)

Explanation – By Pigeonhole Principle, if keys > slots, at least two keys must share a slot. So collision is guaranteed.

Q5- In ideal hashing condition, time complexity of search is:
  1. O(n)
  2. O(log n)
  3. O(1)
  4. O(n log n)

Ans – (3)

Explanation – If the hash function distributes keys uniformly and no heavy collisions occur, then search, insert, and delete all become O(1).

Q6- Which condition makes hashing degrade to O(n)?
  1. Sorted data
  2. All keys map to same index
  3. Large table size
  4. Prime table size

Ans – (2)

Explanation – If all keys collide at one index, then we must scan linearly inside that bucket. That becomes O(n).

Q7- Which statement is true?
  1. Collision means hash function is wrong
  2. Collision can be avoided completely
  3. Collision is mathematically unavoidable
  4. Collision occurs only when table is full

Ans – (3)

Explanation – By Pigeonhole Principle, collision is unavoidable. It is not a mistake, it is a mathematical property.

Q8- Explain how hashing reduces memory wastage compared to direct indexing.

Ans – Hashing uses a hash function to map large keys into a smaller index range. Instead of using the actual key value as the index, we compute H(x) = x mod m. This limits the index to between 0 and m−1. So even if the key is very large, it is stored within a fixed table size. This reduces memory wastage significantly.

Q9- Explain the importance of prime number table size.

Ans – When the table size is prime, the modulo operation distributes keys more uniformly. It reduces patterns and clustering. It does not remove collisions completely but reduces the probability of repeated collisions.

Q10- Compare linear search, binary search, and hashing.

Ans – Linear search has O(n) time. Binary search improves to O(log n) but requires sorted data. Hashing provides O(1) average time without needing sorted data. However, hashing may degrade to O(n) in worst case.

Q11- Explain why hashing is suitable for real-time systems.

Ans – Hashing provides O(1) average time for search and insertion. In real-time systems like caching, login verification, database lookup, fast response is required. Hashing ensures constant time access in most practical cases.
