This commit is contained in:
krahets
2024-04-06 03:02:20 +08:00
parent 0a9daa8b9f
commit 8d37c215c8
148 changed files with 70398 additions and 408 deletions


---
comments: true
---
# 6.3   Hash algorithms
The previous two sections introduced the working principle of hash tables and the methods to handle hash collisions. However, both open addressing and chaining can **only ensure that the hash table functions normally when collisions occur, but cannot reduce the frequency of hash collisions**.
If hash collisions occur too frequently, the performance of the hash table deteriorates drastically. As shown in Figure 6-8, for a separate-chaining hash table, in the ideal case the key-value pairs are evenly distributed across the buckets, achieving optimal query efficiency; in the worst case, all key-value pairs are stored in the same bucket, degrading the time complexity to $O(n)$.
![Ideal and worst cases of hash collisions](hash_algorithm.assets/hash_collision_best_worst_condition.png){ class="animation-figure" }
<p align="center"> Figure 6-8 &nbsp; Ideal and worst cases of hash collisions </p>
**The distribution of key-value pairs is determined by the hash function**. Recall the steps for computing a hash function: first compute the hash value, then take it modulo the array length:
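The two-step mapping just described can be sketched as follows (illustrative; Python's built-in `hash()` stands in for the hash algorithm):

```python
def bucket_index(key, capacity: int) -> int:
    """Map a key to a bucket: hash first, then modulo the array length."""
    return hash(key) % capacity

# The same key always maps to the same bucket within one run,
# and the result is always a valid array index.
assert bucket_index("hello", 10) == bucket_index("hello", 10)
assert 0 <= bucket_index(12345, 8) < 8
```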
Observing the above calculation, when the hash table capacity `capacity` is fixed, **the hash algorithm `hash()` determines the output value**, thereby determining the distribution of key-value pairs in the hash table.
This means that, to reduce the probability of hash collisions, we should focus on the design of the hash algorithm `hash()`.
## 6.3.1 &nbsp; Goals of hash algorithms
To achieve a "fast and stable" hash table data structure, hash algorithms should have the following characteristics:
- **Determinism**: For the same input, the hash algorithm should always produce the same output. Only then can the hash table be reliable.
- **High efficiency**: The process of computing the hash value should be fast enough. The smaller the computational overhead, the more practical the hash table.
- **Uniform distribution**: The hash algorithm should ensure that key-value pairs are evenly distributed in the hash table. The more uniform the distribution, the lower the probability of hash collisions.
In fact, hash algorithms are not only used to implement hash tables but are also widely applied in other fields.
- **Password storage**: To protect the security of user passwords, systems usually do not store the plaintext passwords but rather the hash values of the passwords. When a user enters a password, the system calculates the hash value of the input and compares it with the stored hash value. If they match, the password is considered correct.
- **Data integrity check**: The data sender can calculate the hash value of the data and send it along; the receiver can recalculate the hash value of the received data and compare it with the received hash value. If they match, the data is considered intact.
For cryptographic applications, to prevent reverse engineering such as deducing the original password from the hash value, hash algorithms need higher-level security features.
- **Unidirectionality**: It should be impossible to deduce any information about the input data from the hash value.
- **Collision resistance**: It should be extremely difficult to find two different inputs that produce the same hash value.
- **Avalanche effect**: Minor changes in the input should lead to significant and unpredictable changes in the output.
Note that **"uniform distribution" and "collision resistance" are two separate concepts**. Satisfying uniform distribution does not necessarily imply collision resistance. For example, under random input `key`, the hash function `key % 100` produces a uniformly distributed output. However, this hash algorithm is too simple: all keys with the same last two digits produce the same output, so it is easy to deduce a usable `key` from a given hash value and thereby crack a password.
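A quick illustration of why `key % 100` is easy to invert even though it distributes random keys uniformly:

```python
# Any two keys sharing their last two digits collide by construction,
# so a usable preimage for a given hash value is trivial to forge.
assert 123456 % 100 == 99956 % 100  # both hash to 56
```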
## 6.3.2 &nbsp; Design of hash algorithms
The design of hash algorithms is a complex issue that requires consideration of many factors. However, for some less demanding scenarios, we can also design some simple hash algorithms.
- **Additive hash**: Add up the ASCII codes of each character in the input and use the total sum as the hash value.
- **Multiplicative hash**: Exploit the non-correlation of multiplication by multiplying the running hash by a constant in each round, accumulating the ASCII code of each character into the hash value.
- **XOR hash**: Accumulate the hash value by XORing each element of the input data.
- **Rotating hash**: Accumulate the ASCII code of each character into a hash value, performing a rotation operation on the hash value before each accumulation.
=== "Python"
It is worth noting that if the `key` is guaranteed to be randomly and uniformly distributed, then choosing either a prime number or a composite number as the modulus can produce uniformly distributed hash values.
In summary, we usually choose a prime number as the modulus, and this prime number should be large enough to eliminate periodic patterns as much as possible, enhancing the robustness of the hash algorithm.
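A small experiment illustrating this point (step size and moduli chosen arbitrarily): keys in an arithmetic progression cluster into few buckets when the modulus shares a factor with the step, but spread across all buckets under a prime modulus.

```python
keys = range(0, 60, 6)  # keys with a periodic pattern (step 6)

# gcd(6, 9) = 3, so only 3 of the 9 buckets are ever used
assert len({k % 9 for k in keys}) == 3

# 7 is prime and coprime to 6, so all 7 buckets are used
assert len({k % 7 for k in keys}) == 7
```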
## 6.3.3 &nbsp; Common hash algorithms
It is not hard to see that the simple hash algorithms mentioned above are quite "fragile" and far from reaching the design goals of hash algorithms. For example, since addition and XOR obey the commutative law, additive hash and XOR hash cannot distinguish strings with the same content but in different order, which may exacerbate hash collisions and cause security issues.
Over the past century, hash algorithms have been in a continuous process of upgrading and optimization.

- MD5 and SHA-1 have been successfully attacked multiple times and are thus abandoned in various security applications.
- SHA-2 series, especially SHA-256, is one of the most secure hash algorithms to date, with no successful attacks reported, hence commonly used in various security applications and protocols.
- SHA-3 has lower implementation costs and higher computational efficiency compared to SHA-2, but its current usage coverage is not as extensive as the SHA-2 series.
<p align="center"> Table 6-2 &nbsp; Common hash algorithms </p>
<div class="center-table" markdown>
</div>
# Hash values in data structures
We know that the keys in a hash table can be of various data types such as integers, decimals, or strings. Programming languages usually provide built-in hash algorithms for these data types to calculate the bucket indices in the hash table. Taking Python as an example, we can use the `hash()` function to compute the hash values for various data types.
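For example (string hash values in Python are salted per interpreter run, so only their properties are checked here, not concrete values):

```python
# Immutable built-in types are hashable in Python.
num_hash = hash(12836)
str_hash = hash("Hello, algorithms")
tup_hash = hash((12836, "Hello"))

assert isinstance(str_hash, int)
assert tup_hash == hash((12836, "Hello"))  # deterministic within one run

# Mutable types such as list are unhashable.
try:
    hash([12836, "Hello"])
    raise AssertionError("lists should not be hashable")
except TypeError:
    pass
```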


---
comments: true
---
# 6.2 &nbsp; Hash collision
As mentioned in the previous section, **usually the input space of a hash function is much larger than its output space**, making hash collisions theoretically inevitable. For example, if the input space consists of all integers and the output space is the size of the array capacity, multiple integers will inevitably map to the same bucket index.
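For instance, with an arbitrarily chosen capacity of 100:

```python
# Two distinct keys that agree modulo the capacity land in the same bucket.
capacity = 100
assert 12836 % capacity == 20736 % capacity  # both map to bucket 36
```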
Hash collisions can lead to incorrect query results, severely affecting the usability of the hash table.
There are mainly two methods for improving the structure of hash tables: "separate chaining" and "open addressing".
## 6.2.1 &nbsp; Separate chaining
In the original hash table, each bucket can store only one key-value pair. "Separate chaining" transforms individual elements into a linked list, with key-value pairs as list nodes, storing all colliding key-value pairs in the same list. Figure 6-5 shows an example of a hash table with separate chaining.
![Separate chaining hash table](hash_collision.assets/hash_table_chaining.png){ class="animation-figure" }
<p align="center"> Figure 6-5 &nbsp; Separate chaining hash table </p>
The operations of a hash table implemented with separate chaining have changed as follows:
- **Querying elements**: Input `key`, pass through the hash function to obtain the bucket index, access the head node of the list, then traverse the list and compare `key` to find the target key-value pair.
- **Adding elements**: First access the list head node via the hash function, then add the node (key-value pair) to the list.
- **Deleting elements**: Access the list head based on the hash function's result, then traverse the list to find and remove the target node.
Separate chaining has the following limitations:
- **Increased space usage**: The linked list contains node pointers, which consume more memory space than arrays.
- **Reduced query efficiency**: Due to the need for linear traversal of the list to find the corresponding element.
The code below provides a simple implementation of a separate chaining hash table, with two things to note:
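A minimal sketch of such a separate-chaining hash table (list-based buckets standing in for linked lists; the capacity, load-factor threshold, and names are illustrative, not the book's exact implementation):

```python
class HashMapChaining:
    """Separate-chaining hash table; buckets are Python lists of (key, value) pairs."""

    def __init__(self):
        self.size = 0                 # number of key-value pairs
        self.capacity = 13            # number of buckets (illustrative)
        self.load_thres = 2.0 / 3.0   # expansion threshold (illustrative)
        self.extend_ratio = 2
        self.buckets = [[] for _ in range(self.capacity)]

    def hash_func(self, key: int) -> int:
        return key % self.capacity

    def get(self, key: int):
        """Return the value for key, or None if absent."""
        for k, v in self.buckets[self.hash_func(key)]:
            if k == key:
                return v
        return None

    def put(self, key: int, val):
        """Insert a key-value pair, or update an existing key."""
        if self.size / self.capacity > self.load_thres:
            self.extend()
        bucket = self.buckets[self.hash_func(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, val)  # update existing key
                return
        bucket.append((key, val))
        self.size += 1

    def remove(self, key: int):
        """Delete a key-value pair if present."""
        bucket = self.buckets[self.hash_func(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket.pop(i)
                self.size -= 1
                return

    def extend(self):
        """Grow the bucket array and re-insert all pairs."""
        pairs = [p for bucket in self.buckets for p in bucket]
        self.capacity *= self.extend_ratio
        self.size = 0
        self.buckets = [[] for _ in range(self.capacity)]
        for k, v in pairs:
            self.put(k, v)
```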
It's worth noting that when the list is very long, the query efficiency $O(n)$ is poor. **At this point, the list can be converted to an "AVL tree" or "Red-Black tree"** to optimize the time complexity of the query operation to $O(\log n)$.
## 6.2.2 &nbsp; Open addressing
"Open addressing" does not introduce additional data structures but uses "multiple probes" to handle hash collisions. The probing methods mainly include linear probing, quadratic probing, and double hashing.
Let's use linear probing as an example to introduce the mechanism of open addressing hash tables.
### 1. &nbsp; Linear probing
Linear probing uses a fixed-step linear search to probe for buckets, and its operations differ from those of an ordinary hash table as follows.
- **Inserting elements**: Calculate the bucket index using the hash function. If the bucket already contains an element, linearly traverse forward from the conflict position (usually with a step size of $1$) until an empty bucket is found, then insert the element.
- **Searching for elements**: If a hash collision is found, use the same step size to linearly traverse forward until the corresponding element is found and return `value`; if an empty bucket is encountered, it means the target element is not in the hash table, so return `None`.
Figure 6-6 shows the distribution of key-value pairs in an open addressing (linear probing) hash table. According to this hash function, keys with the same last two digits will be mapped to the same bucket. Through linear probing, they are stored consecutively in that bucket and the buckets below it.
![Distribution of key-value pairs in open addressing (linear probing) hash table](hash_collision.assets/hash_table_linear_probing.png){ class="animation-figure" }
<p align="center"> Figure 6-6 &nbsp; Distribution of key-value pairs in open addressing (linear probing) hash table </p>
However, **linear probing tends to create "clustering"**. Specifically, the longer a continuous position in the array is occupied, the more likely these positions are to encounter hash collisions, further promoting the growth of these clusters and eventually leading to deterioration in the efficiency of operations.
It's important to note that **we cannot directly delete elements in an open addressing hash table**. Deleting an element creates an empty bucket `None` in the array. When searching for elements, if linear probing encounters this empty bucket, it stops and returns, making the elements beyond this bucket unreachable. The program may incorrectly conclude that these elements do not exist, as shown in Figure 6-7.
![Query issues caused by deletion in open addressing](hash_collision.assets/hash_table_open_addressing_deletion.png){ class="animation-figure" }
<p align="center"> Figure 6-7 &nbsp; Query issues caused by deletion in open addressing </p>
To solve this problem, we can use a "lazy deletion" mechanism: instead of directly removing elements from the hash table, **use a constant `TOMBSTONE` to mark the bucket**. In this mechanism, both `None` and `TOMBSTONE` represent empty buckets and can hold key-value pairs. However, when linear probing encounters `TOMBSTONE`, it should continue traversing since there may still be key-value pairs below it.
The code below implements an open addressing (linear probing) hash table with lazy deletion:

```
[class]{HashMapOpenAddressing}-[func]{}
```
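A self-contained sketch of this mechanism (a linear-probing table whose `remove` leaves a `TOMBSTONE` marker; capacity, threshold, and names are illustrative, not the book's exact implementation):

```python
class HashMapOpenAddressing:
    """Open addressing (linear probing) hash table with lazy deletion."""

    def __init__(self):
        self.size = 0
        self.capacity = 4               # illustrative initial capacity
        self.load_thres = 2.0 / 3.0     # illustrative expansion threshold
        self.extend_ratio = 2
        self.buckets = [None] * self.capacity
        self.TOMBSTONE = ("__tombstone__", None)  # marks removed slots

    def hash_func(self, key: int) -> int:
        return key % self.capacity

    def find_bucket(self, key: int) -> int:
        """Probe for key; return its slot, or the first reusable free slot."""
        index = self.hash_func(key)
        first_tombstone = -1
        while self.buckets[index] is not None:
            pair = self.buckets[index]
            if pair is not self.TOMBSTONE and pair[0] == key:
                # lazily move the pair up to an earlier tombstone, if any
                if first_tombstone != -1:
                    self.buckets[first_tombstone] = pair
                    self.buckets[index] = self.TOMBSTONE
                    return first_tombstone
                return index
            if first_tombstone == -1 and pair is self.TOMBSTONE:
                first_tombstone = index
            index = (index + 1) % self.capacity  # linear probing, step 1
        return index if first_tombstone == -1 else first_tombstone

    def get(self, key: int):
        pair = self.buckets[self.find_bucket(key)]
        if pair is not None and pair is not self.TOMBSTONE:
            return pair[1]
        return None

    def put(self, key: int, val):
        if self.size / self.capacity > self.load_thres:
            self.extend()
        index = self.find_bucket(key)
        pair = self.buckets[index]
        if pair is not None and pair is not self.TOMBSTONE:
            self.buckets[index] = (key, val)  # update existing key
            return
        self.buckets[index] = (key, val)
        self.size += 1

    def remove(self, key: int):
        index = self.find_bucket(key)
        pair = self.buckets[index]
        if pair is not None and pair is not self.TOMBSTONE:
            self.buckets[index] = self.TOMBSTONE  # lazy deletion
            self.size -= 1

    def extend(self):
        """Grow the array; tombstones are dropped during re-insertion."""
        pairs = [p for p in self.buckets
                 if p is not None and p is not self.TOMBSTONE]
        self.capacity *= self.extend_ratio
        self.size = 0
        self.buckets = [None] * self.capacity
        for k, v in pairs:
            self.put(k, v)
```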
### 2. &nbsp; Quadratic probing
Quadratic probing is similar to linear probing and is one of the common strategies of open addressing. When a collision occurs, quadratic probing does not simply skip a fixed number of steps but skips "the square of the number of probes," i.e., $1, 4, 9, \dots$ steps.
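The resulting probe sequence can be sketched as follows (a hypothetical helper, not from the original):

```python
def quadratic_probe(base: int, capacity: int, max_probes: int) -> list[int]:
    """Bucket indices visited by quadratic probing: offsets 1, 4, 9, ..."""
    return [(base + i * i) % capacity for i in range(1, max_probes + 1)]

assert quadratic_probe(3, 13, 4) == [4, 7, 12, 6]  # offsets 1, 4, 9, 16
```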
However, quadratic probing is not perfect:
- Clustering still exists, i.e., some positions are more likely to be occupied than others.
- Due to the growth of squares, quadratic probing may not probe the entire hash table, meaning it might not access empty buckets even if they exist in the hash table.
### 3. &nbsp; Double hashing
As the name suggests, the double hashing method uses multiple hash functions $f_1(x)$, $f_2(x)$, $f_3(x)$, $\dots$ for probing.
- **Inserting elements**: If hash function $f_1(x)$ encounters a conflict, try $f_2(x)$, and so on, until an empty position is found and the element is inserted.
- **Searching for elements**: Search in the same order of hash functions until the target element is found and returned; if an empty position is encountered or all hash functions have been tried, it indicates the element is not in the hash table, then return `None`.
Compared to linear probing, double hashing is less prone to clustering but involves additional computation for multiple hash functions.
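One common formulation derives a key-dependent step size from a second hash function (illustrative; the choice `1 + key % (capacity - 1)` is an assumption, made so the step is never zero):

```python
def double_hash_probe(key: int, capacity: int, max_probes: int) -> list[int]:
    """Bucket indices visited by double hashing: a key-dependent probe step."""
    index = key % capacity           # first hash: base position
    step = 1 + key % (capacity - 1)  # second hash: probe step, never 0
    return [(index + i * step) % capacity for i in range(max_probes)]

assert double_hash_probe(10, 7, 3) == [3, 1, 6]
```

Because the step size varies with the key, colliding keys follow different probe paths, which is why clustering is less pronounced than with linear probing.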
Please note that open addressing (linear probing, quadratic probing, and double hashing) hash tables all have the issue of "not being able to directly delete elements."
## 6.2.3 &nbsp; Choice of programming languages
Various programming languages have adopted different hash table implementation strategies. Here are a few examples:



---
comments: true
icon: material/table-search
---
# Chapter 6. &nbsp; Hash table
![Hash table](../assets/covers/chapter_hashing.jpg){ class="cover-image" }
!!! abstract
## Chapter contents
- [6.1 &nbsp; Hash table](https://www.hello-algo.com/en/chapter_hashing/hash_map/)
- [6.2 &nbsp; Hash collision](https://www.hello-algo.com/en/chapter_hashing/hash_collision/)
- [6.3 &nbsp; Hash algorithm](https://www.hello-algo.com/en/chapter_hashing/hash_algorithm/)
- [6.4 &nbsp; Summary](https://www.hello-algo.com/en/chapter_hashing/summary/)


---
comments: true
---
# 6.4 &nbsp; Summary
### 1. &nbsp; Key review
- Given an input `key`, a hash table can retrieve the corresponding `value` in $O(1)$ time, which is highly efficient.
- Common hash table operations include querying, adding key-value pairs, deleting key-value pairs, and traversing the hash table.