mirror of
https://github.com/labuladong/fucking-algorithm.git
synced 2025-07-04 19:28:07 +08:00
Merge pull request #203 from ExcaliburEX/english
translated the KMP and modified the README.MD fixed #199
@ -81,7 +81,7 @@ This command specifies the `english` branch and limit the depth of clone, get ri
* [What is DP Optimal Substructure](dynamic_programming/OptimalSubstructure.md)
* [Dynamic Programming in Detail](dynamic_programming/动态规划详解进阶.md)
* [Designing Dynamic Programming: Longest Increasing Subsequence](dynamic_programming/动态规划设计:最长递增子序列.md)
* [KMP Character Matching Algorithm in Dynamic Programming](dynamic_programming/动态规划之KMP字符匹配算法.md)
* [KMP](dynamic_programming/KMPCharacterMatchingAlgorithmInDynamicProgramming.md)
* [Solving All LeetCode Stock Trading Problems](dynamic_programming/团灭股票问题.md)
* [Solving All LeetCode House Robber Problems](dynamic_programming/抢房子.md)

@ -0,0 +1,406 @@
# KMP Character Matching Algorithm in Dynamic Programming

**Translator: [ExcaliburEX](https://github.com/ExcaliburEX)**

**Author: [labuladong](https://github.com/labuladong)**

The KMP algorithm (Knuth-Morris-Pratt algorithm) is a well-known string matching algorithm. It is very efficient, but it is also somewhat complicated.

Many readers complain that the KMP algorithm is incomprehensible. That is normal. When I think of how university textbooks explain KMP, I wonder how many future Knuths, Morrises, and Pratts have been scared off early. Some diligent students trace the algorithm by hand to help themselves understand it. That works, but this article instead explains the principle of the algorithm at the level of its logic. In about ten lines of code, KMP is vanquished.

**By convention, throughout this article `pat` denotes the pattern string, of length `M`, and `txt` denotes the text string, of length `N`. The KMP algorithm searches for the substring `pat` in `txt`. If it is found, the algorithm returns the starting index of that substring; otherwise it returns -1**.

I will explain later why I regard KMP as a dynamic programming problem. For dynamic programming, I have stressed many times that the meaning of the `dp` array must be clear, and that the same problem may admit more than one definition of the `dp` array. Different definitions lead to different solutions.

The KMP algorithm that readers have seen before probably processes `pat` with a series of odd-looking operations to produce a one-dimensional array `next`, and then uses that array, through another round of complicated operations, to match `txt`. The time complexity is O(N) and the space complexity is O(M). In fact, the `next` array plays the role of a `dp` array; the meaning of its elements is tied to the prefixes and suffixes of `pat`, and the rules are intricate and hard to follow. **This article uses a two-dimensional `dp` array instead (the space complexity is still O(M), because the second dimension is a constant 256) and redefines the meaning of its elements, which greatly shortens the code and makes it much easier to explain**.

PS: The code in this article follows *Algorithms, 4th Edition*. The array in the original code is named `dfa` (deterministic finite automaton). Since our public account already has a series of articles on dynamic programming, I will avoid such lofty terms; I made a few small changes to the book's code and kept the name `dp` for the array.

### I. Overview of KMP Algorithm

First, let's briefly look at how the KMP algorithm differs from the brute-force algorithm, where its difficulties lie, and how it relates to dynamic programming.

The brute-force string matching algorithm is easy to write. Take a look at its logic:

```java
// Brute-force matching
int search(String pat, String txt) {
    int M = pat.length();
    int N = txt.length();
    for (int i = 0; i <= N - M; i++) {
        int j;
        for (j = 0; j < M; j++) {
            if (pat.charAt(j) != txt.charAt(i + j))
                break;
        }
        // all of pat matched
        if (j == M) return i;
    }
    // pat does not occur in txt
    return -1;
}
```

When the brute-force algorithm hits a mismatched character, it rolls back the pointers into both `txt` and `pat`. With its nested for loops, the time complexity is $O(MN)$ and the space complexity is $O(1)$. The main problem is that when the strings contain many repeated characters, the algorithm does a lot of pointless work.

For example, with txt = "aaacaaab" and pat = "aaab":

![brute force](../pictures/KMP/1.gif)

Clearly `pat` contains no character c at all, so there is no need to roll back the pointer `i`. The brute-force solution performs many unnecessary operations.

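
To make the wasted work concrete, here is a minimal sketch that counts how many character comparisons the brute-force search performs on the two example inputs used in this article. The class `BruteForceCount` and the helper `countComparisons` are hypothetical names introduced only for this illustration.

```java
class BruteForceCount {
    // Hypothetical helper: runs the brute-force search and returns the
    // number of character comparisons made until it finishes.
    static int countComparisons(String pat, String txt) {
        int M = pat.length();
        int N = txt.length();
        int comparisons = 0;
        for (int i = 0; i <= N - M; i++) {
            int j;
            for (j = 0; j < M; j++) {
                comparisons++;
                if (pat.charAt(j) != txt.charAt(i + j)) break;
            }
            // pat found at index i: stop counting
            if (j == M) return comparisons;
        }
        return comparisons;
    }

    public static void main(String[] args) {
        System.out.println(countComparisons("aaab", "aaacaaab")); // 14
        System.out.println(countComparisons("aaab", "aaaaaaab")); // 20
    }
}
```

Both texts are only 8 characters long, so a matcher that never rolls back `i` looks at each text character once, i.e. 8 steps in each case.
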
The KMP algorithm differs in that it spends some space to record information, which makes it look smart in the case above:

![kmp1](../pictures/KMP/2.gif)

A similar example is txt = "aaaaaaab" and pat = "aaab". The brute-force solution rolls back the pointer `i` just as naively as before, while the KMP algorithm is clever again:

![kmp2](../pictures/KMP/3.gif)

Because the KMP algorithm knows that the `a` characters before the character b have already been matched, each step only needs to check whether the character b matches.

**The KMP algorithm never rolls back the pointer `i` into `txt`; it never backtracks (it does not rescan `txt`). Instead, it uses the information stored in the `dp` array to move `pat` to the right position and continue matching**. The time complexity is only O(N). It trades space for time, which is why I consider it a dynamic programming algorithm.

The difficulty of the KMP algorithm lies in two questions: how is the information in the `dp` array computed, and how is the pointer into `pat` moved correctly based on that information? This is where a **deterministic finite automaton** helps. Don't be intimidated by the lofty term. It is really the same idea as a dynamic programming `dp` array, and once you have learned it you can use the term to intimidate others.

One more point must be clear: **computing the `dp` array depends only on the `pat` string**. That is, given a `pat`, I can compute the `dp` array from that pattern alone; after that you can hand me any number of different `txt` strings, and with the same `dp` array I can finish each match in O(N) time.

Concretely, take the two examples mentioned above:

```python
txt1 = "aaacaaab"
pat = "aaab"
txt2 = "aaaaaaab"
pat = "aaab"
```

The `txt` strings differ but `pat` is the same, so the KMP algorithm uses the same `dp` array for both.

The difference is that for the following impending mismatch in `txt1`:

![](../pictures/KMP/txt1.jpg)

the `dp` array instructs `pat` to move like this:

![](../pictures/KMP/txt2.jpg)

PS: This `j` should not be read as an index. More precisely, it is a **state**, which is why it appears at this odd position; this is explained in detail later.

For the following impending mismatch in `txt2`:

![](../pictures/KMP/txt3.jpg)

the `dp` array instructs `pat` to move like this:

![](../pictures/KMP/txt4.jpg)

Once we understand that the `dp` array depends only on `pat`, we can design the KMP algorithm more elegantly like this:

```java
public class KMP {
    private int[][] dp;
    private String pat;

    public KMP(String pat) {
        this.pat = pat;
        // Build the dp array from pat
        // Takes O(M) time
    }

    public int search(String txt) {
        // Match txt with the help of the dp array
        // Takes O(N) time
        return -1; // placeholder until the matching logic is filled in
    }
}
```

This way, when we need to match the same `pat` against different `txt` strings, we don't waste time rebuilding the `dp` array:

```java
KMP kmp = new KMP("aaab");
int pos1 = kmp.search("aaacaaab"); // 4
int pos2 = kmp.search("aaaaaaab"); // 4
```

### II. Overview of State Machines

Why is the KMP algorithm related to state machines? Because we can regard matching `pat` as a sequence of state transitions. For example, when pat = "ABABC":

![](../pictures/KMP/state.jpg)

As shown above, the number in each circle is a state. State 0 is the start state and state 5 (`pat.length`) is the terminal state. Matching begins with `pat` in the start state; once the terminal state is reached, `pat` has been found in `txt`. For example, being in state 2 means that the characters "AB" have been matched:

![](../pictures/KMP/state2.jpg)

In addition, the transitions of `pat` differ from state to state. For example, suppose matching has reached state 4. On the character A it should move to state 3, on the character C it should move to state 5, and on the character B it should move to state 0:

![](../pictures/KMP/state4.jpg)

What does this mean concretely? Let's go through the cases one by one. The variable `j` is a pointer to the current state, and `pat` has currently been matched up to state 4:

![](../pictures/KMP/exp1.jpg)

If the character "A" is encountered, the smartest move is to transition to state 3, as the arrow indicates:

![](../pictures/KMP/exp3.jpg)

If the character "B" is encountered, the arrow leaves no choice but to return to state 0 (back to square one):

![](../pictures/KMP/exp5.jpg)

If the character "C" is encountered, the arrow leads to the terminal state 5, which means the match is complete:

![](../pictures/KMP/exp7.jpg)

Of course, other characters may also be encountered, such as Z. In that case the state should obviously move to the start state 0, because `pat` contains no Z at all:

![](../pictures/KMP/z.jpg)

For clarity, the state diagrams here omit the arrows on which other characters lead to state 0 and draw only the transitions for characters that appear in `pat`:

![](../pictures/KMP/allstate.jpg)

The most critical step of the KMP algorithm is constructing this state transition diagram. **To determine a state transition, two variables must be specified: the current matching state and the character encountered**. Once these two variables are fixed, you know which state to move to.

Now let's watch the KMP algorithm match the string `txt` according to this state transition diagram:

![](../pictures/KMP/kmp.gif)

**Remember the matching process in this GIF; it is the core logic of the KMP algorithm**!

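
As a small illustration of the matching process, the sketch below hard-codes the transitions for pat = "ABABC" exactly as described above (state 4 goes to 3 on A, to 0 on B, and to 5 on C) and walks a text through them. The class `AbabcAutomaton`, the table `NEXT`, and the helper `step` are hypothetical names chosen for this example; the table covers only the characters A, B, and C, and every other character is sent to state 0.

```java
class AbabcAutomaton {
    // Hand-written transitions for pat = "ABABC", read off the state diagram.
    // Rows are states 0..4; columns are the characters 'A', 'B', 'C'.
    static final int[][] NEXT = {
        {1, 0, 0}, // state 0: nothing matched
        {1, 2, 0}, // state 1: "A" matched
        {3, 0, 0}, // state 2: "AB" matched
        {1, 4, 0}, // state 3: "ABA" matched
        {3, 0, 5}, // state 4: "ABAB" matched
    };

    static int step(int state, char c) {
        // A character that does not occur in pat leads back to state 0.
        if (c < 'A' || c > 'C') return 0;
        return NEXT[state][c - 'A'];
    }

    static int search(String txt) {
        int j = 0; // start state
        for (int i = 0; i < txt.length(); i++) {
            j = step(j, txt.charAt(i));
            // terminal state 5 reached: return the start index of the match
            if (j == 5) return i - 5 + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("ABABABC")); // 2
        System.out.println(search("ABABZC"));  // -1
    }
}
```

For "ABABABC" the states visited are 1, 2, 3, 4, 3, 4, 5: the fifth character A moves state 4 back to state 3 rather than to 0, and the pointer into the text never moves backwards.
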
To describe the state transition diagram, we define a two-dimensional `dp` array with the following meaning:

```python
dp[j][c] = next
0 <= j < M, the current state
0 <= c < 256, the character encountered (ASCII code)
0 <= next <= M, the next state

dp[4]['A'] = 3 means:
the current state is 4; if the character A is encountered,
pat should transition to state 3

dp[1]['B'] = 2 means:
the current state is 1; if the character B is encountered,
pat should transition to state 2
```

Based on this definition of the `dp` array and the state transition process described above, we can already write the `search` function of the KMP algorithm:

```java
public int search(String txt) {
    int M = pat.length();
    int N = txt.length();
    // The initial state of pat is 0
    int j = 0;
    for (int i = 0; i < N; i++) {
        // The current state is j and the character txt[i] is encountered.
        // Which state should pat move to?
        j = dp[j][txt.charAt(i)];
        // If the terminal state is reached, return the index where the match starts
        if (j == M) return i - M + 1;
    }
    // Terminal state not reached: the match failed
    return -1;
}
```

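
One practical caveat, offered as a hedged sketch rather than as part of the original code: the table has only 256 columns, so `dp[j][txt.charAt(i)]` throws an `ArrayIndexOutOfBoundsException` when `txt` contains a character whose code is 256 or above. Assuming that `pat` itself contains only characters below 256, such a text character can never advance the match, so one possible guard is to restart from state 0. The class `SafeSearch` and the tiny hand-built table for pat = "AB" are hypothetical and exist only for this illustration.

```java
class SafeSearch {
    // Hand-built table for pat = "AB" (assumption: built the way the article describes).
    static int[][] demoTable() {
        int[][] dp = new int[2][256];
        dp[0]['A'] = 1; // state 0 advances on 'A'
        dp[1]['A'] = 1; // state 1 restarts to state 1 on 'A'
        dp[1]['B'] = 2; // state 1 advances to the terminal state on 'B'
        return dp;
    }

    static int search(int[][] dp, String txt) {
        int M = dp.length;
        int j = 0;
        for (int i = 0; i < txt.length(); i++) {
            char c = txt.charAt(i);
            // A character outside the table cannot occur in pat: restart from state 0.
            j = (c < 256) ? dp[j][c] : 0;
            if (j == M) return i - M + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        // \u4e2d is a character with a code far above 255
        System.out.println(search(demoTable(), "A\u4e2dAB")); // 2
    }
}
```
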
Up to this point things should still be easy to follow: the `dp` array is simply the state transition diagram we drew above. If anything is unclear, go back and watch the GIF of the matching process again. The next question is how to build this `dp` array from `pat`.

### III. Building a state transition diagram

Recall what was said above: **to determine a state transition, two variables must be specified: the current matching state and the character encountered**. We have already fixed the meaning of the `dp` array according to this logic, so the framework for constructing the `dp` array looks like this:

```python
for 0 <= j < M: # state
    for 0 <= c < 256: # character
        dp[j][c] = next
```

How do we find this `next` state? Obviously, **if the character `c` encountered matches `pat[j]`**, the state should advance by one, that is, `next = j + 1`. Let's call this case **state advance**:

![](../pictures/KMP/forward.jpg)

**If the character `c` does not match `pat[j]`**, the state has to fall back (or stay where it is). Let's call this case **state restart**:

![](../pictures/KMP/back.jpg)

So how do we know which state to restart from? Before answering, let's define one more name: the **shadow state** (a name I made up), represented by the variable `X`. **The shadow state is a state that shares the same prefix as the current state**. For example:

![](../pictures/KMP/shadow.jpg)

The current state is `j = 4` and its shadow state is `X = 2`; both share the prefix "AB". In other words, the last two characters matched so far, "AB", are also the first two characters of `pat`. Because states `X` and `j` share a prefix, when state `j` needs to restart (the character `c` encountered does not match `pat[j]`), it can consult the transitions of `X` to obtain the **nearest restart position**.

For example, in the situation above, where should state `j` go when it encounters the character "A"? Only "C" can advance the state, so on "A" it can only restart. **State `j` delegates this character to state `X`, that is, `dp[j]['A'] = dp[X]['A']`**:

![](../pictures/KMP/shadow1.jpg)

Why does this work? Since `j` has already determined that "A" cannot advance the state, it **can only fall back**, and KMP wants to **fall back as little as possible** to avoid redundant work. So `j` asks `X`, which shares its prefix: if `X` can perform a "state advance" on "A", then `j` moves there, because that is the smallest possible fallback.

![](../pictures/KMP/A.gif)

Of course, if the character encountered is "B", state `X` cannot advance either and has to fall back; `j` simply falls back in whatever direction `X` indicates:

![](../pictures/KMP/shadow2.jpg)

You may ask how `X` knows that on the character "B" it should fall back to state 0. Because `X` always trails behind `j`, the transitions of state `X` have already been computed earlier. Isn't using past results to solve the current problem exactly what dynamic programming does?

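
The delegation rule can be tried out in isolation. The sketch below assumes the transitions of the shadow state `X = 2` for pat = "ABABC" are already known (it advances to state 3 on A and goes to state 0 on everything else) and derives the transitions of state `j = 4` from them. `DelegateToShadow` and `buildRow` are hypothetical names used only here.

```java
class DelegateToShadow {
    // Derive the transitions of state j from those of its shadow state X.
    static int[] buildRow(int[] shadowRow, int j, char patJ) {
        // State restart: every character behaves as it does in state X ...
        int[] row = shadowRow.clone();
        // ... except pat[j], the only character that advances the state.
        row[patJ] = j + 1;
        return row;
    }

    public static void main(String[] args) {
        // Transitions of X = 2 ("AB" matched) for pat = "ABABC"
        int[] rowX = new int[256];
        rowX['A'] = 3;

        int[] row4 = buildRow(rowX, 4, 'C');
        System.out.println(row4['A']); // 3, delegated to X
        System.out.println(row4['B']); // 0, delegated to X
        System.out.println(row4['C']); // 5, state advance
    }
}
```
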
With this, we can refine the framework code from above:

```python
int X # shadow state
for 0 <= j < M:
    for 0 <= c < 256:
        if c == pat[j]:
            # state advance
            dp[j][c] = j + 1
        else:
            # state restart
            # delegate to X to compute the restart position
            dp[j][c] = dp[X][c]
```

### IV. Code Implementation

If you have followed everything so far, congratulations! Only one question remains: how is the shadow state `X` obtained? Let's look directly at the complete code.

```java
public class KMP {
    private int[][] dp;
    private String pat;

    public KMP(String pat) {
        this.pat = pat;
        int M = pat.length();
        // dp[state][character] = next state
        dp = new int[M][256];
        // base case
        dp[0][pat.charAt(0)] = 1;
        // The shadow state X is initially 0
        int X = 0;
        // The current state j starts at 1
        for (int j = 1; j < M; j++) {
            for (int c = 0; c < 256; c++) {
                if (pat.charAt(j) == c)
                    dp[j][c] = j + 1;
                else
                    dp[j][c] = dp[X][c];
            }
            // Update the shadow state
            X = dp[X][pat.charAt(j)];
        }
    }

    public int search(String txt) {...}
}
```

First, an explanation of this line of code:

```java
// base case
dp[0][pat.charAt(0)] = 1;
```

This line is the base case. Only the character `pat[0]` can move the state from 0 to 1; on any other character the state stays at 0 (Java initializes arrays to 0 by default).

The shadow state `X` is initialized to 0 and then updated continuously as `j` advances. Let's see **how the shadow state `X` is updated**:

```java
int X = 0;
for (int j = 1; j < M; j++) {
    ...
    // Update the shadow state
    // The current state is X and the character pat[j] is encountered.
    // Which state should pat move to?
    X = dp[X][pat.charAt(j)];
}
```

Updating `X` is in fact very similar to updating the state `j` in the `search` function:

```java
int j = 0;
for (int i = 0; i < N; i++) {
    // The current state is j and the character txt[i] is encountered.
    // Which state should pat move to?
    j = dp[j][txt.charAt(i)];
    ...
}
```

**The principle is quite subtle**. Pay attention to the initial values of the loop variables in the two for loops. You can think of it this way: the latter matches `pat` against `txt`, while the former matches `pat` against `pat[1..end]`. State `X` therefore always lags at least one state behind state `j` and shares the longest common prefix with `j`. That is why I compare `X` to a shadow, which seems fairly apt.

Moreover, the `dp` array is built by working forward from the base case `dp[0][..]`. This is why I consider the KMP algorithm a dynamic programming algorithm.

Look at the complete construction of the state transition diagram below and you will appreciate how elegant the role of state `X` is:

![](../pictures/KMP/dfa.gif)

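
To see how `X` trails `j` in practice, the following sketch runs the same construction loop and records the value of `X` that is used while the transitions of each state `j` are being built. The class `ShadowTrace` and the method `shadowStates` are hypothetical names for this illustration, and the code assumes a non-empty pattern whose characters all have codes below 256.

```java
import java.util.Arrays;

class ShadowTrace {
    // Returns an array where shadow[j] is the value of X used while
    // building the transitions of state j (shadow[0] is unused and stays 0).
    static int[] shadowStates(String pat) {
        int M = pat.length();
        int[][] dp = new int[M][256];
        dp[0][pat.charAt(0)] = 1;
        int[] shadow = new int[M];
        int X = 0;
        for (int j = 1; j < M; j++) {
            shadow[j] = X;
            for (int c = 0; c < 256; c++)
                dp[j][c] = dp[X][c];
            dp[j][pat.charAt(j)] = j + 1;
            // X consumes pat[j], one character behind j
            X = dp[X][pat.charAt(j)];
        }
        return shadow;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(shadowStates("ABABC"))); // [0, 0, 0, 1, 2]
        System.out.println(Arrays.toString(shadowStates("aaab")));  // [0, 0, 1, 2]
    }
}
```

For "ABABC" the last entry confirms the example from the previous section: when `j = 4`, the shadow state is `X = 2`.
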
At this point the core of the KMP algorithm is finally complete. Here is the full code:

```java
public class KMP {
    private int[][] dp;
    private String pat;

    public KMP(String pat) {
        this.pat = pat;
        int M = pat.length();
        // dp[state][character] = next state
        dp = new int[M][256];
        // base case
        dp[0][pat.charAt(0)] = 1;
        // The shadow state X is initially 0
        int X = 0;
        // Build the state transition diagram (slightly more compact)
        for (int j = 1; j < M; j++) {
            for (int c = 0; c < 256; c++)
                dp[j][c] = dp[X][c];
            dp[j][pat.charAt(j)] = j + 1;
            // Update the shadow state
            X = dp[X][pat.charAt(j)];
        }
    }

    public int search(String txt) {
        int M = pat.length();
        int N = txt.length();
        // The initial state of pat is 0
        int j = 0;
        for (int i = 0; i < N; i++) {
            // Compute the next state of pat
            j = dp[j][txt.charAt(i)];
            // Terminal state reached: return the result
            if (j == M) return i - M + 1;
        }
        // Terminal state not reached: the match failed
        return -1;
    }
}
```

After the detailed examples above, you should be able to understand this code. Of course, you can also write the KMP algorithm as a single function. The core code is just the for loops in the two functions. Count them: are there more than ten lines?

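
One way to gain confidence in the code above is to compare it against `String.indexOf` on many random inputs. The sketch below restates the construction and the search as static helpers and counts disagreements; `KmpCheck`, `randomString`, and `mismatches` are hypothetical names, the alphabet is restricted to "a" and "b" so that repeated prefixes occur often, and patterns are always non-empty (the constructor above does not handle an empty `pat`).

```java
import java.util.Random;

class KmpCheck {
    // Same construction as in the article, written as a static helper.
    static int[][] build(String pat) {
        int M = pat.length();
        int[][] dp = new int[M][256];
        dp[0][pat.charAt(0)] = 1;
        int X = 0;
        for (int j = 1; j < M; j++) {
            for (int c = 0; c < 256; c++)
                dp[j][c] = dp[X][c];
            dp[j][pat.charAt(j)] = j + 1;
            X = dp[X][pat.charAt(j)];
        }
        return dp;
    }

    static int search(int[][] dp, String txt) {
        int M = dp.length;
        int j = 0;
        for (int i = 0; i < txt.length(); i++) {
            j = dp[j][txt.charAt(i)];
            if (j == M) return i - M + 1;
        }
        return -1;
    }

    // Random string over the two-letter alphabet {a, b}
    static String randomString(Random rnd, int len) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < len; k++)
            sb.append((char) ('a' + rnd.nextInt(2)));
        return sb.toString();
    }

    // Returns how many random cases disagree with String.indexOf.
    static int mismatches(int rounds, long seed) {
        Random rnd = new Random(seed);
        int bad = 0;
        for (int r = 0; r < rounds; r++) {
            String pat = randomString(rnd, 1 + rnd.nextInt(5));
            String txt = randomString(rnd, rnd.nextInt(20));
            if (search(build(pat), txt) != txt.indexOf(pat)) bad++;
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println(mismatches(10000, 42L)); // 0
    }
}
```
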
### V. Conclusion

The traditional KMP algorithm uses a one-dimensional array `next` to record prefix information, whereas this article uses a two-dimensional array `dp` and solves the string matching problem from the perspective of state transitions. The space complexity is still O(256M) = O(M).

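
For comparison, here is a hedged sketch of the classical one-dimensional formulation. It is not the method of this article, and textbooks differ in how exactly they define the `next` array; the variant below stores, for each position, the length of the longest proper prefix of `pat` that is also a suffix of the part matched so far, and calls the array `fail`. `PrefixKmp`, `failure`, and `fail` are names chosen for this illustration.

```java
class PrefixKmp {
    // fail[i] = length of the longest proper prefix of pat[0..i]
    // that is also a suffix of pat[0..i]
    static int[] failure(String pat) {
        int[] fail = new int[pat.length()];
        int k = 0;
        for (int i = 1; i < pat.length(); i++) {
            // fall back until pat[i] can extend a shorter prefix
            while (k > 0 && pat.charAt(i) != pat.charAt(k)) k = fail[k - 1];
            if (pat.charAt(i) == pat.charAt(k)) k++;
            fail[i] = k;
        }
        return fail;
    }

    static int search(String pat, String txt) {
        if (pat.isEmpty()) return 0;
        int[] fail = failure(pat);
        int k = 0; // number of pattern characters currently matched
        for (int i = 0; i < txt.length(); i++) {
            while (k > 0 && txt.charAt(i) != pat.charAt(k)) k = fail[k - 1];
            if (txt.charAt(i) == pat.charAt(k)) k++;
            if (k == pat.length()) return i - k + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("aaab", "aaacaaab")); // 4
        System.out.println(search("aaab", "aaaaaaab")); // 4
    }
}
```

The trade-off is visible in the code: this version needs only M integers but resolves each mismatch with a `while` loop at match time, while the `dp` table spends 256 integers per state so that every text character costs exactly one lookup.
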
While matching `pat` against `txt`, as long as two questions are answered, "what is the current state" and "what character is encountered", the state to move to (forward or back) is determined.

A pattern string `pat` has M states in total (not counting the terminal state), and there are at most 256 distinct ASCII characters. So we construct an array `dp[M][256]` to cover all cases, and make the meaning of the `dp` array explicit:

`dp[j][c] = next` means that the current state is `j`, the character `c` is encountered, and the state should move to `next`.

With that meaning clear, the code for the `search` function is easy to write.

To build this `dp` array we need an auxiliary state `X`, which always lags behind the current state `j` and shares the longest common prefix with `j`. We named it the "shadow state".

When constructing the transitions of the current state `j`, only the character `pat[j]` can advance the state (`dp[j][pat[j]] = j + 1`); every other character forces a fallback, and the shadow state `X` tells us where to fall back to (`dp[j][other] = dp[X][other]`, where `other` is any character except `pat[j]`).

The shadow state `X` is initialized to 0 and updated as `j` advances. The update is very similar to the way the search process updates `j` (`X = dp[X][pat[j]]`).

The KMP algorithm, too, comes down to dynamic programming. The article directory of our public account has a series of articles devoted to dynamic programming, all following one framework: describe the logic of the problem, clarify the meaning of the `dp` array, and define the base case. That is all there is to it. I hope this article gives you a deeper understanding of dynamic programming.

@ -1,404 +0,0 @@
# 动态规划之KMP字符匹配算法 (KMP Character Matching Algorithm in Dynamic Programming)