Add Regular Expression - Dynamic Programming

1. 更改 pictures\正则\example.png内容,另存为 regex_example.jpg;
2. pictures\正则\title.png 内容以markdown 代码块形式呈现。
This commit is contained in:
PaperJetJia
2020-02-29 21:51:28 +08:00
parent a962ba42d1
commit 43d034c677
5 changed files with 210 additions and 170 deletions

View File

@ -0,0 +1,210 @@
The previous article, "Dynamic Programming in Detail," was very well-received. Today, I'm going to talk about a practical application: Regular Expression. I highly recommend you to take a look at the previous article, if you don't know what is "Dynamic Programming".
Regular Expression is an ingenious algorithm but is a little bit hard to understand. This article mainly focuses on the implementation of two Regular Expression symbols: period「.」and asterisk「*」. Don't worry if you have never used Regular Expression; I will introduce it later. At the end of this article, I'll share with you a tip to quickly find the overlapping subproblems.
Another important purpose of this article is to teach readers how to design their own algorithms. Sometimes it's hard to understand others' comprehensive algorithms. And we even feel exhausted. But in this article, I just want to tell you that designing an algorithm is not a piece of cake; instead, it's a continuing upward spiral of progress. You need to work for it, refine it and perfect it. I'll try my best to make you know how do we simplify problems and build the final solutions from the simplest framework.
Without wasting time, let's dive into it:
Given a string(s) and a string mode(p). Implement the Regular Expression that supports the '.' and '*' match.
```
'.' matches any single character
'*' matches zero or more before "*"
Note: Matches should cover the whole string, not part of it.
```
Demo1:
```text
Input:
s = "aa"
p = "a*"
Output:
true
Explanation:
'*' represents matching zero or more characters before the "*".
"a*" can either match "a" or "aa".
```
Demo2:
```text
Input:
s = "aab"
p = "c*a*b"
Output:
true
Explanation:
'c' and 'a' can appear zero time or more than one times.
```
Demo3:
```text
Input:
s = "ab"
p = ".*"
Output:
true
Explanation:
".*" means matching zero or more('*') arbitrary character('.').
```
### 1. Warm-up
The first step, let's ignore the regular symbols for a moment. If you're comparing two normal strings, how do you match them? I think this is an algorithm that anyone can write:
```cpp
bool isMatch(string text, string pattern) {
if (text.size() != pattern.size())
return false;
for (int j = 0; j < pattern.size(); j++) {
if (pattern[j] != text[j])
return false;
}
return true;
}
```
Then, I'm going to tweak the above code a little bit. It's a little more complicated, but it still means the same thing.
```cpp
bool isMatch(string text, string pattern) {
int i = 0; // Index position of text
int j = 0; // Index position of pattern
while (j < pattern.size()) {
if (i >= text.size())
return false;
if (pattern[j++] != text[i++])
return false;
}
// Equality means matching is complete
return j == text.size();
}
```
The above rewriting is to transform this algorithm into a recursive algorithm (pseudo code) :
```python
def isMatch(text, pattern) -> bool:
if pattern is empty: return (text is empty?)
first_match = (text not empty) and pattern[0] == text[0]
return first_match and isMatch(text[1:], pattern[1:])
```
If you can understand this code. Congratulations! your recursion idea is in place. The regular expression algorithm, though a little complicated, is actually based on this recursive code.
### 2. Handle the dot 「.」 wildcard
Dot can match any character. Very handy! In fact, it is the simplest with a little modification:
```python
def isMatch(text, pattern) -> bool:
if not pattern: return not text
first_match = bool(text) and pattern[0] in {text[0], '.'}
return first_match and isMatch(text[1:], pattern[1:])
```
### 3. Handle the 「*」 wildcard
The asterisk wildcard allows the previous character to be repeated any number of times, including zero. How many times? It's a little hard to say, but don't worry, we can at least build the framework further:
```python
def isMatch(text, pattern) -> bool:
if not pattern: return not text
first_match = bool(text) and pattern[0] in {text[0], '.'}
if len(pattern) >= 2 and pattern[1] == '*':
# find '*' wildcard
else:
return first_match and isMatch(text[1:], pattern[1:])
```
How many times does the character before the asterisk have to be repeated? It depends on the computation of a computer. Let's say, N times. As I've said many times before, the trick to writing recursion is to focus on the present issues and leave the rest to the recursion. Here, no matter what N is, there are only two choices: zero matches and one match. So you can do it this way:
```py
if len(pattern) >= 2 and pattern[1] == '*':
return isMatch(text, pattern[2:]) or \
first_match and isMatch(text[1:], pattern)
# explanation:
#if a character is found in combination with '*',
# or match the character 0 times, then skip the character and '*'
# or move the text when the pattern[0] matches the text[0]
```
As we can see, we keep the 「\*」 in the pattern and pushed the text backwards to implement the function of 「*」 to match the characters repeatedly for many times. A simple example will illustrate the logic. Suppose 'pattern = a*', 'text = aaa', we can draw a picture to see the matching process:
![example](../pictures/%E6%AD%A3%E5%88%99/regex_example.jpg)
At this point, the regular expression algorithm is almost complete.
### 4. Dynamic Programming
I chose to use the 「memo」 recursion to lower complexity. With the brute force method, the optimization process is as simple as using two variables I and j to record the current matching position, thus avoiding substring slices, and storing I and j in memos to avoid double counting.
I put the violent solution and the optimized solution together so that you can easily compare. You will find that the optimized solution is nothing more than the violent solution "translated", add a memo. That's all!
```py
# Recursion with memo
def isMatch(text, pattern) -> bool:
memo = dict() # memo
def dp(i, j):
if (i, j) in memo: return memo[(i, j)]
if j == len(pattern): return i == len(text)
first = i < len(text) and pattern[j] in {text[i], '.'}
if j <= len(pattern) - 2 and pattern[j + 1] == '*':
ans = dp(i, j + 2) or \
first and dp(i + 1, j)
else:
ans = first and dp(i + 1, j + 1)
memo[(i, j)] = ans
return ans
return dp(0, 0)
# brute force recursive
def isMatch(text, pattern) -> bool:
if not pattern: return not text
first = bool(text) and pattern[0] in {text[0], '.'}
if len(pattern) >= 2 and pattern[1] == '*':
return isMatch(text, pattern[2:]) or \
first and isMatch(text[1:], pattern)
else:
return first and isMatch(text[1:], pattern[1:])
```
**Some readers may ask, how do you know that this problem is a dynamic programming problem, how do you know that there is an overlapping subproblem? It's not easy to find that!**
**有的读者也许会问,你怎么知道这个问题是个动态规划问题呢,你怎么知道它就存在「重叠子问题」呢,这似乎不容易看出来呀?**
The clearest way is to answer this question is to assume an input and then draw a recursion tree. And you will definitely find the same node, which is a quantitative analysis. In fact, without so much trouble, let me teach you the qualitative analysis, at a glance can see the "overlapping sub-problem" property.
Taking the simplest Fibonacci sequence for example, we abstract the framework of recursive algorithm:
```py
def fib(n):
fib(n - 1) #1
fib(n - 2) #2
```
Look at the frame, how do I get from the original problem f(n) to the sub-problem f(n - 2)? There are two paths, one is f(n) -> #1 -> #1, and the other is f(n) -> #2. The former recurses twice; the latter recurse once. Two different computational paths but all face the same problem, which is called the "overlap subproblem". It is certain that **as long as you find a repeated path, there must be tens of thousands of such repeated paths, meaning that the huge quantum problem overlaps.**
**只要你发现一条重复路径,这样的重复路径一定存在千万条,意味着巨量子问题重叠。**
Similarly, for this problem, we still abstract the algorithm framework first:
```py
def dp(i, j):
dp(i, j + 2) #1
dp(i + 1, j) #2
dp(i + 1, j + 1) #3
```
A similar problem is raised. How can we reach the subproblem dp(i, j) from the original problem dp(i + 2, j + 2)? There are at least two paths, one is the dp(i, j) - > # 3 - > # 3, the second is the dp (i, j) - > # 1 - > #2 - > #2. Therefore, there must be overlapping subproblems in this problem, which need the optimization skills of dynamic programming to deal with.
### 5. Summary
In this article, you have gained a deep insight into the algorithmic implementation of two common wildcards for regular expressions. In fact, the implementation of the dot 「.」is very simple. The key is that the implementation of the asterisk 「*」 needs to use dynamic programming skills, a little more complex. But it breaks down under our analysis. In addition, you have developed a technique for quickly analyzing the nature of overlapping subproblems, allowing you to quickly determine whether a problem can be solved using dynamic programming.
Reviewing the whole process, you should be able to understand the process of algorithm design: from similar simple problems to the basic framework of the gradual assembly of new logic, eventually become a more complex, sophisticated algorithm. So, you guys don't be afraid of some more complex algorithm problems. No matter how big the algorithm in your eyes is just a piece of cake.
If this article is helpful to you, welcome to pay attention to my wechat official account **labuladong**, I'm committed to make the algorithm problem more clear ~

View File

@ -1,170 +0,0 @@
# 动态规划之正则表达
之前的文章「动态规划详解」收到了普遍的好评,今天写一个动态规划的实际应用:正则表达式。如果有读者对「动态规划」还不了解,建议先看一下上面那篇文章。
正则表达式匹配是一个很精妙的算法,而且难度也不小。本文主要写两个正则符号的算法实现:点号「.」和星号「*」,如果你用过正则表达式,应该明白他们的用法,不明白也没关系,等会会介绍。文章的最后,介绍了一种快速看出重叠子问题的技巧。
本文还有一个重要目的,就是教会读者如何设计算法。我们平时看别人的解法,直接看到一个面面俱到的完整答案,总觉得无法理解,以至觉得问题太难,自己太菜。我力求向读者展示,算法的设计是一个螺旋上升、逐步求精的过程,绝不是一步到位就能写出正确算法。本文会带你解决这个较为复杂的问题,让你明白如何化繁为简,逐个击破,从最简单的框架搭建出最终的答案。
前文无数次强调的框架思维,就是在这种设计过程中逐步培养的。下面进入正题,首先看一下题目:
![title](../pictures/%E6%AD%A3%E5%88%99/title.png)
### 一、热身
第一步,我们暂时不管正则符号,如果是两个普通的字符串进行比较,如何进行匹配?我想这个算法应该谁都会写:
```cpp
bool isMatch(string text, string pattern) {
if (text.size() != pattern.size())
return false;
for (int j = 0; j < pattern.size(); j++) {
if (pattern[j] != text[j])
return false;
}
return true;
}
```
然后,我稍微改造一下上面的代码,略微复杂了一点,但意思还是一样的,很容易理解吧:
```cpp
bool isMatch(string text, string pattern) {
int i = 0; // text 的索引位置
int j = 0; // pattern 的索引位置
while (j < pattern.size()) {
if (i >= text.size())
return false;
if (pattern[j++] != text[i++])
return false;
}
// 相等则说明完成匹配
return j == text.size();
}
```
如上改写,是为了将这个算法改造成递归算法(伪码):
```python
def isMatch(text, pattern) -> bool:
if pattern is empty: return (text is empty?)
first_match = (text not empty) and pattern[0] == text[0]
return first_match and isMatch(text[1:], pattern[1:])
```
如果你能够理解这段代码,恭喜你,你的递归思想已经到位,正则表达式算法虽然有点复杂,其实是基于这段递归代码逐步改造而成的。
### 二、处理点号「.」通配符
点号可以匹配任意一个字符,万金油嘛,其实是最简单的,稍加改造即可:
```python
def isMatch(text, pattern) -> bool:
if not pattern: return not text
first_match = bool(text) and pattern[0] in {text[0], '.'}
return first_match and isMatch(text[1:], pattern[1:])
```
### 三、处理「*」通配符
星号通配符可以让前一个字符重复任意次数,包括零次。那到底是重复几次呢?这似乎有点困难,不过不要着急,我们起码可以把框架的搭建再进一步:
```python
def isMatch(text, pattern) -> bool:
if not pattern: return not text
first_match = bool(text) and pattern[0] in {text[0], '.'}
if len(pattern) >= 2 and pattern[1] == '*':
# 发现 '*' 通配符
else:
return first_match and isMatch(text[1:], pattern[1:])
```
星号前面的那个字符到底要重复几次呢?这需要计算机暴力穷举来算,假设重复 N 次吧。前文多次强调过,写递归的技巧是管好当下,之后的事抛给递归。具体到这里,不管 N 是多少,当前的选择只有两个:匹配 0 次、匹配 1 次。所以可以这样处理:
```py
if len(pattern) >= 2 and pattern[1] == '*':
return isMatch(text, pattern[2:]) or \
first_match and isMatch(text[1:], pattern)
# 解释:如果发现有字符和 '*' 结合,
# 或者匹配该字符 0 次,然后跳过该字符和 '*'
# 或者当 pattern[0] 和 text[0] 匹配后,移动 text
```
可以看到,我们是通过保留 pattern 中的「\*」,同时向后推移 text来实现「*」将字符重复匹配多次的功能。举个简单的例子就能理解这个逻辑了。假设 `pattern = a*`, `text = aaa`,画个图看看匹配过程:
![example](../pictures/%E6%AD%A3%E5%88%99/example.png)
至此,正则表达式算法就基本完成了,
### 四、动态规划
我选择使用「备忘录」递归的方法来降低复杂度。有了暴力解法,优化的过程及其简单,就是使用两个变量 i, j 记录当前匹配到的位置,从而避免使用子字符串切片,并且将 i, j 存入备忘录,避免重复计算即可。
我将暴力解法和优化解法放在一起,方便你对比,你可以发现优化解法无非就是把暴力解法「翻译」了一遍,加了个 memo 作为备忘录,仅此而已。
```py
# 带备忘录的递归
def isMatch(text, pattern) -> bool:
memo = dict() # 备忘录
def dp(i, j):
if (i, j) in memo: return memo[(i, j)]
if j == len(pattern): return i == len(text)
first = i < len(text) and pattern[j] in {text[i], '.'}
if j <= len(pattern) - 2 and pattern[j + 1] == '*':
ans = dp(i, j + 2) or \
first and dp(i + 1, j)
else:
ans = first and dp(i + 1, j + 1)
memo[(i, j)] = ans
return ans
return dp(0, 0)
# 暴力递归
def isMatch(text, pattern) -> bool:
if not pattern: return not text
first = bool(text) and pattern[0] in {text[0], '.'}
if len(pattern) >= 2 and pattern[1] == '*':
return isMatch(text, pattern[2:]) or \
first and isMatch(text[1:], pattern)
else:
return first and isMatch(text[1:], pattern[1:])
```
**有的读者也许会问,你怎么知道这个问题是个动态规划问题呢,你怎么知道它就存在「重叠子问题」呢,这似乎不容易看出来呀?**
解答这个问题,最直观的应该是随便假设一个输入,然后画递归树,肯定是可以发现相同节点的。这属于定量分析,其实不用这么麻烦,下面我来教你定性分析,一眼就能看出「重叠子问题」性质。
先拿最简单的斐波那契数列举例,我们抽象出递归算法的框架:
```py
def fib(n):
fib(n - 1) #1
fib(n - 2) #2
```
看着这个框架,请问原问题 f(n) 如何触达子问题 f(n - 2) ?有两种路径,一是 f(n) -> #1 -> #1, 二是 f(n) -> #2。前者经过两次递归,后者进过一次递归而已。两条不同的计算路径都到达了同一个问题,这就是「重叠子问题」,而且可以肯定的是,**只要你发现一条重复路径,这样的重复路径一定存在千万条,意味着巨量子问题重叠。**
同理,对于本问题,我们依然先抽象出算法框架:
```py
def dp(i, j):
dp(i, j + 2) #1
dp(i + 1, j) #2
dp(i + 1, j + 1) #3
```
提出类似的问题,请问如何从原问题 dp(i, j) 触达子问题 dp(i + 2, j + 2) ?至少有两种路径,一是 dp(i, j) -> #3 -> #3,二是 dp(i, j) -> #1 -> #2 -> #2。因此,本问题一定存在重叠子问题,一定需要动态规划的优化技巧来处理。
### 五、最后总结
通过本文,你深入理解了正则表达式的两种常用通配符的算法实现。其实点号「.」的实现及其简单,关键是星号「*」的实现需要用到动态规划技巧,稍微复杂些,但是也架不住我们对问题的层层拆解,逐个击破。另外,你掌握了一种快速分析「重叠子问题」性质的技巧,可以快速判断一个问题是否可以使用动态规划套路解决。
回顾整个解题过程,你应该能够体会到算法设计的流程:从简单的类似问题入手,给基本的框架逐渐组装新的逻辑,最终成为一个比较复杂、精巧的算法。所以说,读者不必畏惧一些比较复杂的算法问题,多思考多类比,再高大上的算法在你眼里也不过一个脆皮。
如果本文对你有帮助,欢迎关注我的公众号 labuladong致力于把算法问题讲清楚

View File

Before

Width:  |  Height:  |  Size: 363 KiB

After

Width:  |  Height:  |  Size: 363 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 732 KiB

View File

Before

Width:  |  Height:  |  Size: 163 KiB

After

Width:  |  Height:  |  Size: 163 KiB