Regular Expressions

2021-12-03

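Regular expressions ("regex" for short) let you search strings using almost any rule you can think of: for example, finding every uppercase letter in a string, or finding the phone numbers in a document.

Regular expressions are notorious for their seemingly strange syntax. That syntax is a by-product of their flexibility: a regular expression must be able to describe practically any string pattern imaginable, which is why the pattern format is so intricate.

We use Python's built-in re library to work with regular expressions. For more information, see the relevant section of the official documentation.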

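Searching for Basic Patterns

Suppose we have the following string:

text = "The agent's phone number is 408-555-1234. Call soon!"

To check whether the string 'phone' appears in this text, the quick way is:

'phone' in text

This returns True, because text contains that string.

The same check, done with a regular expression, looks like this: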

import re
pattern = 'phone'
re.search(pattern,text)

Output

<re.Match object; span=(12, 17), match='phone'>

This result means that 'phone' was found in the string text, and that the match covers indices 12 up to (but not including) 17 of that string.

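Another example:

pattern = "NOT IN TEXT"
re.search(pattern,text)

Nothing is displayed, because no match is found.

As these examples show, re.search() scans the text and returns a match object for the first match. If the pattern is not found, it returns None.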
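Because re.search() returns None when nothing matches, calling a method on the result without checking it first can raise an AttributeError. The sketch below shows one way to guard against that. The helper name contains_pattern is made up for this illustration and is not part of the re library.

```python
import re

text = "The agent's phone number is 408-555-1234. Call soon!"

def contains_pattern(pattern, string):
    """Return True if pattern occurs anywhere in string (hypothetical helper)."""
    # re.search() returns a match object on success and None on failure
    return re.search(pattern, string) is not None

found = contains_pattern("phone", text)
missing = contains_pattern("NOT IN TEXT", text)
print(found, missing)
```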

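If the returned match object is bound to a variable match, then match.span() returns the index range of the match, match.start() returns the start index, and match.end() returns the end index.

>>> pattern = 'phone'
>>> match = re.search(pattern, text)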
>>> match
<re.Match object; span=(12, 17), match='phone'>
>>> match.span()
(12, 17)
>>> match.start()
12
>>> match.end()
17
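The end index behaves like the upper bound of a slice: it is exclusive. As a small check, slicing the original string with the reported span should give back exactly the matched text. This is only an illustrative sketch.

```python
import re

text = "The agent's phone number is 408-555-1234. Call soon!"
match = re.search("phone", text)

# span() is (start, end); end is exclusive, just like a slice bound
start, end = match.span()
sliced = text[start:end]
print(sliced == match.group())
```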

What happens if the text being searched contains more than one match?

>>> text = "my phone is a new phone"
>>> match = re.search("phone",text)
>>> match
<re.Match object; span=(3, 8), match='phone'>

Only the first match is returned.

To get all of the matches, use the re.findall() function:

>>> matches = re.findall("phone",text)
>>> matches
['phone', 'phone']
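re.findall() returns only the matched strings. When the positions are needed as well, one option (not covered in the text above) is re.finditer(), which yields a match object for each non-overlapping match. A minimal sketch:

```python
import re

text = "my phone is a new phone"

# re.finditer() yields one match object per non-overlapping match
spans = [m.span() for m in re.finditer("phone", text)]
print(spans)
```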

To get the actual text that was matched, use the match object's .group() method.

>>> match.group()
'phone'

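Complex Patterns

The simple regular expressions above demonstrated the basic functions in re. Next, we look at how to write more complex regular expressions.

In a regular expression, digits, single characters, and so on can be represented by special codes, and these codes are combined to build a "pattern string". Pattern strings make heavy use of the backslash \ . For that reason, pattern strings in Python are usually written as raw strings, like this:

r'mypattern'

In a raw string, Python no longer treats \ as an escape character, so the backslashes reach the regex engine unchanged.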
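To see why this matters, the sketch below compares a plain string and a raw string that look the same in source code. It uses \b, the regex code for a word boundary, which this article does not otherwise cover; it is chosen here only because Python itself also gives \b a meaning (a backspace character) in plain strings.

```python
import re

plain = "\bcat\b"   # Python converts each \b into a backspace character
raw = r"\bcat\b"    # the regex engine receives \b, meaning "word boundary"

plain_match = re.search(plain, "a cat sat")
raw_match = re.search(raw, "a cat sat")

# The plain string is shorter because each \b collapsed to one character
print(len(plain), len(raw))
```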

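The table below lists the most common character-class codes:

Table 1

Symbol	Meaning	Example pattern	Example match
\d	A digit	file_\d\d	file_25
\w	A letter, digit, or underscore	\w-\w\w\w	A-b_1
\s	A whitespace character	a\sb\sc	a b c
\D	A non-digit	\D\D\D	ABC
\W	A non-alphanumeric character	\W\W\W\W\W	*-+=)
\S	A non-whitespace character	\S\S\S\S	Yoyo

There is no need to memorize these; look them up when you need them.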
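The pattern and sample pairs in Table 1 can be checked directly. The sketch below uses re.fullmatch(), which succeeds only when the entire sample string matches the pattern.

```python
import re

# (pattern, sample) pairs taken from Table 1
examples = [
    (r"file_\d\d", "file_25"),
    (r"\w-\w\w\w", "A-b_1"),
    (r"a\sb\sc", "a b c"),
    (r"\D\D\D", "ABC"),
    (r"\W\W\W\W\W", "*-+=)"),
    (r"\S\S\S\S", "Yoyo"),
]

# True for every pair whose sample is fully matched by its pattern
results = [re.fullmatch(p, s) is not None for p, s in examples]
print(results)
```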

Consider the following code:

>>> text = "My telephone number is 408-555-1234"
>>> phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d',text)
>>> phone.group()
'408-555-1234'

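Note the repeated \d. This is tedious, especially when looking for a long run of digits, so quantifiers are used instead.

Table 2

Character	Description	Example pattern	Example match
+	One or more times	Version \w-\w+	Version A-b1_1
{3}	Exactly 3 times	\D{3}	abc
{2,4}	2 to 4 times	\d{2,4}	123
{3,}	3 or more times	\w{3,}	anycharacters
*	Zero or more times	A*B*C*	AAACC
?	Zero or one time	plurals?	plural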
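A quantifier applies only to the element immediately before it, which is easy to get wrong with * and ?. The sketch below checks a few quantified patterns with re.fullmatch(); the sample strings are chosen for illustration.

```python
import re

# Each quantifier binds to the single element just before it
star_each = re.fullmatch(r"A*B*C*", "AAACC")   # zero or more of A, B, and C in turn
star_last = re.fullmatch(r"ABC*", "ABCCC")     # only C is repeated
star_fail = re.fullmatch(r"ABC*", "AAACC")     # fails: exactly one A, then B, is required

# ? makes the final s optional
singular = re.fullmatch(r"plurals?", "plural")
plural = re.fullmatch(r"plurals?", "plurals")

# {2,4} rejects inputs outside the range
too_short = re.fullmatch(r"\d{2,4}", "1")
in_range = re.fullmatch(r"\d{2,4}", "123")
too_long = re.fullmatch(r"\d{2,4}", "12345")
```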

Rewriting the earlier regular expression with quantifiers:

>>> phone = re.search(r'\d{3}-\d{3}-\d{4}',text)
>>> phone.group()
'408-555-1234'

The result is the same, but the pattern is more compact and is much easier to extend to longer and more complex patterns.
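The same quantified pattern also works with re.findall() to collect every phone number in a longer text. The document string below is invented for this sketch.

```python
import re

# Made-up sample text; the last number has the wrong shape on purpose
document = "Office: 408-555-1234, mobile: 650-555-9876, fax: 555-0100."

# Three digits, three digits, four digits, separated by hyphens
numbers = re.findall(r"\d{3}-\d{3}-\d{4}", document)
print(numbers)
```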

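Grouping

Continuing with the phone number above: it consists of three groups of digits (three, three, and four digits). To capture each group as a separate unit, use grouping in the regular expression, which is done with parentheses (), for example: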

>>> phone = re.search(r'(\d{3})-(\d{3})-(\d{4})',text)
>>> phone.group()
'408-555-1234'

Note how the regular expression is written. Calling phone.group() with no argument returns the entire match.

>>> phone.group(1)
'408'
>>> phone.group(2)
'555'
>>> phone.group(3)
'1234'

With the arguments 1, 2, and 3, group() returns the text matched by the corresponding group. Note that an argument of 0 returns the entire match.

>>> phone.group(0)
'408-555-1234'
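Two related features of match objects may be useful here, although the text above does not cover them: .groups() returns all captured groups at once, and the (?P<name>...) syntax gives groups names. In the sketch below, the group names area, prefix, and line are arbitrary choices made for illustration.

```python
import re

text = "My telephone number is 408-555-1234"

# Numbered groups, retrieved together as a tuple
phone = re.search(r"(\d{3})-(\d{3})-(\d{4})", text)
parts = phone.groups()

# Named groups; the names are illustrative only
named = re.search(r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})", text)
area = named.group("area")
as_dict = named.groupdict()
print(parts, area, as_dict)
```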

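The Or Operator

Regular expressions use the pipe operator | for alternation ("or"). The pipe must not be escaped, because \| matches a literal | character. For example:

>>> re.search(r"man|woman","This man was here.")
<re.Match object; span=(5, 8), match='man'>

Compare:

>>> re.search(r"man|woman","This woman was here.")
<re.Match object; span=(5, 10), match='woman'>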
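The difference between an unescaped and an escaped pipe can be checked directly. In this sketch, the sample sentence is invented for illustration.

```python
import re

sentence = "This man and that woman were here."

# Unescaped pipe: alternation, so either word can match
both = re.findall(r"man|woman", sentence)

# Escaped pipe: a literal "|" character, so the sentence does not match
escaped = re.search(r"man\|woman", sentence)
literal = re.search(r"man\|woman", "man|woman")
print(both, escaped, literal.group())
```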

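Wildcards

The wildcard in a regular expression is the period ".", which matches any single character except a newline. For example:

>>> re.findall(r".at","The cat in the hat sat here.")
['cat', 'hat', 'sat']

To require exactly three characters before 'at':

>>> re.findall(r"...at","The bat went splat")
['e bat', 'splat']

Look closely at this result: a space is also a character, and r'...at' matches any three characters followed by 'at'.

What if we want to match every word that ends in 'at'?

# one or more non-whitespace characters followed by 'at'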
>>> re.findall(r'\S+at',"The bat went splat")
['bat', 'splat']
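One caveat: \S+at also matches 'at' in the middle of a longer word. A stricter variant, sketched below, uses the word-boundary code \b, which is an addition not covered in this article. The sample sentence is invented for illustration.

```python
import re

sentence = "The bat sat on a category mat"

# \S+at stops as soon as it has seen "at", even inside "category"
loose = re.findall(r"\S+at", sentence)

# \b requires a word boundary on both sides, so only whole words qualify
strict = re.findall(r"\b\w+at\b", sentence)
print(loose, strict)
```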

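Start and End Anchors

In a regular expression, ^ anchors the match to the start of the string and $ anchors it to the end, for example:

# find a digit at the end of the string
>>> re.findall(r'\d$','This ends with a number 2')
['2']

# find a digit at the start of the string
>>> re.findall(r'^\d','1 is the loneliest number.')
['1']

The code above returns ['2'] and ['1'], the last and first characters of the respective strings.

Note that these anchors apply to the whole string, not to individual words.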
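The sketch below illustrates that point, and also shows the re.MULTILINE flag, an addition not discussed in this article, which makes ^ match at the start of every line instead of only at the start of the string. The sample strings are invented.

```python
import re

# The digit starts a word but not the string, so ^ does not match
mid_string = re.findall(r"^\d", "number 1 is first")

lines = "1 one\n2 two\n3 three"

# Default behaviour: ^ matches only at the very start of the string
default = re.findall(r"^\d", lines)

# With re.MULTILINE, ^ also matches right after each newline
per_line = re.findall(r"^\d", lines, flags=re.MULTILINE)
print(mid_string, default, per_line)
```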

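Excluding Characters from a String

To exclude a given type of character, place the ^ symbol at the start of a set of square brackets []. The set then matches any character that is not listed inside the brackets, so the listed characters are filtered out of the result. For example: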

>>> phrase = "there are 3 numbers 34 inside 5 this sentence."
>>> re.findall(r'[^\d]',phrase)
['t', 'h', 'e', 'r', 'e', ' ', 'a', 'r', 'e', ' ', ' ', 'n', 'u', 'm', 'b', 'e', 'r', 's', ' ', ' ', 'i', 'n', 's', 'i', 'd', 'e', ' ', ' ', 't', 'h', 'i', 's', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', '.']
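A list of single characters is rarely the final goal. Adding a + quantifier keeps runs of non-digits together, and the pieces can then be joined back into a string. The sketch below also shows re.sub(), which is not covered above, as a more direct way to delete the digits.

```python
import re

phrase = "there are 3 numbers 34 inside 5 this sentence."

# [^\d]+ matches each run of consecutive non-digit characters
chunks = re.findall(r"[^\d]+", phrase)
rejoined = "".join(chunks)

# re.sub() replaces every digit with an empty string
substituted = re.sub(r"\d", "", phrase)
print(chunks)
print(rejoined == substituted)
```

Note that the spaces on both sides of each removed number remain, so the result contains double spaces.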

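Removing Punctuation

The same approach can be used to remove punctuation from a string. Here the excluded set also contains a space, so the result is split into words: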

>>> test_phrase = 'This is a string! But it has punctuation. How can we remove it?'
>>> re.findall('[^!.? ]+',test_phrase)
['This', 'is', 'a', 'string', 'But', 'it', 'has', 'punctuation', 'How', 'can', 'we', 'remove', 'it']
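To turn the list of words back into a single string without punctuation, join them with spaces. A minimal sketch:

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Runs of characters that are not '!', '.', '?', or a space
words = re.findall(r'[^!.? ]+', test_phrase)

# Rebuild the sentence with single spaces between the words
clean = ' '.join(words)
print(clean)
```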

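Brackets for Grouping

Square brackets [ ] can also be used to group the characters allowed at a position. For example, find the hyphenated words in the string below.

>>> text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'
>>> re.findall(r'[\w]+-[\w]+',text)
['hyphen-words', 'long-ish']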

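Parentheses for Multiple Options

When there are several alternatives to match, list them inside parentheses. For example, search the string below for a word that starts with cat and ends with fish, nap, or claw.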

>>> text = 'Hello, would you like some catfish?'
>>> re.search(r'cat(fish|nap|claw)',text)
<re.Match object; span=(27, 34), match='catfish'>
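The parentheses here also form a capturing group, which changes what re.findall() returns: it gives back only the captured part. The sketch below contrasts this with a non-capturing group (?:...), which is an addition not covered above. The sample text is invented.

```python
import re

text = "catfish catnap catclaw catalog"

# group(1) is the alternative that matched inside the parentheses
match = re.search(r"cat(fish|nap|claw)", text)
suffix = match.group(1)

# With one capturing group, findall returns only the captured text
suffixes = re.findall(r"cat(fish|nap|claw)", text)

# (?:...) groups without capturing, so findall returns the whole match
whole = re.findall(r"cat(?:fish|nap|claw)", text)
print(suffix, suffixes, whole)
```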

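Conclusion

Regular expressions have many uses, whether in software development, web programming, or data cleaning for machine learning. The topic is large, though, and mastering it requires patient, careful attention to detail.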

