python - Getting rid of an optional space in a Python regex script

Question

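I'm having a bit of trouble with my regex script, and I'm hoping someone can help me out.

Basically, I have a regex that I use with re.findall() in a Python script. My goal is to search strings of varying lengths for references to Bible verses (e.g. John 3:16, Romans 6, etc.). My regex mostly works, but sometimes it adds an extra space before the name of the Bible book. Here is the script: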

versesToFind = re.findall(r'\d?\s?\w+\s\d+:?\d*', str)

To hopefully explain the problem better, here are my results when I run this script on this text string:

str = 'testing testing John 3:16 adsfbaf John 2 1 Kings 4 Romans 4'

Result (from www.pythonregex.com):

[u' John 3:16', u' John 2', u'1 Kings 4', u' Romans 4']

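As you can see, John 2 and Romans 4 have an extra space at the beginning, which I want to get rid of. I hope my explanation makes sense. Thanks in advance!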

score 1 · Accepted Answer

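You can make the digit and the space optional as a single unit by grouping them in parentheses (the ?: just specifies that the group is non-capturing):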

'(?:\d\s)?\w+\s\d+:?\d*'
 ^^^    ^

which produces:

>>> s = 'testing testing John 3:16 adsfbaf John 2 1 Kings 4 Romans 4'
>>> re.findall(r'(?:\d\s)?\w+\s\d+:?\d*', s)
['John 3:16', 'John 2', '1 Kings 4', 'Romans 4']
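
To see why the grouping matters, here is a small side-by-side sketch in Python 3 syntax (the variable names text, loose and grouped are just illustrative). In the original pattern \d? and \s? are optional independently of each other, so \s? can consume the space in front of a book name even when no digit precedes it. With the group, a space is only allowed when it directly follows a digit.

```python
import re

text = 'testing testing John 3:16 adsfbaf John 2 1 Kings 4 Romans 4'

# Original pattern: digit and space are each optional on their own,
# so \s? may match the space before a book name with no digit present.
loose = re.findall(r'\d?\s?\w+\s\d+:?\d*', text)

# Grouped pattern: the space can only match as part of "digit + space".
grouped = re.findall(r'(?:\d\s)?\w+\s\d+:?\d*', text)

print(loose)    # [' John 3:16', ' John 2', '1 Kings 4', ' Romans 4']
print(grouped)  # ['John 3:16', 'John 2', '1 Kings 4', 'Romans 4']
```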

score 0 · Answer

Instead of rewriting your regex, you can always just strip() the whitespace:

>>> L = [u' John 3:16', u' John 2', u'1 Kings 4', u' Romans 4']
>>> print map(unicode.strip, L)
[u'John 3:16', u'John 2', u'1 Kings 4', u'Romans 4']

The map() call here is equivalent to:

>>> print [i.strip() for i in L]
[u'John 3:16', u'John 2', u'1 Kings 4', u'Romans 4']
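
The session above is Python 2 (print statement, unicode.strip). For anyone on Python 3, a rough equivalent might look like the sketch below. It assumes the matches are ordinary str objects; the names verses, stripped_with_map and stripped_with_comprehension are made up for illustration. Note that map() returns a lazy iterator in Python 3, so it is wrapped in list() here.

```python
verses = [' John 3:16', ' John 2', '1 Kings 4', ' Romans 4']

# map() is lazy in Python 3, so wrap it in list() to get the results.
stripped_with_map = list(map(str.strip, verses))

# The list comprehension works the same way in Python 2 and Python 3.
stripped_with_comprehension = [v.strip() for v in verses]

print(stripped_with_map)  # ['John 3:16', 'John 2', '1 Kings 4', 'Romans 4']
```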

score 0 · Answer

Using a list comprehension, you can do it in one line:

versesToFind = [x.strip() for x in re.findall(r'\d?\s?\w+\s\d+:?\d*', str)]
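
One side note: naming the input variable str shadows Python's built-in str type. A possible way to package the one-liner while avoiding that is sketched below. The names VERSE_PATTERN and find_verses are hypothetical and not from the question; the pattern itself is the asker's original one, unchanged.

```python
import re

# The asker's original pattern, compiled once for reuse.
VERSE_PATTERN = re.compile(r'\d?\s?\w+\s\d+:?\d*')

def find_verses(text):
    """Return verse references found in text, with surrounding whitespace removed."""
    return [match.strip() for match in VERSE_PATTERN.findall(text)]

sample = 'testing testing John 3:16 adsfbaf John 2 1 Kings 4 Romans 4'
print(find_verses(sample))  # ['John 3:16', 'John 2', '1 Kings 4', 'Romans 4']
```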
