python - String masking with a regex and an offset

Question

I have a string on which I'm trying to create a regex mask that shows N words, given an offset. Say I have the following string:

"The quick, brown fox jumps over the lazy dog."

I want to show 3 words at a time:

Offset 0: "The quick, brown"
Offset 1: "quick, brown fox"
Offset 2: "brown fox jumps"
Offset 3: "fox jumps over"
Offset 4: "jumps over the"
Offset 5: "over the lazy"
Offset 6: "the lazy dog."

I'm using Python, and I have been using the following simple regex to detect 3 words:

>>> import re
>>> s = "The quick, brown fox jumps over the lazy dog."
>>> re.search(r'(\w+\W*){3}', s).group()
'The quick, brown '

But I can't work out how to get a kind of mask that shows the next 3 words rather than the first ones. I need to keep the punctuation.

score 5 · Accepted Answer

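Prefix matching option

You can do this by using a variable-prefix regex that skips the first offset words and captures the word triplet into a group.

So, something like this: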

import re
s = "The quick, brown fox jumps over the lazy dog."

print(re.search(r'(?:\w+\W*){0}((?:\w+\W*){3})', s).group(1))
# The quick, brown
print(re.search(r'(?:\w+\W*){1}((?:\w+\W*){3})', s).group(1))
# quick, brown fox
print(re.search(r'(?:\w+\W*){2}((?:\w+\W*){3})', s).group(1))
# brown fox jumps

Let's take a look at the pattern:

 _"word"_      _"word"_
/        \    /        \
(?:\w+\W*){2}((?:\w+\W*){3})
             \_____________/
                group 1

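Here is what it says: match 2 words, then, capturing into group 1, match 3 words.

The (?:...) constructs are used to group for repetition, but they are non-capturing.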
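Since only the repetition count in the prefix changes, the pattern can be built from the offset. The following is a minimal sketch of that idea; the helper name `window` and the constant `WORD` are made up for illustration and are not part of the original answer. It keeps the same loose "word" pattern, so it is only reliable while offset + count does not exceed the number of words in the string.

```python
import re

WORD = r"\w+\W*"  # the same loose "word" pattern used above


def window(text, offset, count=3):
    """Return `count` words starting at word index `offset`, or None."""
    # Non-capturing prefix skips `offset` words; group 1 captures `count` words.
    pattern = r"(?:%s){%d}((?:%s){%d})" % (WORD, offset, WORD, count)
    match = re.search(pattern, text)
    return match.group(1) if match else None


s = "The quick, brown fox jumps over the lazy dog."
for offset in range(7):
    print(offset, repr(window(s, offset)))
```

Printing with repr makes the trailing space that \W* picks up visible.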

References

regular-expressions.info/Capturing groups, Non-capturing groups
- Repeating a capturing group vs. capturing a repeated group

A note on the "word" pattern

It should be said that \w+\W* is a poor choice for a "word" pattern, as the following example shows:

import re
s = "nothing"
print(re.search(r'(\w+\W*){3}', s).group())
# nothing

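There are not 3 words, but the regex matches anyway: \W* can match the empty string, so backtracking lets \w+ split "nothing" into three pieces.

Perhaps a better pattern is something like this: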

\w+(?:\W+|$)

That is, a \w+ followed by either a \W+ or the end of the string $.
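To check the difference, here is a small sketch that applies the stricter pattern to the same input. The names `STRICT_WORD` and `first_words` are illustrative only.

```python
import re

STRICT_WORD = r"\w+(?:\W+|$)"  # a word must end at a non-word run or at end of string


def first_words(text, count=3):
    """Return the first run of `count` words found in `text`, or None."""
    match = re.search(r"(?:%s){%d}" % (STRICT_WORD, count), text)
    return match.group() if match else None


print(first_words("nothing"))               # None: a single word can no longer be split
print(repr(first_words("The quick, brown fox")))
print(repr(first_words("one two three")))   # the last word is closed by $
```

Because every word now has to be followed by at least one non-word character or by the end of the string, \w+ cannot give back characters to fake extra words.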

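Capturing lookahead option

As suggested by Kobi in a comment, this option is simpler in that you only have one static pattern. It uses findall to capture all matches (see ideone.com):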

import re
s = "The quick, brown fox jumps over the lazy dog."

triplets = re.findall(r"\b(?=((?:\w+(?:\W+|$)){3}))", s)

print(triplets)
# ['The quick, brown ', 'quick, brown fox ', 'brown fox jumps ',
#  'fox jumps over ', 'jumps over the ', 'over the lazy ', 'the lazy dog.']

print(triplets[3])
# fox jumps over

How it works: it matches at the zero-width word boundary \b, and uses a lookahead to capture 3 "words" in group 1.

    ______lookahead______
   /      ___"word"__    \
  /      /           \    \
\b(?=((?:\w+(?:\W+|$)){3}))
     \___________________/
           group 1
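Because the match itself is zero-width, each match object also records where its window starts in the string. As a rough sketch (the function name `word_windows` is hypothetical), the same pattern can be parameterised on the window size and run through re.finditer to get the character position alongside the text:

```python
import re


def word_windows(text, count=3):
    """Return (start_index, window_text) for every run of `count` words."""
    pattern = r"\b(?=((?:\w+(?:\W+|$)){%d}))" % count
    # The overall match is empty, so m.start() is the position of the boundary
    # and group 1 holds the text captured by the lookahead.
    return [(m.start(), m.group(1)) for m in re.finditer(pattern, text)]


s = "The quick, brown fox jumps over the lazy dog."
windows = word_windows(s)
print(len(windows))
print(windows[3])
```

The word boundaries that sit at the end of a word produce no match, because the lookahead requires \w+ immediately after the boundary.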

References

regular-expressions.info/Lookarounds

score 2 · Accepted Answer

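One alternative is to split the string and take slices:

import re
s = "The quick, brown fox jumps over the lazy dog."

words = re.split(r"\s+", s)
for i in range(len(words) - 2):
    print(' '.join(words[i:i+3]))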

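Of course, this does assume that you either have only single spaces between words, or don't care if every run of whitespace is collapsed into a single space.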
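If the original spacing matters, one possible variation (not from the answer itself) is to keep each word together with the whitespace that follows it, then join slices without inserting anything. The function name below is made up for this sketch.

```python
import re


def windows_preserving_spacing(text, count=3):
    """Slice word windows while keeping the original whitespace between words."""
    # Each token is a run of non-whitespace plus the whitespace that follows it.
    tokens = re.findall(r"\S+\s*", text)
    return ["".join(tokens[i:i + count]).rstrip()
            for i in range(len(tokens) - count + 1)]


spaced = "The  quick,\tbrown fox"
print(windows_preserving_spacing(spaced))
```

Only the whitespace after the last word of each window is stripped; the double space and the tab inside the window are left as they were.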

score 1 · Accepted Answer

No regex needed:

>>> s = "The quick, brown fox jumps over the lazy dog."
>>> for offset in range(7):
...     print('offset {0}: "{1}"'.format(offset, ' '.join(s.split()[offset:][:3])))
... 
offset 0: "The quick, brown"
offset 1: "quick, brown fox"
offset 2: "brown fox jumps"
offset 3: "fox jumps over"
offset 4: "jumps over the"
offset 5: "over the lazy"
offset 6: "the lazy dog."

score 1 · Accepted Answer

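We have two orthogonal problems here:

1. How to split the string.
2. How to build groups of 3 consecutive elements.

For 1, you can use a regex or, as others have pointed out, a simple str.split will suffice. For 2, note that what you want looks very similar to the pairwise abstraction in the itertools recipes: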

http://docs.python.org/library/itertools.html#recipes

So we write our generalized n-wise function:

import itertools

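def nwise(iterable, n):
    """nwise(iter([1,2,3,4,5]), 3) -> (1,2,3), (2,3,4), (3,4,5)"""
    # n independent copies of the iterator, the i-th one advanced by i items
    iterables = itertools.tee(iterable, n)
    slices = (itertools.islice(it, idx, None) for (idx, it) in enumerate(iterables))
    # zip is lazy in Python 3 (it replaces Python 2's itertools.izip)
    return zip(*slices)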

We end up with simple, modular code:

>>> s = "The quick, brown fox jumps over the lazy dog."
>>> list(nwise(s.split(), 3))
[('The', 'quick,', 'brown'), ('quick,', 'brown', 'fox'), ('brown', 'fox', 'jumps'), ('fox', 'jumps', 'over'), ('jumps', 'over', 'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog.')]

Or, as you requested:

>>> # also: list(map(" ".join, nwise(s.split(), 3)))
>>> [" ".join(words) for words in nwise(s.split(), 3)]
['The quick, brown', 'quick, brown fox', 'brown fox jumps', 'fox jumps over', 'jumps over the', 'over the lazy', 'the lazy dog.']
