python - Python re: capture only one match per word

Question

I need to find prices in a text document. My code looks like this:

import re

sentence = "This is test text $25,000 $25,000$20,000 $30"
# Non-capturing group, so findall returns whole matches rather than the group
pattern = re.compile(ur'[$€£]?\d+(?:[.,]\d+)?', re.UNICODE | re.MULTILINE | re.DOTALL)
print pattern.findall(sentence)

The expected result is:

['$25,000', '$30']

I do not want $25,000$20,000 in the result, because it is not a valid value for my task. I only need whole-word matches.

But I get this result instead:

['$25,000', '$25,000', '$20,000', '$30']

How can I rewrite my regex so that it only matches prices delimited by whitespace or punctuation?

score 1 · Accepted Answer

This is the closest I could get (though plenty of people have stronger regex skills than I do):

pattern = re.compile(ur'(?:^|\s)[$€£]?\d+(?:[.,]\d+)?(?=\s|$)', re.UNICODE | re.MULTILINE | re.DOTALL)
print pattern.findall(sentence) # [' $25,000', ' $30']
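The leading spaces in that output come from (?:^|\s), which consumes the whitespace in front of each price. A minimal Python 3 sketch of one way around it, assuming it is acceptable to wrap the price in a capturing group so that findall returns only the group (the name PRICE_RE is made up for this illustration):

```python
import re

# Hypothetical name; same idea as the pattern above, with the price captured.
PRICE_RE = re.compile(r'(?:^|\s)([$€£]?\d+(?:[.,]\d+)?)(?=\s|$)')

sentence = "This is test text $25,000 $25,000$20,000 $30"

# With exactly one capturing group, findall returns the group's text,
# so the whitespace consumed by (?:^|\s) is left out of each result.
prices = PRICE_RE.findall(sentence)
print(prices)
```

Calling .strip() on each result of the original pattern would work as well; the capturing group just avoids the extra pass.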

score 1 · Accepted Answer

Try the following:

ur'(?<!\S)[€£$]?\d+(?:[.,]\d+)?(?!\S)'

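I added the negative assertions (?<!\S) and (?!\S), which mean "do not match if preceded by a non-whitespace character" and "do not match if followed by a non-whitespace character", respectively.

Test: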

>>> sentence = "$1234 $56$78.90 This is test text $25,000 $25,000$20,000 $30"
>>> pattern = re.compile(ur'(?<!\S)[€£$]?\d+(?:[.,]\d+)?(?!\S)', re.UNICODE | re.MULTILINE | re.DOTALL)
>>> print pattern.findall(sentence)
['$1234', '$25,000', '$30']
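For readers on Python 3, the same test can be sketched as follows. The ur'' prefix and the print statement are Python 2 only, so they are replaced here. The flags are dropped as an assumption that they are not needed: the pattern has no ^, $, or bare ., and str patterns are Unicode-aware by default in Python 3. The names PRICE_RE and find_prices are hypothetical.

```python
import re

# A price that must not touch a non-whitespace character on either side.
PRICE_RE = re.compile(r'(?<!\S)[€£$]?\d+(?:[.,]\d+)?(?!\S)')


def find_prices(text):
    """Return every whitespace-delimited price found in text."""
    return PRICE_RE.findall(text)


sentence = "$1234 $56$78.90 This is test text $25,000 $25,000$20,000 $30"
print(find_prices(sentence))
```

Because both assertions are zero-width, nothing around the price is consumed, and the results come back without surrounding whitespace.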

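If you want to allow certain non-whitespace characters before or after the match, replace \S with [^\s<chars>], where <chars> are the characters you want to allow. Example:

ur'(?<![^\s:])[€£$]?\d+(?:[.,]\d+)?(?![^\s,.])(?![.,]\d)'

This allows the pattern to be preceded by a : and followed by a , or a . character. The extra (?![.,]\d) is needed because , and . can also occur inside a price: without it, the optional fraction group backtracks and $25 is reported from $25,000$20,000.

>>> sentence = "$1234 $56$78.90 This is test text:$25,000. $45. $25,000$20,000 $30"
>>> pattern = re.compile(ur'(?<![^\s:])[€£$]?\d+(?:[.,]\d+)?(?![^\s,.])(?![.,]\d)', re.UNICODE | re.MULTILINE | re.DOTALL)
>>> print pattern.findall(sentence)
['$1234', '$25,000', '$45', '$30']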
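The substitution can also be wrapped in a small helper. This is a Python 3 sketch only: build_price_pattern and its before/after parameters are hypothetical names. It includes the guard (?![.,]\d) as an added assumption, so that a match cannot end in the middle of a number once , or . are allowed after a price.

```python
import re


def build_price_pattern(before='', after=''):
    """Compile a price pattern (hypothetical helper).

    `before` and `after` list the extra characters that may touch the
    price on its left and right side, in addition to whitespace.
    """
    # re.escape keeps characters such as ] or ^ literal inside the class.
    lead = r'(?<![^\s%s])' % re.escape(before)
    trail = r'(?![^\s%s])' % re.escape(after)
    # Guard: never stop right before a separator that continues the number.
    guard = r'(?![.,]\d)'
    return re.compile(lead + r'[€£$]?\d+(?:[.,]\d+)?' + trail + guard)


sentence = "$1234 $56$78.90 This is test text:$25,000. $45. $25,000$20,000 $30"
pattern = build_price_pattern(before=':', after=',.')
print(pattern.findall(sentence))
```

With both arguments left empty, the helper falls back to the strict whitespace-only behaviour of (?<!\S) and (?!\S).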
