python - Python regex: match all consecutive capitalized words

Question

Short question:

I have a string:

title="Announcing Elasticsearch.js For Node.js And The Browser"

I want to find every pair of adjacent words in which both words are properly capitalized.

So the expected output is:

['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js', 'Node.js And', 'And The', 'The Browser']

What I have so far is this:

'[A-Z][a-z]+[\s-][A-Z][a-z.]*'

which gives me this output:

['Announcing Elasticsearch.js', 'For Node.js', 'And The']

How can I change my regex to produce the desired output?

score 2 · Accepted Answer

You can use this:

#!/usr/bin/python
import re

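title = "Announcing Elasticsearch.js For Node.js And The Browser TEst"
# The lookahead matches an empty string; the capture group inside it holds the pair.
# The lookbehind ensures the first word does not start in the middle of another word.
pattern = r'(?=((?<![A-Za-z.])[A-Z][a-z.]*[\s-][A-Z][a-z.]*))'

# Note: with the trailing "TEst", the list also ends with the partial pair
# 'Browser T', because nothing checks what follows the second word.
print(re.findall(pattern, title))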

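A "normal" pattern cannot match overlapping substrings, because every character it matches is consumed once and for all. A lookahead (?=..) (i.e. "followed by"), however, is only a check and consumes nothing, so it can be tested at every position in the string. If you put a capturing group inside the lookahead, you can therefore obtain overlapping substrings.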
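To make the difference concrete, here is a minimal sketch that runs the same pair pattern twice on the title from the question, once as a consuming pattern and once wrapped in a lookahead. The names `pair`, `consuming` and `overlapping` are only illustrative.

```python
import re

title = "Announcing Elasticsearch.js For Node.js And The Browser"

# One capitalized word, a separator, then another capitalized word.
pair = r"(?<![A-Za-z.])[A-Z][a-z.]*[\s-][A-Z][a-z.]*"

# Consuming search: each match swallows both words, so the second word
# of a pair can never start the next pair.
consuming = re.findall(pair, title)

# Same pattern inside a lookahead with a capture group: the match itself is
# empty, so the scan advances one character at a time and pairs may overlap.
overlapping = re.findall(r"(?=(" + pair + r"))", title)

print(consuming)    # ['Announcing Elasticsearch.js', 'For Node.js', 'And The']
print(overlapping)  # all six pairs, ending with 'The Browser'
```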

score 0 · Accepted Answer

There may be a more efficient way to do this, but you could use a regex like this:

(\b[A-Z][a-z.-]+\b)

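Then iterate over the captured groups and test each one with this regex: (^[A-Z][a-z.-]+$), to make sure that both the current match and the next match are capitalized words.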

Working example:

import re

title = "Announcing Elasticsearch.js For Node.js And The Browser"
matchlist = []
# Collect every capitalized word (dots and hyphens are allowed after the first letter).
m = re.findall(r"(\b[A-Z][a-z.-]+\b)", title)
# Start at 1 so that m[i - 1] never wraps around to the last word.
for i in range(1, len(m)):
    if re.match(r"(^[A-Z][a-z.-]+$)", m[i - 1]) and re.match(r"(^[A-Z][a-z.-]+$)", m[i]):
        matchlist.append([m[i - 1], m[i]])

print(matchlist)

Output:

[
    ['Announcing', 'Elasticsearch.js'], 
    ['Elasticsearch.js', 'For'], 
    ['For', 'Node.js'], 
    ['Node.js', 'And'], 
    ['And', 'The'], 
    ['The', 'Browser']
]
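One caveat: the loop pairs capitalized words in the order findall returns them, so a lowercase word sitting between two of them would go unnoticed. A minimal sketch of one way to check adjacency is to split on whitespace and pair neighbouring tokens directly. The helper name `capitalized_pairs` is hypothetical, and unlike the pattern in the question this sketch treats only whitespace, not hyphens, as a separator.

```python
import re

# A capitalized word: one uppercase letter, then lowercase letters, dots or hyphens.
WORD = re.compile(r"[A-Z][a-z.-]*")

def capitalized_pairs(text):
    """Return 'A B' for every two neighbouring tokens that are both capitalized."""
    tokens = text.split()
    return [
        f"{a} {b}"
        for a, b in zip(tokens, tokens[1:])
        if WORD.fullmatch(a) and WORD.fullmatch(b)
    ]

title = "Announcing Elasticsearch.js For Node.js And The Browser"
print(capitalized_pairs(title))
print(capitalized_pairs("Hello world Foo Bar"))  # only 'Foo Bar' qualifies
```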

score 0 · Accepted Answer

If your current Python code looks like this

import re

title = "Announcing Elasticsearch.js For Node.js And The Browser"
results = re.findall(r"[A-Z][a-z]+[\s-][A-Z][a-z.]*", title)

then your program is skipping every other pair. A simple solution is to search for the pattern again after skipping the first word, like this:

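m = re.match(r"[A-Z][a-z]+[\s-]", title)
title_without_first_word = title[m.end():]
# The first word of a pair must allow dots here as well; otherwise
# 'Elasticsearch.js For' and 'Node.js And' are not matched.
results2 = re.findall(r"[A-Z][a-z.]+[\s-][A-Z][a-z.]*", title_without_first_word)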

Now just combine results and results2, interleaving them if you need the pairs in their original order.
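The combining step could be sketched as follows. This assumes dots are allowed in the first word of a pair (needed for pairs that start with "Elasticsearch.js" or "Node.js"), and it relies on every word in the title being capitalized; a lowercase word in the middle would break the alternation between the two lists. The names `pair`, `first_word` and `combined` are illustrative.

```python
import re
from itertools import chain, zip_longest

title = "Announcing Elasticsearch.js For Node.js And The Browser"

# Assumption: dots are allowed in the first word too.
pair = r"[A-Z][a-z.]+[\s-][A-Z][a-z.]*"

# First pass: pairs starting at words 1, 3, 5, ...
results = re.findall(pair, title)

# Second pass: skip the first word, which yields pairs starting at words 2, 4, 6, ...
first_word = re.match(r"[A-Z][a-z.]+[\s-]", title)
results2 = re.findall(pair, title[first_word.end():]) if first_word else []

# Interleave the two lists to restore the original order of the pairs.
combined = [
    p for p in chain.from_iterable(zip_longest(results, results2)) if p is not None
]
print(combined)
```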
