python - Finding internal links in a Wikipedia dump using re in Python

Question

A sample of the dump text is:

s='[[Pierre-Joseph Proudhon|Proudhon]], [[Peter Kropotkin|Kropotkin]], [[Mikhail Bakunin|Bakunin]]'

When I run the regular expression below, I get:

import re

match_internal = re.findall(r'\[\[(.+)\]\]', s)
for i in match_internal:
    print(i)
>>Pierre-Joseph Proudhon|Proudhon]], [[Peter Kropotkin|Kropotkin]], [[Mikhail Bakunin|Bakunin

instead of the output I want:

Pierre-Joseph Proudhon|Proudhon
Peter Kropotkin|Kropotkin
Mikhail Bakunin|Bakunin

score 4 · Accepted Answer

You need to use a reluctant (non-greedy) quantifier instead of a greedy one:

re.findall(r'\[\[(.+?)\]\]', s)  # replaced `.+` with `.+?`

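With the greedy quantifier, your pattern (.+) matches everything up to the last ]] in the string. With the reluctant quantifier, the pattern (.+?) stops at the first ]] it reaches.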

>>> match_internal = re.findall(r'\[\[(.+?)\]\]', s)
>>> for i in match_internal:
...     print(i)

Pierre-Joseph Proudhon|Proudhon
Peter Kropotkin|Kropotkin
Mikhail Bakunin|Bakunin

score 1 · Accepted Answer

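By default, the + quantifier matches as much as possible. Because . matches any character (other than a newline), you get a single match that covers everything between the outermost pair of brackets.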

You should search for "non-bracket" characters inside the brackets instead, like this:

re.findall(r'\[\[([^\]]+)\]\]', s)
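
As a rough sketch of how this behaves on the sample string from the question: the names LINK_RE, links, target and label are only illustrative, and the split on "|" assumes the usual [[target|label]] form of a piped link.

```python
import re

s = '[[Pierre-Joseph Proudhon|Proudhon]], [[Peter Kropotkin|Kropotkin]], [[Mikhail Bakunin|Bakunin]]'

# Negated character class: the capture group cannot run past a closing bracket,
# so each [[...]] pair is matched separately.
LINK_RE = re.compile(r'\[\[([^\]]+)\]\]')  # illustrative name

links = LINK_RE.findall(s)
for link in links:
    # Assumption: a piped link has the form "target|label".
    # partition() splits on the first "|"; label is empty if there is none.
    target, _, label = link.partition('|')
    print(target, '->', label or target)
```

Unlike .+?, the class [^\]] also matches newline characters, so this version does not need re.DOTALL if a link ever spans a line break.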
