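I have a Wikipedia dump and am struggling to find a suitable regex pattern to remove the double square brackets in an expression. Here is an example of the expression: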

line = 'is the combination of the code names for Herbicide Orange (HO) and Agent LNX, one of the [[herbicide]]s and [[defoliant]]s used by the [[United States armed forces|U.S. military]] as part of its [[herbicidal warfare]] program, [[Operation Ranch Hand]], during the [[Vietnam War]] from 1961 to 1971.'

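I would like to remove all the square brackets under the following conditions:

  • If there is no vertical separator inside the square brackets, remove the brackets.

    Example: [[herbicide]]s becomes herbicides

  • If there is a vertical separator inside the brackets, remove the brackets and keep only the phrase after the separator.

    Example: [[United States armed forces|U.S. military]] becomes U.S. military

I tried re.match and re.search but could not get the desired output.

Thank you for your help!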

3 Answers

What you need is re.sub. Note that both the square brackets and the pipe are metacharacters, so they need to be escaped.

re.sub(r'\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]', r'\1', line)

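The \1 in the replacement string refers to whatever was matched by the parenthesized group that does not start with ?: (which is, in every case, the text you want to keep).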

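There are two caveats. First, this allows only a single pipe between the opening and closing brackets. If there can be more than one, you need to decide whether you want everything after the first pipe or everything after the last one. Second, a single ] is not allowed between the opening and closing brackets. If that is a problem, there is still a regex solution, but it is considerably more complicated.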
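To make the two caveats concrete, here is a minimal sketch that runs the pattern from this answer on inputs that trigger them. The helper name strip_links and the short test strings are made up for illustration; only the pattern itself comes from the answer.

```python
import re

# Pattern from the answer above, unchanged.
PATTERN = r'\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]'

def strip_links(text):
    """Replace [[target]] and [[target|label]] with the visible text."""
    return re.sub(PATTERN, r'\1', text)

# The two normal cases from the question.
simple = strip_links('[[herbicide]]s')
piped = strip_links('[[United States armed forces|U.S. military]]')

# Caveat 1: with more than one pipe the pattern does not match,
# so the text is left untouched.
two_pipes = strip_links('[[a|b|c]]')

# Caveat 2: a single ] inside the link also prevents a match.
stray_bracket = strip_links('[[a]b]]')

print(simple, piped, two_pipes, stray_bracket, sep='\n')
```

In both caveat cases re.sub finds nothing to replace and returns the input as is, so such links stay in the output with their brackets rather than being mangled.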

Here is a full explanation of the pattern:

\[\[        # match two literal [
(?:         # start optional non-capturing subpattern for pre-| text
   [^\]|]   # this looks a bit confusing but it is a negated character class
            # allowing any character except for ] and |
   *        # zero or more of those
   \|       # a literal |
)?          # end of subpattern; make it optional
(           # start of capturing group 1 - the text you want to keep
    [^\]|]* # the same character class as above
)           # end of capturing group
\]\]        # match two literal ]
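The commented layout above can also be kept in the code itself. The following is a sketch, assuming you are happy to compile the pattern with re.VERBOSE, which makes the regex engine ignore whitespace and # comments outside character classes. The name LINK and the sample sentence are illustrative only.

```python
import re

# Same pattern as above, compiled with re.VERBOSE so that the layout
# and the comments can stay inside the pattern string.
LINK = re.compile(r'''
    \[\[            # two literal [
    (?:             # optional non-capturing group for the pre-| text
        [^\]|]*     # any characters except ] and |
        \|          # a literal |
    )?
    (               # capturing group 1: the text to keep
        [^\]|]*
    )
    \]\]            # two literal ]
''', re.VERBOSE)

sample = 'used by the [[United States armed forces|U.S. military]] in the [[Vietnam War]]'
cleaned = LINK.sub(r'\1', sample)
print(cleaned)
```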
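answered 2012-11-30T19:53:15.753

You can use re.sub to find everything between [[ and ]]. I think it is slightly easier to pass in a lambda function to do the replacement (it takes everything after the last '|'):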

>>> import re
>>> re.sub(r'\[\[(.*?)\]\]', lambda L: L.group(1).rsplit('|', 1)[-1], line)
'is the combination of the code names for Herbicide Orange (HO) and Agent LNX, one of the herbicides and defoliants used by the U.S. military as part of its herbicidal warfare program, Operation Ranch Hand, during the Vietnam War from 1961 to 1971.'
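If the lambda is hard to read, the same replacement can be written as a named function. This is only a sketch of the approach above; the function names visible_text and strip_links are made up for the example.

```python
import re

def visible_text(match):
    """Return the part of a [[...]] link after its last '|', if any."""
    inner = match.group(1)
    # rsplit with maxsplit=1 splits at the last '|' only; [-1] is the
    # whole string when there is no '|' at all.
    return inner.rsplit('|', 1)[-1]

def strip_links(text):
    # re.sub calls visible_text once per match and inserts its result.
    return re.sub(r'\[\[(.*?)\]\]', visible_text, text)

one_pipe = strip_links('[[United States armed forces|U.S. military]]')
many_pipes = strip_links('[[a|b|c]]')
no_pipe = strip_links('[[herbicide]]s')
print(one_pipe, many_pipes, no_pipe, sep='\n')
```

With several pipes this keeps only the text after the last one, so '[[a|b|c]]' becomes 'c'.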
answered 2012-11-30T19:58:12.577

>>> import re
>>> re.sub(r'\[\[(?:[^|\]]*\|)?([^\]]*)]]', r'\1', line)
'is the combination of the code names for Herbicide Orange (HO) and Agent LNX, one of the herbicides and defoliants used by the U.S. military as part of its herbicidal warfare program, Operation Ranch Hand, during the Vietnam War from 1961 to 1971.'

Explanation:

\[\[       # match two opening square brackets
(?:        # start optional non-capturing group
   [^|\]]*   # match any number of characters that are not '|' or ']'
   \|        # match a '|'
)?         # end optional non-capturing group
(          # start capture group 1
   [^\]]*    # match any number of characters that are not ']'
)          # end capture group 1
]]         # match two closing square brackets

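By replacing each match of the regex above with the contents of capture group 1, you get the contents of the square brackets, but only the part after the separator if one is present.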
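Because capture group 1 in this pattern excludes only ], it can itself contain pipes. The sketch below shows what that means for a link with more than one separator; the short test strings are invented for illustration.

```python
import re

# Pattern from this answer, unchanged.
PATTERN = r'\[\[(?:[^|\]]*\|)?([^\]]*)]]'

no_pipe = re.sub(PATTERN, r'\1', '[[herbicide]]s')
one_pipe = re.sub(PATTERN, r'\1', '[[United States armed forces|U.S. military]]')

# With several pipes, the optional group consumes the text up to the
# first '|' and group 1 keeps everything after it, pipes included.
many_pipes = re.sub(PATTERN, r'\1', '[[a|b|c]]')

print(no_pipe, one_pipe, many_pipes, sep='\n')
```

So for '[[a|b|c]]' this pattern yields 'b|c', that is, everything after the first pipe.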

answered 2012-11-30T19:57:32.717