python - Replace certain parts of a word with a regular expression

Question

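How do you remove the text inside <ref> *some text*</ref> as well as the ref tags themselves?

In '...and so on<ref>Oxford University Press</ref>.'

re.sub(r'<ref>.+</ref>', '', string) only removes the <ref> if it is followed by whitespace.

EDIT: I guess it has something to do with word boundaries... or does it?

EDIT2: What I need is for it to match the last (closing) </ref> even if it is on a new line.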

score 3 · Accepted Answer

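I don't really see your problem, because the code you pasted does remove the <ref>...</ref> part of the string. But if what you mean is that an empty ref tag is not removed, as in: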

re.sub(r'<ref>.+</ref>', '', '...and so on<ref></ref>.')

then what you need to do is replace the .+ with .*

A + means one or more repetitions, while a * means zero or more.

From http://docs.python.org/library/re.html:

'.' (Dot.) In the default mode, this matches any character except a newline.
    If the DOTALL flag has been specified, this matches any character including
    a newline.
'*' Causes the resulting RE to match 0 or more repetitions of the preceding
    RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’
    followed by any number of ‘b’s.
'+' Causes the resulting RE to match 1 or more repetitions of the preceding
    RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will
    not match just ‘a’.
'?' Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
    ab? will match either ‘a’ or ‘ab’.
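
A quick check of that difference, using the empty-tag string from above (the variable names are only for illustration):

```python
import re

text = '...and so on<ref></ref>.'

# .+ needs at least one character between the tags, so the empty pair is left alone.
with_plus = re.sub(r'<ref>.+</ref>', '', text)

# .* also accepts zero characters, so the empty pair is removed.
with_star = re.sub(r'<ref>.*</ref>', '', text)

print(with_plus)  # ...and so on<ref></ref>.
print(with_star)  # ...and so on.
```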

score 1 · Accepted Answer

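You might want to be careful not to remove a whole lot of text just because there is more than one closing </ref>. In my opinion, the regex below would be more accurate: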

r'<ref>[^<]*</ref>'

This prevents the "greedy" match from swallowing everything up to the last </ref>.
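
To illustrate on a string with two references (the sample string is made up for this sketch): the greedy .+ runs from the first <ref> to the last </ref>, while [^<]* cannot cross a '<' and so stops at the nearest closing tag. Because [^<] is a character class rather than '.', it also matches newlines. One caveat: it will not match a reference whose text itself contains another tag.

```python
import re

# Hypothetical sample with two separate references.
sample = 'one<ref>SDD</ref> two<ref>XX</ref> three'

# Greedy: matches from the first <ref> through the last </ref>.
greedy = re.sub(r'<ref>.+</ref>', '', sample)

# Negated class: each match ends at the first '<' after the opening tag.
bounded = re.sub(r'<ref>[^<]*</ref>', '', sample)

print(greedy)   # one three
print(bounded)  # one two three
```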

BTW: There is a great tool called The Regex Coach for analyzing and testing your regexes. You can find it at: http://www.weitz.de/regex-coach/

EDIT: Forgot to add code tags in the first paragraph.

score 1 · Accepted Answer

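You could build a fancy regex to do just what you want, but you would need to use DOTALL and a non-greedy search, and you would need to understand how regexes work in general, which you don't.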
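
For reference, a minimal sketch of what such a pattern could look like: .*? is the non-greedy form, and re.DOTALL lets '.' match newlines. The names REF_RE and strip_refs are made up for this example.

```python
import re

# Non-greedy body (.*?) so each match stops at the nearest </ref>;
# DOTALL so the body may span line breaks.
REF_RE = re.compile(r'<ref>.*?</ref>', re.DOTALL)

def strip_refs(text):
    """Remove every <ref>...</ref> pair, including the tags themselves."""
    return REF_RE.sub('', text)

sample = 'and so on<ref>Oxford\nUniversity Press</ref>. More<ref>XX</ref>.'
print(strip_refs(sample))  # and so on. More.
```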

Your best option is to use string methods rather than regexes, which is more pythonic anyway:

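# Strip each <ref>...</ref> pair using plain string methods.
while '<ref>' in string:
    begin, end = string.split('<ref>', 1)
    # Assumes every <ref> has a matching </ref>; otherwise this raises ValueError.
    trash, end = end.split('</ref>', 1)
    string = begin + end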

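If you want to be really general, allowing for odd capitalization of the tags, or for whitespace and attributes inside the tags, you shouldn't do this either. Invest in learning an html/xml parsing library instead. lxml currently seems to be widely recommended and well supported.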

score 0 · Accepted Answer

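If you try to do this with regular expressions, you are going to run into trouble. You are effectively trying to parse something, but your parser isn't up to the task.

Matching greedily across the string will probably consume too much, as in this example: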

<ref>SDD</ref>...<ref>XX</ref>

You would end up wiping out the whole middle.

What you really want is a parser, something like Beautiful Soup.

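# The original answer used the BeautifulSoup 3 import; this is the bs4 equivalent.
from bs4 import BeautifulSoup
s = "<a>sfsdf</a> <ref>XX</ref> || <ref>YY</ref>"
soup = BeautifulSoup(s, "html.parser")
# Replace every <ref> element (tag and contents) with a placeholder.
for z in soup.find_all("ref"):
    z.replace_with('!')
print(soup)  # <a>sfsdf</a> ! || !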
