regex - Is special treatment required for within in Python's re.sub?

Question

Example:

This has the desired effect:

Replace the following with blank:

<tag condition="MyCondition">Text</tag>

Via:

string = re.sub('<tag condition=\"MyCondition\">.+</tag>', '', string)

But consider the following:

<tag2 condition="myCondition2">
<tag>Text</tag> and <tag>text</tag> is here.
</tag2>

And suppose I want to replace tag2 and all its contents with blank, e.g.:

string = re.sub('<tag2 condition=\"myCondition2\">.+</tag2>', '', string)

It is not removing tag2 and contents and I think it might be because there are <tags> within tag2.

How do I replace tag2 and all contents with blank?

score 1 · Accepted Answer

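Once you get past the simple cases, regular expressions become your enemy. Just parse the XML with a proper XML parser, modify the parse tree, and print it back out: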

import lxml.etree

# Bytes literal: on Python 3, lxml rejects str input that carries an encoding declaration
xml = b'''
    <?xml version="1.0" encoding="UTF-8" ?>
    <root>
        <tag condition="MyCondition">Text</tag>

        <tag3>Don't touch me</tag3>

        <tag2 condition="myCondition2">
            <tag>Text</tag> and <tag>text</tag> is here.
        </tag2>
    </root>
'''

tree = lxml.etree.fromstring(xml.strip())

for element in tree.xpath('//tag[@condition="MyCondition"] | //tag2[@condition="myCondition2"]'):
    element.getparent().remove(element)

# tostring() returns bytes by default; decode so print() shows readable text
print(lxml.etree.tostring(tree, pretty_print=True).decode())
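
If lxml is not available, roughly the same idea can be sketched with the standard library's xml.etree.ElementTree. This is only an illustration under some assumptions: the names xml_text, targets and result are made up for this example, the XML declaration is left out for brevity, and because ElementTree elements have no getparent(), the loop walks each parent's children instead of using the XPath union above.

```python
import xml.etree.ElementTree as ET

# Same sample document as above, without the XML declaration (illustrative only)
xml_text = """<root>
    <tag condition="MyCondition">Text</tag>
    <tag3>Don't touch me</tag3>
    <tag2 condition="myCondition2">
        <tag>Text</tag> and <tag>text</tag> is here.
    </tag2>
</root>"""

root = ET.fromstring(xml_text)

# (tag name, condition attribute) pairs to delete; hypothetical helper set
targets = {("tag", "MyCondition"), ("tag2", "myCondition2")}

# Snapshot the elements first so removal does not disturb iteration
for parent in list(root.iter()):
    for child in list(parent):
        if (child.tag, child.get("condition")) in targets:
            parent.remove(child)

# encoding="unicode" makes tostring() return str instead of bytes
result = ET.tostring(root, encoding="unicode")
print(result)
```

As with lxml, removing an element this way also drops any text that directly follows its closing tag (its tail), which is only whitespace in this sample.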

score -1 · Accepted Answer

You are missing the re.DOTALL flag. Without it, . does not match newline characters, so .+ cannot span the line breaks inside tag2. The nested tags are not the problem, as you can see when you apply your expression to the (almost) equivalent

<tag2 condition="myCondition2"><tag>Text</tag> and <tag>text</tag> is here.</tag2>
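
A minimal sketch of the difference, assuming the multi-line input from the question; the variable names (text, pattern, without_flag, with_flag) are only for illustration:

```python
import re

# Multi-line input from the question, with surrounding text added for context
text = '''before
<tag2 condition="myCondition2">
<tag>Text</tag> and <tag>text</tag> is here.
</tag2>
after'''

pattern = r'<tag2 condition="myCondition2">.+</tag2>'

# Without DOTALL, "." stops at the newline right after the opening tag, so nothing matches
without_flag = re.sub(pattern, '', text)

# With DOTALL, "." also matches newlines and the whole element is removed
with_flag = re.sub(pattern, '', text, flags=re.DOTALL)

print(without_flag == text)
print(repr(with_flag))
```

Note that .+ is greedy: if the input contained several tag2 elements, it would match from the first opening tag to the last closing tag. A non-greedy .+? would stop at the first closing tag instead.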
