python - How to ignore tags in many XML files using Python

Question

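I have many XML files that contain a lot of text. I need to lowercase this text and remove its punctuation, but I don't know how to tell Python to ignore all the tags.

I found an XML parser called ElementTree, and I have a regular expression to find tags: pattern = re.compile ('<[^<]*?>')

I tested it, and it only gave me the text in the first tag (there are many tags with that name). Why?

I then ran a different test on a string to get all the tags: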

text = "<root> <test>aaaaaaa </test> <test2> bbbbbbbbb </test2> </root> <root> <test3> cccccc </test3> <test4> ddddd </test4> </root>"
pattern = re.compile ('<[^<]*?>')
tmp = pattern.findall(text, re.DOTALL)

It gave me:

['</test>', '<test2>', '</test2>', '</root>', '<root>', '<test3>', '</test3>', '<test4>', '</test4>', '</root>']

Why are <root> and <test> missing?

score 7 · Accepted Answer

You don't actually seem to be using ElementTree.

Here is an example of how to use ElementTree:

import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()

You can use recursion to run every tag through a function that cleans it:

def clean_tag(tag):
    for child in tag:
        clean_tag(child)
    if tag.text is not None:
        # add your code to do lowercase and punctuation here
        tag.text = tag.text.lower()

clean_tag(tree.getroot())
clean_xml = ET.tostring(tree.getroot())
