python - Extracting text from an XML tag using Python ElementTree

Question

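I have a corpus of tens of thousands of (small) XML files, and I am trying to use Python to extract the text contained in one of the XML tags, for example everything between the body tags, such as: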

<body> sample text here with <bold> nested </bold> tags in this paragraph </body>

and then write a text document containing that string, before moving on down the list of XML files.

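I am using effbot's ElementTree, but I can't find the right command/syntax to do this. I found a website that uses miniDOM's dom.getElementsByTagName, but I'm not sure what the corresponding ElementTree method is. Any ideas would be greatly appreciated.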

score 2 · Accepted Answer

A better answer, one that shows how to actually use XML parsing to do this:

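import xml.etree.ElementTree as ET

stringofxml = "<body> sample text here with <bold> nested </bold> tags in this paragraph </body>"

def extractTextFromElement(elementName, stringofxml):
    # fromstring() returns the root element (<body> in this sample).
    tree = ET.fromstring(stringofxml)
    # Iterating over an element visits only its direct children.
    for child in tree:
        if child.tag == elementName:
            # child.text is None for an empty element, so fall back to ''.
            return (child.text or '').strip()

print(extractTextFromElement('bold', stringofxml))  # prints: nested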
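
The function above returns only the text sitting directly inside a direct child of the root. For the case in the question (everything between the body tags, nested tags included), a minimal sketch could combine Element.iter(), which is the closest ElementTree counterpart to getElementsByTagName, with Element.itertext(). The name extract_all_text and the whitespace normalization are assumptions made for illustration, not part of the original answer:

```python
import xml.etree.ElementTree as ET

def extract_all_text(element_name, xml_string):
    """Return the text of the first matching element, nested tags included."""
    root = ET.fromstring(xml_string)
    # iter() yields the root itself and every descendant with this tag,
    # so it also matches when the element is the document root.
    for element in root.iter(element_name):
        # itertext() yields the element's own text plus the text and tail
        # of every descendant, in document order.
        raw = "".join(element.itertext())
        # Collapse runs of whitespace left behind by the removed tags.
        return " ".join(raw.split())
    return None  # no element with that tag name

xml_string = "<body> sample text here with <bold> nested </bold> tags in this paragraph </body>"
print(extract_all_text("body", xml_string))
# prints: sample text here with nested tags in this paragraph
```

If the original spacing matters, the whitespace-collapsing line can be dropped and the raw joined string returned instead.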

score 1 · Accepted Answer

I would just use re:

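import re
# body_txt is assumed to hold the full XML text of one file.
# re.search finds <body> anywhere in the input (re.match only matches at the
# start), and re.DOTALL lets '.' span line breaks inside the body.
body_txt = re.search(r'<body>(.*)</body>', body_txt, re.DOTALL).group(1)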

Then strip the inner tags:

body_txt = re.sub('<.*?>','',body_txt)

It's true that you shouldn't use regular expressions when you don't need them... but there's nothing wrong with using them when you do.
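
Neither answer covers the rest of the workflow described in the question: writing one text file per XML file and moving down the list. A rough sketch of that loop, using the parser-based approach, might look like the following. The function name export_body_text, the *.xml glob pattern, and the choice to skip malformed files are all assumptions for illustration:

```python
from pathlib import Path
import xml.etree.ElementTree as ET

def export_body_text(xml_dir, out_dir, tag="body"):
    """Write one .txt file per XML file, holding the text of the first `tag` element."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for xml_file in sorted(Path(xml_dir).glob("*.xml")):
        try:
            root = ET.parse(str(xml_file)).getroot()
        except ET.ParseError:
            continue  # assumption: skip malformed files rather than abort the run
        # First element with this tag, whether it is the root or a descendant.
        element = next(root.iter(tag), None)
        if element is None:
            continue  # this file has no such element
        # Gather nested text and collapse whitespace left by the removed tags.
        text = " ".join("".join(element.itertext()).split())
        target = out_path / (xml_file.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Calling export_body_text("corpus", "corpus_txt") (both directory names hypothetical) would leave one .txt file in corpus_txt for each parseable XML file in corpus that contains a body element.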
