python - Python ElementTree - Extracting text from an element, stripping tags

Question

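Using ElementTree in Python, how can I extract all the text from a node, stripping any tags inside that element and keeping only the text?

For example, say I have the following: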

<tag>
  Some <a>example</a> text
</tag>

I would like to get back Some example text. How do I go about doing this? So far, the approaches I have tried have had fairly disastrous results.

score 22 · Accepted Answer

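If you are running Python 3.2+, you can use itertext.

itertext creates a text iterator that loops over this element and all of its subelements in document order, returning all inner text: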

import xml.etree.ElementTree as ET
xml = '<tag>Some <a>example</a> text</tag>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))

# -> 'Some example text'
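
The sample in the question spreads the element over several lines, so the joined text keeps the newlines and indentation. A minimal sketch, assuming Python 3 and that collapsing every run of whitespace to a single space is acceptable; the helper name flat_text is made up for illustration:

```python
import xml.etree.ElementTree as ET

def flat_text(elem):
    """Join all text inside elem and collapse whitespace runs to single spaces."""
    raw = ''.join(elem.itertext())
    return ' '.join(raw.split())

xml = '''<tag>
  Some <a>example</a> text
</tag>'''
root = ET.fromstring(xml)

print(repr(''.join(root.itertext())))  # raw text, whitespace preserved
print(flat_text(root))                 # whitespace collapsed
```

Collapsing with split() also merges whitespace inside the text itself, which may not be wanted if spacing is significant.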

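If you are running an older version of Python, you can reuse the implementation of itertext() by attaching it to the Element class, after which you can call it exactly as above: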

# original implementation of .itertext() for Python 2.7
def itertext(self):
    tag = self.tag
    if not isinstance(tag, basestring) and tag is not None:
        return
    if self.text:
        yield self.text
    for e in self:
        for s in e.itertext():
            yield s
        if e.tail:
            yield e.tail

# if necessary, monkey-patch the Element class
if 'itertext' not in ET.Element.__dict__:
    ET.Element.itertext = itertext

xml = '<tag>Some <a>example</a> text</tag>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))

# -> 'Some example text'

score 5 · Accepted Answer

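As the documentation says, if you want to read only the text, without any intermediate tags, you have to recursively concatenate all text and tail attributes in the correct order.

However, sufficiently recent versions (including the stdlib ones in 2.7 and 3.2, but not 2.6 or 3.1, and the current release versions of both ElementTree and lxml on PyPI) can do this for you automatically in the tostring method: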

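>>> # Python 2.7 session; Python 3 returns bytes unless encoding='unicode' is passed
>>> from xml.etree import ElementTree
>>> s = '''<tag>
...   Some <a>example</a> text
... </tag>'''
>>> t = ElementTree.fromstring(s)
>>> ElementTree.tostring(t, method='text')  # pass the parsed element t, not the string s
'\n  Some example text\n'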

If you also want to strip whitespace from the text, you need to do that manually. In your simple case, that is easy:

>>> ElementTree.tostring(t, method='text').strip()
'Some example text'
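
On Python 3 the same call returns bytes by default. A short sketch, assuming Python 3 and the standard library only, showing how encoding='unicode' yields a str instead:

```python
from xml.etree import ElementTree

s = '''<tag>
  Some <a>example</a> text
</tag>'''
t = ElementTree.fromstring(s)

# Default encoding: Python 3 returns a bytes object
as_bytes = ElementTree.tostring(t, method='text')
# encoding='unicode' asks for a str instead
as_str = ElementTree.tostring(t, method='text', encoding='unicode')

print(type(as_bytes).__name__, type(as_str).__name__)
print(as_str.strip())
```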

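However, in more complex cases, where you want to strip whitespace inside the intermediate tags, you may have to fall back to processing the texts and tails recursively yourself. That isn't hard; you just need to remember to handle the possibility that either attribute may be None. For example, here is a skeleton you can hang your own code on: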

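def textify(t):
    # Collect text fragments in document order: text, children, then tail
    s = []
    if t.text:
        s.append(t.text)
    for child in t:  # getchildren() was removed in Python 3.9
        s.append(textify(child))  # textify returns a str, so append it
    if t.tail:
        s.append(t.tail)
    return ''.join(s)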

This version only works if text and tail are guaranteed to be a str or None. For trees you build manually, that is not guaranteed to be true.
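
One way to relax that restriction is to coerce each value with str() before using it. The following is only a sketch under stated assumptions: Python 3, a made-up helper name collect_text, and a policy of stripping each fragment and joining fragments with single spaces, which may not suit every document:

```python
import xml.etree.ElementTree as ET

def collect_text(elem, include_tail=False):
    """Return the text inside elem, each fragment stripped and joined by spaces."""
    parts = []
    if elem.text is not None:
        parts.append(str(elem.text).strip())
    for child in elem:
        # A child's tail belongs to the parent's content, so include it here
        parts.append(collect_text(child, include_tail=True))
    if include_tail and elem.tail is not None:
        parts.append(str(elem.tail).strip())
    # Drop fragments that were empty or whitespace-only
    return ' '.join(p for p in parts if p)

# Build a tree by hand, with a non-str text value
root = ET.Element('tag')
root.text = '  Total: '
count = ET.SubElement(root, 'count')
count.text = 3
count.tail = ' items  '

print(collect_text(root))
```

The root's own tail is skipped by default because it lies outside the element. Joining with spaces inserts a space at every tag boundary, so markup placed inside a word would split that word.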

score 0 · Accepted Answer

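There is also a very simple solution if XPath can be used. It relies on XPath axes; more about them can be found here.

When a node (such as a div tag) contains text as well as other nodes (such as a, center, or another div) that contain text themselves, and we want to select all the text inside that div, the following XPath can be used: current_element.xpath("descendant-or-self::*/text()").extract(). This returns a list of all the text within the current element, with any inner tags stripped.

The nice thing about it is that no recursive function is needed; XPath takes care of all of that (it uses recursion internally, but for us it is as clean as it gets).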
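
The .xpath(...).extract() call above looks like the Scrapy selector API rather than ElementTree. The standard library's ElementTree supports only a subset of XPath and, as far as I know, accepts neither text() nor the descendant-or-self axis. A minimal sketch of the closest stdlib equivalent, assuming Python 3, which yields the same kind of list of text fragments in document order:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<div>Some <a>example</a> text</div>')

# One list entry per text fragment, in document order
fragments = list(root.itertext())

print(fragments)
print(''.join(fragments))
```

With a library that implements full XPath 1.0, such as lxml, the expression from the answer should be usable directly on an element.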

Here is a StackOverflow question about this suggested solution.
