python - ElementTree (Python) itertext not working with newlines

Question

As a follow-up to a question I posted recently...

I am doing some XML parsing with ElementTree, and I have the following method in Python:

def extract_all_text(element):
  # Concatenate every text fragment inside the element, ignoring the tags.
  return "".join(element.itertext())

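The purpose of this is to extract the text from an element, stripping out any tags that wrap text inside it. E.g., extract_all_text(ElementTree.fromstring('<a>B <c>D</c></a>')) should return 'B D'. However, when I use this method on an element from a file that contains newlines, I get a strange error. The error looks like this: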

File "/home/Intredasting/foo.py", line 74, in bar
  description = extract_all_text(root.find('description')).strip()
File "/home/Intredasting/foo.py", line 62, in extract_all_text
  return "".join(element.itertext())
TypeError: sequence item 0: expected str instance, list found

If I run ElementTree.dump(root.find('description')), which shows the XML element I am trying to parse, I get:

<description>
  Foo <a href="http://example.com">bar</a>.
</description>

If I edit the file to remove the newlines, so that the element looks like this:

<description>Foo <a href="http://example.com">bar</a>.</description>

then the method works perfectly and I get 'Foo bar.' as the result. Why does this happen? How can I make the method work with newlines?

Edit:

You can see the specific file I am using here (I trimmed it down to a simple version, but it still causes the error): http://www.filedropper.com/example_1

To test this file, try

$ python3
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('/path/to/example.xml')
>>> desc = tree.getroot().find('description')
>>> print("".join(desc.itertext()))

(This should produce the error.)

Another edit:

This code gives some additional insight into what is happening (run it in addition to the code above):

>>> for text in desc.itertext(): print(text)
['\n', '    Foo ']
bar
['.', '\n', '  ']

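Of course, I could work around this by simply joining those lists into strings. But I feel that this is either a bug in ElementTree, a problem with the input file, or something wrong with my Python version.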
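
A minimal sketch of that workaround, assuming itertext() may yield either a plain string or a list of strings, as in the output above. The isinstance check and the inline XML literal are illustrative and not part of my original code:

```python
import xml.etree.ElementTree as ET


def extract_all_text(element):
    """Join all text inside element, tolerating list chunks from itertext()."""
    parts = []
    for chunk in element.itertext():
        if isinstance(chunk, str):
            parts.append(chunk)
        else:
            # Assumption: a non-str chunk is a list of strings,
            # like ['\n', '    Foo '] in the output above.
            parts.extend(chunk)
    return "".join(parts)


# Hypothetical sample mirroring the <description> element from the question.
desc = ET.fromstring(
    '<description>\n  Foo <a href="http://example.com">bar</a>.\n</description>'
)
print(extract_all_text(desc).strip())
```

On an interpreter where itertext() already yields only strings, this behaves the same as the plain "".join(...) version, so it should be safe to leave in place.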

score 0 · Accepted Answer

I cannot reproduce your results with Python 2.7.5 and ElementTree 1.3.0:

In [1]: import xml.etree.ElementTree as ET

In [2]: ET.VERSION
Out[2]: '1.3.0'

In [3]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:el = ET.fromstring("""<description>
:        Foo <a href="http://example.com">bar</a>.
:    </description>""")
:--

In [4]: "".join(el.itertext())
Out[4]: '\n        Foo bar.\n    '

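Which versions of Python and ElementTree are you using? If you are on Python 3.3+, this may be related to this bug: http://bugs.python.org/issue16913 (fixed in 3.3.1).

Edit

I tried your code in Python 3.3.2+ (print should be a function, by the way) and could not reproduce the error, but Python 3.3.0 gives the same error message. I would say this is an ElementTree issue.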
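
To check whether a given interpreter is affected, something like the following sketch could be run. It reuses the XML snippet from the session above; the variable names chunks and all_strings are made up for illustration:

```python
import sys
import xml.etree.ElementTree as ET

el = ET.fromstring("""<description>
        Foo <a href="http://example.com">bar</a>.
    </description>""")

chunks = list(el.itertext())

# On an unaffected interpreter every chunk is a plain str;
# on Python 3.3.0 some chunks were reported to be lists instead.
all_strings = all(isinstance(c, str) for c in chunks)
print(sys.version_info[:3], "all chunks are str:", all_strings)

if all_strings:
    print(repr("".join(chunks)))
```

If all_strings comes out False, upgrading to a Python release that includes the fix for the issue linked above is probably the cleanest way out.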

