python - How do I parse some data out of a large XML file?

Question

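I need to extract the location and radius data from a large XML file formatted as shown below and store it in a 2-dimensional ndarray. This is my first time using Python, and I can't find anything on the best way to do this.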

<species name="MyHeterotrophEPS" header="family,genealogy,generation,birthday,biomass,inert,capsule,growthRate,volumeRate,locationX,locationY,locationZ,radius,totalRadius">
0,0,0,0.0,0.0,0.0,77.0645361927206,-0.1001871531330136,-0.0013358287084401814,4.523853439106942,234.14575280979898,123.92820420047076,0.0,0.6259920275663835;
0,0,0,0.0,0.0,0.0,108.5705297969604,-0.1411462759900182,-0.001881950346533576,1.0429122163754276,144.1066875513379,72.24884428367467,0.0,0.7017581019907897;
.
.
.
</species>

Edit: by "large" I mean large by human standards. I am not running into any memory problems.

score 4 · Accepted Answer

You basically have CSV data inside an XML text value.

Use ElementTree to parse the XML, then use numpy.genfromtxt() to load that text into an array:

import numpy
from xml.etree import ElementTree as ET

tree = ET.parse('yourxmlfilename.xml')
# Locate the <species> element by its name attribute
species = tree.find(".//species[@name='MyHeterotrophEPS']")
# The header attribute holds the comma-separated column names
names = species.attrib['header']
array = numpy.genfromtxt((line.rstrip(';') for line in species.text.splitlines()),
    delimiter=',', names=names)

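Note the generator expression with the str.splitlines() call; it turns the text of the XML element into a sequence of lines, which .genfromtxt() readily accepts. We also strip the trailing ; character from each line.

For your sample input (minus the . lines), this results in: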

array([ (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 77.0645361927206, -0.1001871531330136, -0.0013358287084401814, 4.523853439106942, 234.14575280979898, 123.92820420047076, 0.0, 0.6259920275663835),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 108.5705297969604, -0.1411462759900182, -0.001881950346533576, 1.0429122163754276, 144.1066875513379, 72.24884428367467, 0.0, 0.7017581019907897)], 
      dtype=[('family', '<f8'), ('genealogy', '<f8'), ('generation', '<f8'), ('birthday', '<f8'), ('biomass', '<f8'), ('inert', '<f8'), ('capsule', '<f8'), ('growthRate', '<f8'), ('volumeRate', '<f8'), ('locationX', '<f8'), ('locationY', '<f8'), ('locationZ', '<f8'), ('radius', '<f8'), ('totalRadius', '<f8')])
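The result above is a structured array with one named field per column, while the question asks for a plain 2-dimensional ndarray of positions and radii. A minimal sketch of that last step, assuming the four wanted columns are locationX, locationY, locationZ and radius; the in-memory SAMPLE string (with a made-up root element and shortened values) stands in for the real file:

```python
import numpy as np
from xml.etree import ElementTree as ET

# Stand-in for the real file; <root> is assumed and values are shortened.
SAMPLE = """<root>
<species name="MyHeterotrophEPS" header="family,genealogy,generation,birthday,biomass,inert,capsule,growthRate,volumeRate,locationX,locationY,locationZ,radius,totalRadius">
0,0,0,0.0,0.0,0.0,77.06,-0.10,-0.0013,4.52,234.15,123.93,0.0,0.63;
0,0,0,0.0,0.0,0.0,108.57,-0.14,-0.0019,1.04,144.11,72.25,0.0,0.70;
</species>
</root>"""

root = ET.fromstring(SAMPLE)
species = root.find(".//species[@name='MyHeterotrophEPS']")

# Structured array: one named float field per header column.
records = np.genfromtxt(
    (line.rstrip(';') for line in species.text.splitlines()),
    delimiter=',', names=species.attrib['header'])

# Pick the wanted fields and stack them side by side into a 2D float array.
wanted = ['locationX', 'locationY', 'locationZ', 'radius']
positions = np.column_stack([records[name] for name in wanted])
```

Here positions has one row per data line and one column per entry in wanted. Note that with a single data line genfromtxt returns a 0-dimensional record, so that case would need extra handling.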

score 2 · Accepted Answer

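If your XML is just that one species node, it's pretty simple, and Martijn Pieters has already explained it better than I could.

But if you have a large number of species nodes in the document, and it's too big to fit the whole thing in memory, you can use iterparse instead of parse: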

import numpy as np
import xml.etree.ElementTree as ET

for event, node in ET.iterparse('species.xml'):
    if node.tag == 'species':
        # Element attributes live in .attrib (there is no .attr)
        name = node.attrib['name']
        names = node.attrib['header']
        csvdata = (line.rstrip(';') for line in node.text.splitlines())
        array = np.genfromtxt(csvdata, delimiter=',', names=names)
        # do something with the array.
        node.clear()  # release this node's text once it has been processed

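This won't help if you have just one super-gigantic species node, because even iterparse (or a similar solution, like a SAX parser) parses one complete node at a time. You would need to find an XML library that lets you stream the text of a large node, and off the top of my head, I don't think any stdlib or popular third-party parser can do that.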
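For the many-nodes case, the iterparse loop can be tried end to end on a small in-memory document. This is only a sketch: the second species name, the root element, and the four-column headers are made up for illustration, and collecting the arrays in a dict keyed by name is just one way to "do something" with each array.

```python
import io
import numpy as np
import xml.etree.ElementTree as ET

# Hypothetical two-node document; headers shortened to four columns.
SAMPLE = b"""<root>
<species name="MyHeterotrophEPS" header="locationX,locationY,locationZ,radius">
4.52,234.15,123.93,0.0;
1.04,144.11,72.25,0.0;
</species>
<species name="MyOtherSpecies" header="locationX,locationY,locationZ,radius">
7.0,8.0,9.0,0.5;
2.0,3.0,4.0,0.25;
</species>
</root>"""

arrays = {}
# iterparse yields each element when its closing tag has been read.
for event, node in ET.iterparse(io.BytesIO(SAMPLE)):
    if node.tag != 'species':
        continue
    lines = (line.rstrip(';') for line in node.text.splitlines())
    arrays[node.attrib['name']] = np.genfromtxt(
        lines, delimiter=',', names=node.attrib['header'])
    node.clear()  # drop the node's text so it can be garbage collected
```

With a real file, pass the filename to ET.iterparse instead of the BytesIO object.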

score 0 · Accepted Answer

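If the file is really big, use ElementTree or SAX.

If the file is not that big (i.e. it fits in memory), minidom may be easier to work with.

Each line appears to be a simple string of comma-separated numbers, so you can simply do line.split(',').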
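A rough sketch of the line.split(',') approach, assuming the element's text and header attribute have already been pulled out of the XML by whichever parser you chose. The text and header variables below are stand-ins with shortened values, and the choice of columns is an assumption based on the question:

```python
import numpy as np

# Stand-ins for the element's text and its header attribute.
text = """
0,0,0,0.0,0.0,0.0,77.06,-0.10,-0.0013,4.52,234.15,123.93,0.0,0.63;
0,0,0,0.0,0.0,0.0,108.57,-0.14,-0.0019,1.04,144.11,72.25,0.0,0.70;
"""
header = ("family,genealogy,generation,birthday,biomass,inert,capsule,"
          "growthRate,volumeRate,locationX,locationY,locationZ,radius,"
          "totalRadius").split(',')

rows = []
for line in text.splitlines():
    line = line.strip().rstrip(';')  # drop whitespace and the trailing ';'
    if not line:
        continue  # skip blank lines around the data
    rows.append([float(value) for value in line.split(',')])

data = np.array(rows)  # plain 2D float array, one row per line

# Look up the column positions by name and keep only those columns.
cols = [header.index(n) for n in ('locationX', 'locationY', 'locationZ', 'radius')]
subset = data[:, cols]
```

This avoids genfromtxt entirely, at the cost of doing the float conversion and column lookup by hand.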
