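I am trying to use iterparse() to incrementally parse an XML document that is (by design) too large to fit in memory. I find that even a no-op pass over the document exhausts process memory and makes my system start swapping.

Is it wrong to expect xml.etree.ElementTree.iterparse() to run in constant memory, independent of the size of the XML document? If so, what is the recommended package for incrementally parsing an arbitrarily long XML document? If not, is there something wrong with my code?

Here is the code. Note that I request only 'start' events, so the parser should not try to buffer all the body elements before returning the end tag of the document's root element (<osm> in my case). I explicitly del the loop variables to force them to be released.

Thinking that the garbage collector might not get a chance to run because the loop never yields, I added a time.sleep() and an explicit gc.collect() call every million iterations. That did not help.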

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import xml.etree.ElementTree as ET
import pprint

import gc
import time
import os
import psutil

def gcStats(myProc):
    # return human readable gc.stats for 3 generations

    extmem = myProc.memory_info_ex()

    a =  "extmem: rss {:12n}, vms {:12n}, shared{:12n}, text{:12n}, lib {:12n}, data{:12n}, dirty{:12n}".format(
                    extmem.rss, extmem.vms, extmem.shared, extmem.text, extmem.lib, extmem.data, extmem.dirty)

    return a + "\tgc enabled {}, sumCount {:n}, lenGarbage {:n}".format( gc.isenabled(), sum(gc.get_count()), len(gc.garbage)) 

# the misbehaving code:    
def count_tags(filename):
    retVal = {}
    iterCount = 0
    sleepTime = 2.0
    myProc = psutil.Process()

    print("Starting: gc.isenabled() == {}\n{}".format(gc.isenabled(), gcStats( myProc)))

    for event, element in ET.iterparse(filename, ('start',)):
        assert event == 'start'
        if iterCount % 1000000 == 0:
            print('{} iterations, sleeping {} sec...'.format(iterCount, sleepTime))
            time.sleep( sleepTime)
            print('{}\nNow starting gc pass...'.format( gcStats( myProc)))
            gcr = gc.collect()
            print('gc returned {}'.format( gcr))
        iterCount += 1
        del element
        del event
    return retVal


if __name__ == "__main__":
    tags = count_tags('/home/bobhy/MOOC_Data/' + 'chicago.osm')

Here is a sample of the document. It is well-formed OSM data.

<?xml version='1.0' encoding='UTF-8'?>
<osm version="0.6" generator="Osmosis 0.43.1">
  <bounds minlon="-88.50500" minlat="41.33900" maxlon="-87.06600" maxlat="42.29700" origin="http://www.openstreetmap.org/api/0.6"/>
  <node id="219850" version="54" timestamp="2011-04-06T05:17:15Z" uid="207745" user="NE2" changeset="7781188" lat="41.7585879" lon="-87.9101245">
    <tag k="exit_to" v="Joliet Road"/>
    <tag k="highway" v="motorway_junction"/>
    <tag k="ref" v="276C"/>
  </node>
  <node id="219851" version="47" timestamp="2011-04-06T05:18:47Z" uid="207745" user="NE2" changeset="7781188" lat="41.7593116" lon="-87.9076432">
    <tag k="exit_to" v="North I-294 ; Tri-State Tollway;  Wisconsin"/>
    <tag k="highway" v="motorway_junction"/>
    <tag k="ref" v="277A"/>
  </node>
  <node id="219871" version="1" timestamp="2006-04-15T00:34:03Z" uid="229" user="LA2" changeset="3725" lat="41.932278" lon="-87.9179332"/>
  <node id="700724" version="14" timestamp="2009-04-13T11:21:51Z" uid="18480" user="nickvet419" changeset="485405" lat="41.7120272" lon="-88.0158606"/>

. . . and so on for 1.8 GB . . .

  <relation id="3366425" version="1" timestamp="2013-12-07T21:37:35Z" uid="239998" user="Sundance" changeset="19330301">
    <member type="way" ref="250651738" role="outer"/>
    <member type="way" ref="250651748" role="inner"/>
    <tag k="type" v="multipolygon"/>
  </relation>
  <relation id="3378994" version="1" timestamp="2013-12-14T22:24:26Z" uid="371121" user="AndrewSnow" changeset="19456337">
    <member type="way" ref="251850076" role="outer"/>
    <member type="way" ref="251850073" role="inner"/>
    <member type="way" ref="251850074" role="inner"/>
    <member type="way" ref="251850075" role="inner"/>
    <tag k="type" v="multipolygon"/>
  </relation>
  <relation id="3382796" version="1" timestamp="2013-12-17T03:21:18Z" uid="567034" user="Umbugbene" changeset="19492258">
    <member type="way" ref="252225400" role="outer"/>
    <member type="way" ref="252225404" role="inner"/>
    <tag k="type" v="multipolygon"/>
  </relation>
</osm>

Here is the output:

Starting: gc.isenabled() == True
extmem: rss      9097216, vms     37199872, shared     3145728, text     3301376, lib            0, data     5820416, dirty           0 gc enabled True, sumCount 410, lenGarbage 0
0 iterations, sleeping 2.0 sec...
extmem: rss      9097216, vms     37335040, shared     3145728, text     3301376, lib            0, data     5955584, dirty           0 gc enabled True, sumCount 87, lenGarbage 0
Now starting gc pass...
gc returned 0
1000000 iterations, sleeping 2.0 sec...
extmem: rss   1234309120, vms   1262891008, shared     3280896, text     3301376, lib            0, data  1231511552, dirty           0 gc enabled True, sumCount 372, lenGarbage 0
Now starting gc pass...
gc returned 0
2000000 iterations, sleeping 2.0 sec...
extmem: rss   2495262720, vms   2524073984, shared     3280896, text     3301376, lib            0, data  2492694528, dirty           0 gc enabled True, sumCount 37, lenGarbage 0
Now starting gc pass...
gc returned 0
3000000 iterations, sleeping 2.0 sec...
extmem: rss   3781947392, vms   3812208640, shared     3280896, text     3301376, lib            0, data  3780829184, dirty           0 gc enabled True, sumCount 262, lenGarbage 0
Now starting gc pass...
gc returned 0
4000000 iterations, sleeping 2.0 sec...
extmem: rss   5067837440, vms   5096787968, shared     3280896, text     3301376, lib            0, data  5065408512, dirty           0 gc enabled True, sumCount 241, lenGarbage 0
Now starting gc pass...
gc returned 0
5000000 iterations, sleeping 2.0 sec...
extmem: rss   6345998336, vms   6375632896, shared     3063808, text     3301376, lib            0, data  6344253440, dirty           0 gc enabled True, sumCount 333, lenGarbage 0
Now starting gc pass...
gc returned 0
6000000 iterations, sleeping 2.0 sec...
extmem: rss   7266795520, vms   7665147904, shared     1060864, text     3301376, lib            0, data  7633768448, dirty           0 gc enabled True, sumCount 877, lenGarbage 0
Now starting gc pass...

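I interpret the output as showing the process's virtual memory growing at about 1,000 B per iteration (that is, per XML tag parsed). As far as I can tell, the garbage-collection statistics do not show a monotonic increase in allocated objects, so I don't know where the memory growth is coming from. Garbage collection is definitely enabled.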
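For reference, the growth rate can be checked directly from the vms figures in the output. This is only a small sketch; the helper name growth_per_iteration is made up for illustration, and the numbers are copied from the output above.

```python
# vms values (bytes) copied from the output above, sampled every million iterations.
vms_samples = [
    37335040, 1262891008, 2524073984, 3812208640,
    5096787968, 6375632896, 7665147904,
]
ITERATIONS_PER_SAMPLE = 1000000


def growth_per_iteration(samples, step):
    """Return the average growth in bytes per iteration between consecutive samples."""
    return [(later - earlier) / step for earlier, later in zip(samples, samples[1:])]


rates = growth_per_iteration(vms_samples, ITERATIONS_PER_SAMPLE)
for n, rate in enumerate(rates, start=1):
    print("up to {:>9,} iterations: {:,.0f} B per iteration".format(
        n * ITERATIONS_PER_SAMPLE, rate))
```

Each interval works out to somewhere between 1,200 and 1,300 bytes per iteration, so the growth is steady rather than bursty.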

2 Answers

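You need to explicitly clear elements you no longer need by calling element.clear(); otherwise they linger in memory. This means you probably also want to listen for 'end' events and call clear() once you reach the end of an enclosing element whose contents you know you no longer need.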
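A minimal sketch of that idea, assuming you only need a count per tag and that every direct child of the root can be thrown away as soon as it has been fully parsed. The function name count_tags and the tiny SAMPLE document are illustrative, not taken from the question's data file.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Count element tags while discarding finished subtrees."""
    counts = Counter()
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the 'start' of the root element; keep a handle on it.
    _, root = next(context)
    counts[root.tag] += 1
    depth = 1  # the root element is currently open
    for event, elem in context:
        if event == "start":
            counts[elem.tag] += 1
            depth += 1
        else:  # 'end' event
            depth -= 1
            if depth == 1:
                # A direct child of the root is complete: drop everything
                # the root has accumulated so far.
                root.clear()
    return counts


# Illustrative input only; pass a filename or an open file for real data.
SAMPLE = b"""<osm version="0.6">
  <node id="1"><tag k="a" v="b"/><tag k="c" v="d"/></node>
  <node id="2"/>
</osm>"""

tag_counts = count_tags(io.BytesIO(SAMPLE))
print(dict(tag_counts))
```

Clearing the root rather than only the finished element matters here: elem.clear() empties an element, but the emptied element is still attached to its parent, so a long run of siblings would keep accumulating under the root.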

Answered 2015-02-09T17:29:00.460

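A careful reading of the iterparse() documentation leads me to believe that the behavior above is expected. The documentation says it returns fully fledged elements, with no restrictions on access to children, so it has to keep the (incrementally growing) document tree in memory.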
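This is easy to confirm on a small document. The sketch below (with a made-up 1000-node input) requests only 'start' events and deletes the loop variable, as in the question, and then inspects the root attribute that the iterparse() iterator exposes once the source has been fully read.

```python
import io
import xml.etree.ElementTree as ET

# Made-up input: a root element with 1000 empty children.
doc = io.BytesIO(b"<osm>" + b"<node/>" * 1000 + b"</osm>")

context = ET.iterparse(doc, events=("start",))
for event, elem in context:
    del elem  # drops only the local name, not the element itself

# After the iterator is exhausted, context.root is the root of the full tree.
children_kept = len(context.root)
print(children_kept)
```

Every child is still attached to the root afterwards, so del in the loop body cannot release anything: the tree being built holds its own references.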

Since my problem doesn't need access to parent or child elements, only an event for each tag encountered, I was able to solve it nicely with xml.etree.ElementTree.XMLParser().
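Roughly, the approach looks like the sketch below. It is not my exact code: the TagCounter class, the count_tags_streaming function, and the chunk size are names and values chosen for illustration. The idea is to give XMLParser a custom target object, so the parser calls its methods for each tag and no element tree is built at all.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


class TagCounter:
    """Parser target that counts start tags and builds no tree."""

    def __init__(self):
        self.counts = Counter()

    def start(self, tag, attrib):
        # Called for every opening tag.
        self.counts[tag] += 1

    def end(self, tag):
        pass  # closing tags are not needed here

    def data(self, data):
        pass  # text content is ignored

    def close(self):
        # Whatever is returned here is returned by XMLParser.close().
        return self.counts


def count_tags_streaming(fileobj, chunk_size=64 * 1024):
    """Feed the file to the parser in fixed-size chunks."""
    parser = ET.XMLParser(target=TagCounter())
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    return parser.close()


# Illustrative input only; open the real file in binary mode instead.
SAMPLE = b"""<osm version="0.6">
  <node id="1"><tag k="a" v="b"/></node>
  <node id="2"/>
</osm>"""

streamed_counts = count_tags_streaming(io.BytesIO(SAMPLE))
print(dict(streamed_counts))
```

Memory use then depends on the chunk size and on what the target chooses to keep, not on the size of the document.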

Answered 2014-04-21T02:28:18.913