I am trying to parse a very large XML file (about 1 GB) that has the following format:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE candidates SYSTEM "dtd/mwetoolkit-candidates.dtd">
<!-- MWETOOLKIT: filetype="XML" -->
<candidates>
<meta>
    <corpussize name="ukwac-01" value="38224449" />
    <corpussize name="sum" value="38224449" />
</meta>
<cand candid="2">
    <ngram><w lemma="executive" pos="JJ" ><freq name="ukwac-01" value="600" /><freq name="sum" value="600" /></w> <w lemma="box" pos="NNS" ><freq name="ukwac-01" value="1006" /><freq name="sum" value="1006" /></w> <freq name="ukwac-01" value="9" /><freq name="sum" value="9" /></ngram>
    <occurs>
    <ngram><w surface="Executive" lemma="executive" pos="JJ" /> <w surface="boxes" lemma="box" pos="NNS" /> <freq name="ukwac-01" value="1" /></ngram>
    <ngram><w surface="executive" lemma="executive" pos="JJ" /> <w surface="boxes" lemma="box" pos="NNS" /> <freq name="ukwac-01" value="8" /></ngram>
    </occurs>
</cand>
<cand candid="5">
    <ngram><w lemma="bad" pos="JJ" ><freq name="ukwac-01" value="4094" /><freq name="sum" value="4094" /></w> <w lemma="thing" pos="NN" ><freq name="ukwac-01" value="6609" /><freq name="sum" value="6609" /></w> <freq name="ukwac-01" value="119" /><freq name="sum" value="119" /></ngram>
    <occurs>
    <ngram><w surface="bad" lemma="bad" pos="JJ" /> <w surface="thing" lemma="thing" pos="NN" /> <freq name="ukwac-01" value="115" /></ngram>
    <ngram><w surface="Bad" lemma="bad" pos="JJ" /> <w surface="thing" lemma="thing" pos="NN" /> <freq name="ukwac-01" value="4" /></ngram>
    </occurs>
</cand>
</candidates>

So far I have this code:

from lxml import etree
import sys

def fast_iter(context, func):
    #http://www.ibm.com.br/developerworks/xml/library/x-hiperfparse/
    #Author = Liza Daly
    for event, elem in context:       
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context


def print_csv(element):
    if element.tag == 'cand':
        lemmas = []
        compound_freqs = []
        mweval = 0
        for f in element.xpath('ngram/freq'):
            if f.attrib['name'] == 'ukwac':
                mweval = int(f.attrib['value'])
        for w in element.xpath('ngram/w'):
            lemmas.append(w.attrib['lemma'])
        for freq in element.xpath('ngram/w/freq'):
            if freq.attrib['name'] == 'ukwac':
                compound_freqs.append(int(freq.attrib['value']))
        print(' '.join(lemmas),mweval,sep='\t',end='\t')
        [print(l,f,sep=":",end='') for l,f in zip(lemmas,compound_freqs)]
        print()


if __name__ == '__main__':
    args = sys.argv
    context = etree.iterparse(args[1], events=("start", "end"))
    print("mwe","mwe_freq","compounds",sep='\t')
    for event, element in context:
        if element.tag == "candidates":
            fast_iter(context, print_csv)

The desired output is a CSV file in this format:

mwe        mwe_freq    compounds
executive box    9    executive:600,box:1006
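
To make the target format concrete, here is a minimal sketch of writing such rows with the standard library's csv module. The helper name `write_rows` and the `(lemmas, mwe_freq, word_freqs)` row shape are assumptions made for illustration only; they are not part of the script in this question.

```python
import csv
import io


def write_rows(rows, out):
    """Write a header plus one tab-separated line per row (hypothetical helper).

    Each row is assumed to be a (lemmas, mwe_freq, word_freqs) tuple.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["mwe", "mwe_freq", "compounds"])
    for lemmas, mwe_freq, word_freqs in rows:
        # Pair each lemma with its own frequency, e.g. "executive:600,box:1006".
        compounds = ",".join(
            f"{lemma}:{freq}" for lemma, freq in zip(lemmas, word_freqs)
        )
        writer.writerow([" ".join(lemmas), mwe_freq, compounds])


# Values taken from the first <cand> in the sample file above.
buffer = io.StringIO()
write_rows([(["executive", "box"], 9, [600, 1006])], buffer)
print(buffer.getvalue(), end="")
```

Because the delimiter is a tab, the comma inside the compounds column does not trigger quoting.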

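The exact print format may (and will) change. The problem is that, for some reason, once I am inside the print function and past the element.tag check, the freq elements are empty, and all I end up printing is their addresses. I know from the iterparse documentation that I should be checking for the end event somewhere, but I tried putting that check in fast_iter and it definitely did not work.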

My current output:

mwe     mwe_freq        compounds
<Element freq at 0x7f8735342c48>
<Element freq at 0x7f8735342c88>
executive box   0
        0
<Element freq at 0x7f8735346708>
<Element freq at 0x7f87353467c8>
bad thing       0
        0

Any kind of help is much appreciated.
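
For reference, a minimal sketch of the end-event approach mentioned above, not a definitive implementation. It rests on a few assumptions: only "end" events are requested, so each cand element is handled after its closing tag has been parsed; the frequency name is compared against "ukwac-01" as it appears in the sample file (the code above compares against 'ukwac'); and `iter_candidates` and `SAMPLE` are hypothetical names introduced for illustration. The fallback import exists only so the sketch still runs where lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs.
    import xml.etree.ElementTree as etree


def iter_candidates(source, corpus="ukwac-01"):
    """Yield (lemmas, mwe_freq, word_freqs) for every <cand> in source."""
    # With only "end" events, an element is handed over after its closing
    # tag, so its children and their attributes are fully parsed.
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag != "cand":
            continue
        lemmas, word_freqs = [], []
        # "ngram/w" only matches the candidate's own ngram, not occurs/ngram.
        for w in elem.findall("ngram/w"):
            lemmas.append(w.get("lemma"))
            for freq in w.findall("freq"):
                if freq.get("name") == corpus:
                    word_freqs.append(int(freq.get("value")))
        mwe_freq = 0
        for freq in elem.findall("ngram/freq"):
            if freq.get("name") == corpus:
                mwe_freq = int(freq.get("value"))
        yield lemmas, mwe_freq, word_freqs
        # Release the processed candidate to keep memory use low.
        elem.clear()
        if hasattr(elem, "getprevious"):  # lxml only: drop earlier siblings
            while elem.getprevious() is not None:
                del elem.getparent()[0]


# A shortened copy of the sample file, used only to exercise the sketch.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<candidates>
<meta><corpussize name="ukwac-01" value="38224449" /></meta>
<cand candid="2">
    <ngram><w lemma="executive" pos="JJ"><freq name="ukwac-01" value="600" /></w> <w lemma="box" pos="NNS"><freq name="ukwac-01" value="1006" /></w> <freq name="ukwac-01" value="9" /></ngram>
    <occurs>
    <ngram><w surface="Executive" lemma="executive" pos="JJ" /> <w surface="boxes" lemma="box" pos="NNS" /> <freq name="ukwac-01" value="1" /></ngram>
    </occurs>
</cand>
</candidates>
"""

for lemmas, mwe_freq, word_freqs in iter_candidates(io.BytesIO(SAMPLE)):
    print(lemmas, mwe_freq, word_freqs)
```

The data is extracted into plain lists before the element is cleared, so nothing downstream holds a reference to a half-freed tree. For the real file, a path such as sys.argv[1] can be passed in place of the BytesIO object.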
