I'm trying to parse a very large XML file (about 1 GB) that has the following format:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE candidates SYSTEM "dtd/mwetoolkit-candidates.dtd">
<!-- MWETOOLKIT: filetype="XML" -->
<candidates>
<meta>
<corpussize name="ukwac-01" value="38224449" />
<corpussize name="sum" value="38224449" />
</meta>
<cand candid="2">
<ngram><w lemma="executive" pos="JJ" ><freq name="ukwac-01" value="600" /><freq name="sum" value="600" /></w> <w lemma="box" pos="NNS" ><freq name="ukwac-01" value="1006" /><freq name="sum" value="1006" /></w> <freq name="ukwac-01" value="9" /><freq name="sum" value="9" /></ngram>
<occurs>
<ngram><w surface="Executive" lemma="executive" pos="JJ" /> <w surface="boxes" lemma="box" pos="NNS" /> <freq name="ukwac-01" value="1" /></ngram>
<ngram><w surface="executive" lemma="executive" pos="JJ" /> <w surface="boxes" lemma="box" pos="NNS" /> <freq name="ukwac-01" value="8" /></ngram>
</occurs>
</cand>
<cand candid="5">
<ngram><w lemma="bad" pos="JJ" ><freq name="ukwac-01" value="4094" /><freq name="sum" value="4094" /></w> <w lemma="thing" pos="NN" ><freq name="ukwac-01" value="6609" /><freq name="sum" value="6609" /></w> <freq name="ukwac-01" value="119" /><freq name="sum" value="119" /></ngram>
<occurs>
<ngram><w surface="bad" lemma="bad" pos="JJ" /> <w surface="thing" lemma="thing" pos="NN" /> <freq name="ukwac-01" value="115" /></ngram>
<ngram><w surface="Bad" lemma="bad" pos="JJ" /> <w surface="thing" lemma="thing" pos="NN" /> <freq name="ukwac-01" value="4" /></ngram>
</occurs>
</cand>
</candidates>
So far, I have this code:
from lxml import etree
import sys

def fast_iter(context, func):
    # http://www.ibm.com.br/developerworks/xml/library/x-hiperfparse/
    # Author = Liza Daly
    for event, elem in context:
        func(elem)
        # Free the processed element and its already-seen siblings
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def print_csv(element):
    if element.tag == 'cand':
        lemmas = []
        compound_freqs = []
        mweval = 0
        for f in element.xpath('ngram/freq'):
            if f.attrib['name'] == 'ukwac':
                mweval = int(f.attrib['value'])
        for w in element.xpath('ngram/w'):
            lemmas.append(w.attrib['lemma'])
        for freq in element.xpath('ngram/w/freq'):
            if freq.attrib['name'] == 'ukwac':
                compound_freqs.append(int(freq.attrib['value']))
        print(' '.join(lemmas),mweval,sep='\t',end='\t')
        [print(l,f,sep=":",end='') for l,f in zip(lemmas,compound_freqs)]
        print()

if __name__ == '__main__':
    args = sys.argv
    context = etree.iterparse(args[1], events=("start", "end"))
    print("mwe","mwe_freq","compounds",sep='\t')
    for event, element in context:
        if element.tag == "candidates":
            fast_iter(context, print_csv)
The desired output is a CSV file of the form:
mwe mwe_freq compounds
executive box 9 executive:600,box:1006
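The exact print format may (and will) change, but for some reason, once I get into the print function and pass the element.tag check, the freq elements are empty, and all I print is their addresses. I know from the iterparse documentation that I should check for the end event somewhere, but I tried putting one in fast_iter and that definitely did not work.
My current output: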
mwe mwe_freq compounds
<Element freq at 0x7f8735342c48>
<Element freq at 0x7f8735342c88>
executive box 0
0
<Element freq at 0x7f8735346708>
<Element freq at 0x7f87353467c8>
bad thing 0
0
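For reference, the end-event check mentioned above could be sketched roughly as follows. This is only a minimal sketch under some assumptions: it asks iterparse for "end" events only (an element's children are complete only once its end tag has been read) and restricts them to cand elements with the tag argument; it compares against the corpus name ukwac-01 as it appears in the sample file rather than ukwac; and the names candidate_rows and SAMPLE are hypothetical, introduced here purely for illustration.

```python
import io
from lxml import etree

# Hypothetical, shortened sample in the same shape as the file above.
SAMPLE = b"""<candidates>
<meta><corpussize name="ukwac-01" value="38224449" /></meta>
<cand candid="2">
<ngram><w lemma="executive" pos="JJ"><freq name="ukwac-01" value="600" /></w> <w lemma="box" pos="NNS"><freq name="ukwac-01" value="1006" /></w> <freq name="ukwac-01" value="9" /></ngram>
<occurs><ngram><w surface="boxes" lemma="box" pos="NNS" /> <freq name="ukwac-01" value="8" /></ngram></occurs>
</cand>
</candidates>"""

def candidate_rows(source, corpus="ukwac-01"):
    """Yield one tab-separated row per <cand>, freeing memory as we go."""
    # "end" events fire after the whole subtree of the element has been parsed.
    context = etree.iterparse(source, events=("end",), tag="cand")
    for _event, cand in context:
        lemmas = [w.get("lemma") for w in cand.xpath("ngram/w")]
        # $n is an XPath variable, filled in from the keyword argument n.
        word_freqs = [int(f.get("value"))
                      for f in cand.xpath("ngram/w/freq[@name=$n]", n=corpus)]
        mwe = cand.xpath("ngram/freq[@name=$n]/@value", n=corpus)
        mwe_freq = int(mwe[0]) if mwe else 0
        compounds = ",".join(f"{l}:{f}" for l, f in zip(lemmas, word_freqs))
        yield "\t".join([" ".join(lemmas), str(mwe_freq), compounds])
        # Same cleanup as fast_iter: drop this element and earlier siblings.
        cand.clear()
        while cand.getprevious() is not None:
            del cand.getparent()[0]

print("mwe", "mwe_freq", "compounds", sep="\t")
for row in candidate_rows(io.BytesIO(SAMPLE)):
    print(row)
```

For the real file, a path such as sys.argv[1] could be passed in place of the BytesIO object. All values are extracted before the element is cleared, so the cleanup should not affect the rows that were already produced.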
Any help would be greatly appreciated.