At MongoNYC 2013, a speaker mentioned that they use a copy of Wikipedia to test their full-text search. I tried to replicate this myself, and I'm finding it isn't straightforward because of the file's size and format.
Here's what I'm doing:
$ wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
$ bunzip2 enwiki-latest-pages-articles.xml.bz2
$ python
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('enwiki-latest-pages-articles.xml')
Killed
When I try to parse it with the standard XML parser, Python chokes on the size of the xml file. Does anyone have any other suggestions on how to convert a 9GB XML file into something JSON-like that I can load into mongoDB?
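For context, the pattern generally recommended for files this size is a streaming parse that clears each element as it goes and emits one JSON document per line (a format `mongoimport` can ingest). A minimal sketch with the standard library, using a tiny made-up sample in place of the real dump (the `<page>`/`<title>` structure here is illustrative only):

```python
import io
import json
import xml.etree.ElementTree as ET

# Tiny stand-in for the 9GB dump (hypothetical structure, for illustration).
sample = io.BytesIO(b"""<mediawiki>
  <page><title>A</title><id>1</id></page>
  <page><title>B</title><id>2</id></page>
</mediawiki>""")

docs = []
# iterparse streams the file instead of building the whole tree in memory.
for event, elem in ET.iterparse(sample, events=("end",)):
    if elem.tag == "page":
        docs.append({"title": elem.findtext("title"), "id": elem.findtext("id")})
        elem.clear()  # free the finished element so memory stays bounded

# One JSON document per line is what mongoimport expects.
print("\n".join(json.dumps(d) for d in docs))
```

On the real dump you would open the file by path instead of a `BytesIO`, and write the lines out to a `.json` file for `mongoimport`.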
Update 1
Following Sean's suggestion below, I tried the iterated element tree as well:
>>> import xml.etree.ElementTree as ET
>>> context = ET.iterparse('enwiki-latest-pages-articles.xml', events=("start", "end"))
>>> context = iter(context)
>>> event, root = context.next()
>>> for i in context[0:10]:
...     print(i)
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_IterParseIterator' object has no attribute '__getitem__'
>>> for event, elem in context[0:10]:
...     if event == "end" and elem.tag == "record":
...         print(elem)
...     root.clear()
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_IterParseIterator' object has no attribute '__getitem__'
Again, no luck.
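(The `TypeError` above is because `iterparse` returns an iterator, and iterators don't support slicing. If the goal was just to peek at the first few events, `itertools.islice` does that lazily. A small self-contained sketch, using a toy document rather than the dump:)

```python
import io
import itertools
import xml.etree.ElementTree as ET

# Toy document standing in for the dump.
sample = io.BytesIO(b"<root><rec>1</rec><rec>2</rec><rec>3</rec></root>")
context = ET.iterparse(sample, events=("start", "end"))

# An iterator can't be indexed, but islice takes the first n events lazily.
first_events = list(itertools.islice(context, 4))
for event, elem in first_events:
    print(event, elem.tag)
```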
Update 2
Following up on Asya Kamsky's suggestion below.
Here's the attempt with xml2json:
$ git clone https://github.com/hay/xml2json.git
$ ./xml2json/xml2json.py -t xml2json -o enwiki-latest-pages-articles.json enwiki-latest-pages-articles.xml
Traceback (most recent call last):
  File "./xml2json/xml2json.py", line 199, in <module>
    main()
  File "./xml2json/xml2json.py", line 181, in main
    input = open(arguments[0]).read()
MemoryError
Here's xmlutils:
$ pip install xmlutils
$ xml2json --input "enwiki-latest-pages-articles.xml" --output "enwiki-latest-pages-articles.json"
xml2sql by Kailash Nadh (http://nadh.in)
--help for help
Wrote to enwiki-latest-pages-articles.json
But the contents were only one record; it didn't iterate.
xmltodict also looked promising: it advertises using iterative Expat and says it's good for Wikipedia. But it too ran out of memory after 20 minutes or so:
>>> import xmltodict
>>> f = open('enwiki-latest-pages-articles.xml')
>>> doc = xmltodict.parse(f)
Killed
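(In fairness to xmltodict, its iterative mode apparently has to be asked for explicitly: calling `parse` with `item_depth` and an `item_callback` streams one item at a time instead of building the whole dict. A sketch of that mode on a toy document; the `<page>` structure is illustrative, not the real dump schema:)

```python
import xmltodict  # third-party: pip install xmltodict

titles = []

def handle_page(path, page):
    # path holds (tag, attrs) pairs down to this item;
    # page is the dict parsed from one element at item_depth.
    titles.append(page.get("title"))
    return True  # returning True tells xmltodict to keep streaming

sample = b"""<mediawiki>
  <page><title>A</title></page>
  <page><title>B</title></page>
</mediawiki>"""

# item_depth=2 fires the callback for each element two levels deep
# (each <page> here) without holding the whole document in memory.
xmltodict.parse(sample, item_depth=2, item_callback=handle_page)
print(titles)
```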
Update 3
This is in response to Ross's answer below, modeling my parser on the link he mentions:
from lxml import etree

file = 'enwiki-latest-pages-articles.xml'

def page_handler(page):
    try:
        print page.get('title','').encode('utf-8')
    except:
        print page
        print "error"

class page_handler(object):
    def __init__(self):
        self.text = []
    def start(self, tag, attrib):
        self.is_title = True if tag == 'title' else False
    def end(self, tag):
        pass
    def data(self, data):
        if self.is_title:
            self.text.append(data.encode('utf-8'))
    def close(self):
        return self.text

def fast_iter(context, func):
    for event, elem in context:
        print(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

process_element = etree.XMLParser(target=page_handler())
context = etree.iterparse(file, tag='item')
fast_iter(context, process_element)
The error was:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in fast_iter
  File "iterparse.pxi", line 484, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:112653)
  File "iterparse.pxi", line 537, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:113223)
  File "parser.pxi", line 596, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:83186)
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 22, column 1
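One thing I've since noticed, which may or may not be the whole story: MediaWiki dumps put their elements in a namespace (something like `http://www.mediawiki.org/xml/export-0.8/`, depending on the dump version), so a bare `tag='item'` in `iterparse` would never match anything. With lxml the tag filter has to use Clark notation (`{namespace}tag`). A self-contained sketch of that, with a made-up namespace standing in for the real one:

```python
import io
from lxml import etree  # third-party: pip install lxml

# Hypothetical namespace for illustration; a real dump declares its own.
NS = "http://example.org/export"
sample = io.BytesIO(
    b'<mediawiki xmlns="http://example.org/export">'
    b'<page><title>A</title></page>'
    b'<page><title>B</title></page>'
    b'</mediawiki>'
)

titles = []
# The tag filter must include the namespace in Clark notation;
# a bare 'page' (or 'item') silently matches nothing.
for event, elem in etree.iterparse(sample, tag="{%s}page" % NS):
    titles.append(elem.findtext("{%s}title" % NS))
    elem.clear()
print(titles)
```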