I tried some of the commands mentioned here:
https://tech.marksblogg.com/working-with-data-feeds.html
However, the xmltodict module does not seem to work as expected:
wget https://dumps.wikimedia.org/enwiki/20210801/enwiki-20210801-pages-articles2.xml-p41243p151573.bz2
bunzip2 enwiki-20210801-pages-articles2.xml-p41243p151573.bz2
git clone https://github.com/martinblech/xmltodict.git
cat enwiki-20210801-pages-articles2.xml-p41243p151573 | xmltodict/xmltodict.py 2 > save.txt
Is there another way to convert XML to a Python dict?
I have verified that the following works as expected:
# python
Python 3.9.5 (default, May 12 2021, 14:30:06)
[GCC 8.3.0] on linux
>>> import xmltodict
>>> xml = """<DECL>!! आप की सेवा में पुनः पधारे !!</DECL>"""
>>> xmltodict.parse(xml, process_namespaces=True)
OrderedDict([('DECL', '!! आप की सेवा में पुनः पधारे !!')])
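But it does not work on the file above, possibly because the file is too large.
I tried a command similar to the one in the tutorial mentioned above: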
# cat enwiki-20210801-pages-articles2.xml-p41243p151573 | xmltodict/xmltodict.py 2 | python /tmp/dump_pages.py
Traceback (most recent call last):
  File "/tmp/dump_pages.py", line 7, in <module>
    _, page = marshal.load(sys.stdin)
TypeError: file.read() returned not bytes but str
Traceback (most recent call last):
  File "/tmp/stack/xmltodict/xmltodict.py", line 533, in <module>
    root = parse(stdin,
  File "/tmp/stack/xmltodict/xmltodict.py", line 368, in parse
    parser.ParseFile(xml_input)
  File "/usr/src/python/Modules/pyexpat.c", line 461, in EndElement
  File "/tmp/stack/xmltodict/xmltodict.py", line 132, in endElement
    should_continue = self.item_callback(self.path, item)
  File "/tmp/stack/xmltodict/xmltodict.py", line 529, in handle_item
    marshal.dump((path, item), stdout)
BrokenPipeError: [Errno 32] Broken pipe
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe
Contents of the dump script:
# cat /tmp/dump_pages.py
import json
import marshal
import sys

while True:
    try:
        _, page = marshal.load(sys.stdin)
        print(json.dumps(page))
    except EOFError:
        break
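A likely cause of the TypeError above: under Python 3, sys.stdin is a text stream, while marshal.load needs a binary one. sys.stdin.buffer is the underlying binary stream. The BrokenPipeError on the xmltodict side would then just be a consequence of the reader exiting early. The sketch below is a minimal variant of the script under that assumption; the function name dump_pages and the in-memory demo records are hypothetical and only there for illustration.

```python
import io
import json
import marshal


def dump_pages(stream, out):
    """Read marshal-encoded (path, item) tuples from a binary stream
    and write each item to `out` as one JSON line.
    Returns the number of records written."""
    count = 0
    while True:
        try:
            # marshal.load needs a stream whose read() returns bytes.
            _, page = marshal.load(stream)
        except EOFError:
            # Raised once the stream has no more records.
            break
        out.write(json.dumps(page) + "\n")
        count += 1
    return count


# In-memory demo with two made-up records standing in for the pipe input.
demo = marshal.dumps((["page"], {"title": "X"})) + marshal.dumps((["page"], {"title": "Y"}))
result = io.StringIO()
dump_pages(io.BytesIO(demo), result)
print(result.getvalue(), end="")
```

In the pipeline from the question, the call would presumably be dump_pages(sys.stdin.buffer, sys.stdout) after import sys, so that the bytes written by xmltodict.py are read back as bytes.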
I just want to convert the Wikipedia XML dump to CSV (only certain columns).
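For that goal, one option that avoids xmltodict and marshal altogether is to stream the dump with the standard library's xml.etree.ElementTree.iterparse and write rows with csv. This is only a sketch under assumptions: that each article is a page element whose direct children include title and id, and that the namespace looks like the one in the sample. The helper names local_name and pages_to_csv and the two sample pages are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def pages_to_csv(xml_source, csv_out, columns=("title", "id")):
    """Stream <page> elements from xml_source and write the chosen
    direct-child fields to csv_out. Returns the number of pages written."""
    writer = csv.writer(csv_out)
    writer.writerow(columns)
    rows = 0
    # "end" events fire once an element has been read completely.
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        # Map each direct child of <page> to its text content.
        values = {local_name(child.tag): (child.text or "") for child in elem}
        writer.writerow([values.get(col, "") for col in columns])
        rows += 1
        # Free the page's children; the emptied element stays attached to the root.
        elem.clear()
    return rows


# Tiny made-up sample standing in for the real dump file.
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Alpha</title><ns>0</ns><id>1</id></page>
  <page><title>Beta</title><ns>0</ns><id>2</id></page>
</mediawiki>"""
out = io.StringIO()
pages_to_csv(io.BytesIO(sample), out)
print(out.getvalue(), end="")
```

For the real file, xml_source could be the decompressed dump opened with open(path, "rb") and csv_out a file opened with newline="". Fields nested deeper than the direct children of page, such as the revision text, would need extra lookup code.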