python - Extracting information from XML files in Python

Question

I want to extract information from several XML files like this one:

https://github.com/peldszus/arg-microtexts/blob/master/corpus/en/micro_b001.xml

I only want to extract the information in this tag:

<arggraph id="micro_b001" topic_id="waste_separation" stance="pro">

That is: "micro_b001" and "waste_separation".

I want to save them in a list.

I tried this:

import os
import xml.etree.ElementTree as ET

myList = []
myEdgesList = []
# Walk path (defined elsewhere) and parse every .xml file found
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith('.xml'):
            with open(os.path.join(root, file), encoding="UTF-8") as content:
                tree = ET.parse(content)
                myList.append(tree)

The code above works; for each file it gives an object like this:

<xml.etree.ElementTree.ElementTree at 0x21c893e34c0>,

But this part does not seem to be right:

for k in myList:
    arg= [e.attrib['stance'] for e in k.findall('.//arggraph')  ]
    print(arg)

The second snippet does not give me the values I need.

score 0 · Accepted Answer

One way to handle this:

from lxml import etree
tree = etree.parse('myfile.xml')  # the file name must be passed as a string
for graph in tree.xpath('//arggraph'):
    print(graph.xpath('@id')[0])
    print(graph.xpath('@topic_id')[0])

Output:

micro_b001
waste_separation
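
For comparison, the snippet in the question probably returns empty lists because ElementTree's `findall('.//arggraph')` only searches below the root element, and `arggraph` appears to be the root of these files (an assumption based on the tag shown in the question). A minimal standard-library sketch, using a made-up in-memory document in place of a real corpus file:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a corpus file; only the opening tag is taken from the question.
SAMPLE = (
    '<arggraph id="micro_b001" topic_id="waste_separation" stance="pro">'
    '<edu id="e1">some text</edu>'
    '</arggraph>'
)

root = ET.fromstring(SAMPLE)

# './/arggraph' matches only descendants of the root, so the root itself is missed.
missed = root.findall('.//arggraph')

# The attributes of the root element can be read directly.
info = [root.get('id'), root.get('topic_id')]

print(missed)  # []
print(info)    # ['micro_b001', 'waste_separation']
```

When working with a parsed file instead of a string, `tree.getroot()` returns the same kind of element.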

score 0 · Accepted Answer

Another approach.

import os
from simplified_scrapy import SimplifiedDoc, utils

path = 'test'
# Collect the paths of all .xml files under path
myList = []
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith('.xml'):
            myList.append(os.path.join(root, file))

for file in myList:
    xml = utils.getFileContent(file)
    doc = SimplifiedDoc(xml)
    arg = [(e['stance'],e['id'],e['topic_id']) for e in doc.selects('arggraph')]
    print(arg)

Result:

[('pro', 'micro_b001', 'waste_separation')]
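
Since the question asks for the values to be saved in a list, the same directory walk can also be sketched with only the standard library. The function name `collect_arggraph_info` is hypothetical, and the sketch assumes every `.xml` file under `path` is well-formed:

```python
import os
import xml.etree.ElementTree as ET


def collect_arggraph_info(path):
    """Return a list of (id, topic_id) tuples, one per arggraph element found."""
    results = []
    for root_dir, _dirs, files in os.walk(path):
        for name in sorted(files):
            if not name.endswith('.xml'):
                continue
            tree = ET.parse(os.path.join(root_dir, name))
            # iter() also visits the root element, so this works
            # whether or not arggraph is the root of the file.
            for graph in tree.getroot().iter('arggraph'):
                results.append((graph.get('id'), graph.get('topic_id')))
    return results
```

Calling `collect_arggraph_info('test')` on a folder of such files would give one tuple per file; `graph.get(...)` returns `None` for a missing attribute rather than raising an error.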
