I want to extract information from several XML files like this one:

https://github.com/peldszus/arg-microtexts/blob/master/corpus/en/micro_b001.xml

I only want to extract the information in this tag:

<arggraph id="micro_b001" topic_id="waste_separation" stance="pro">

That is: "micro_b001" and "waste_separation".

I want to save them in a list.

I tried this:

import os
import xml.etree.ElementTree as ET

myList = []
myEdgesList = []
# `path` is the folder that contains the XML files
#read the whole text from 
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith('.xml'):
            with open(os.path.join(root, file), encoding="UTF-8") as content:
                tree = ET.parse(content)
                myList.append(tree)

The code above works; for each file it gives an object like this:

<xml.etree.ElementTree.ElementTree at 0x21c893e34c0>,

But this part does not look right:

for k in myList:
    arg= [e.attrib['stance'] for e in k.findall('.//arggraph')  ]
    print(arg)

The second snippet does not give me the values I need.

2 Answers

One way to handle this:

from lxml import etree
# the file name must be passed as a string
tree = etree.parse('myfile.xml')
for graph in tree.xpath('//arggraph'):
    print(graph.xpath('@id')[0])
    print(graph.xpath('@topic_id')[0])

Output:

micro_b001
waste_separation
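
If you would rather stay with the standard library, as in the question, note that arggraph appears to be the root element of the linked file. ElementTree's findall('.//arggraph') only searches below the root, which would explain the empty result. A minimal sketch under that assumption (the SAMPLE string and the graph_info helper are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for the linked file (assumption: arggraph is the root)
SAMPLE = (
    '<arggraph id="micro_b001" topic_id="waste_separation" stance="pro">'
    '<edu id="e1">text</edu>'
    '</arggraph>'
)

def graph_info(root):
    """Return (id, topic_id) read directly from the root element."""
    return root.get('id'), root.get('topic_id')

root = ET.fromstring(SAMPLE)
# The root itself is not among its own descendants, so nothing is found
print(root.findall('.//arggraph'))  # []
print(graph_info(root))             # ('micro_b001', 'waste_separation')
```

With a parsed file, tree.getroot() gives the same element that ET.fromstring returns here.
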
Answered 2020-10-21T22:03:19.397

Another approach:

import os
from simplified_scrapy import SimplifiedDoc, utils

path = 'test'
#read the whole text from 
myList = []
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith('.xml'):
            myList.append(os.path.join(root, file))

for file in myList:
    xml = utils.getFileContent(file)
    doc = SimplifiedDoc(xml)
    arg = [(e['stance'],e['id'],e['topic_id']) for e in doc.selects('arggraph')]
    print (arg)

Result:

[('pro', 'micro_b001', 'waste_separation')]
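
To end up with one list across all files, which is what the question asks for, the same attributes could also be collected without a third-party package. A rough sketch, assuming every file has arggraph as its root element; collect_graph_info is a hypothetical helper name, and the temporary folder exists only to keep the example self-contained:

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

def collect_graph_info(folder):
    """Return a list of (id, topic_id) tuples, one per .xml file under folder."""
    results = []
    for xml_path in sorted(Path(folder).rglob('*.xml')):
        root = ET.parse(xml_path).getroot()
        # Skip files whose root is not an arggraph element
        if root.tag == 'arggraph':
            results.append((root.get('id'), root.get('topic_id')))
    return results

# Demo with one throwaway file shaped like the corpus files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'micro_b001.xml').write_text(
        '<arggraph id="micro_b001" topic_id="waste_separation" stance="pro"/>',
        encoding='utf-8',
    )
    info = collect_graph_info(tmp)

print(info)  # [('micro_b001', 'waste_separation')]
```

In real use you would call collect_graph_info on your own corpus folder instead of the temporary one.
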
Answered 2020-10-26T01:37:41.530