
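I have an XML file that describes the directory structure of files I want to put into a (flattened) tar.gz archive.

How should I parse the XML to extract the path of each file?

Right now I'm using lxml and building the paths like this: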

paths = []
for case in root.iter('case'):
    for language in case.iter('language'):
        for result in language.iter('result'):
            for file in result.iter('file'):
                paths.append('/'.join([node.get('id') for node in [case, language, result, file]]))

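But this feels too hard-coded, and it breaks as soon as the structure changes.

I can find every file node with root.iter('file'), but how do I get all the parents (directories) of each node/file? Or should I do this in a (completely?) different way?

The XML looks like this: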

<?xml version="1.0" encoding="UTF-8"?>
<files batch="regular">
    <case id="case_10_some_description">
        <language id="english">
            <result id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
                <file id="screenshot_3.png"/>
                <file id="screenshot_4.png"/>
                <file id="screenshot_5.png"/>
                <file id="screenshot_6.png"/>
            </result>
        </language>
    </case>
    <case id="case_12_some_description">
        <language id="english">
            <result id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
                <file id="screenshot_3.png"/>
            </result>
        </language>
    </case>
</files>

These are the files:

regular/case_10_some_description/english/images/screenshot_1.png
regular/case_10_some_description/english/images/screenshot_2.png
regular/case_10_some_description/english/images/screenshot_3.png
regular/case_10_some_description/english/images/screenshot_4.png
regular/case_10_some_description/english/images/screenshot_5.png
regular/case_10_some_description/english/images/screenshot_6.png
regular/case_12_some_description/english/images/screenshot_1.png
regular/case_12_some_description/english/images/screenshot_2.png
regular/case_12_some_description/english/images/screenshot_3.png

2 Answers


Are you creating this file schema yourself? If you can change it, I definitely would. Try something like this:

<?xml version="1.0" encoding="UTF-8"?>
<Directory id="regular">
    <Directory id="case_10_some_description">
        <Directory id="english">
            <Directory id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
                <file id="screenshot_3.png"/>
                <file id="screenshot_4.png"/>
                <file id="screenshot_5.png"/>
                <file id="screenshot_6.png"/>
            </Directory>
        </Directory>
    </Directory>
    <Directory id="case_12_some_description">
        <Directory id="english">
            <Directory id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
                <file id="screenshot_3.png"/>
            </Directory>
        </Directory>
    </Directory>
</Directory>

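Always give tags the same name if they mean the same thing. Consider distinguishing elements by their attributes rather than by different tag names; that will make your parsing easier.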
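
To illustrate why the uniform layout is easier to parse, here is a minimal sketch of a recursive walk that handles any nesting depth. It assumes every element keeps its name in the id attribute, as in the schema above; the walk function, the SAMPLE string, and the notes.txt entry are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the uniform <Directory>/<file> layout suggested above.
# notes.txt is invented here to show that directories may differ in depth.
SAMPLE = """<Directory id="regular">
    <Directory id="case_10_some_description">
        <Directory id="english">
            <Directory id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
            </Directory>
        </Directory>
    </Directory>
    <Directory id="case_12_some_description">
        <file id="notes.txt"/>
    </Directory>
</Directory>"""


def walk(node, prefix=()):
    """Yield the slash-joined path of every <file> element at or below node."""
    here = prefix + (node.get('id'),)
    if node.tag == 'file':
        yield '/'.join(here)
    else:
        # Any non-file element is treated as a directory and descended into.
        for child in node:
            yield from walk(child, here)


paths = list(walk(ET.fromstring(SAMPLE)))
for path in paths:
    print(path)
```

Because the function never looks at a tag name other than file, adding or removing directory levels does not require any change to the code.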

answered 2013-09-04T09:15:22.447

import xml.etree.ElementTree as ET
tree = ET.parse('sample.xml')
root = tree.getroot()
for file in root.iter('file'):
    # Note: the prefix is hard-coded, so it is only correct for files under case_10.
    print('regular/case_10_some_description/english/images/' + file.attrib['id'])
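
If the schema cannot be changed, the prefix can be derived from each file's ancestors instead. The following is a rough sketch with the standard library only, assuming the root keeps its name in the batch attribute and every other element in id, as in the question's XML; file_paths and SAMPLE are names invented for this example. With lxml, which the question uses, elements also provide getparent() and iterancestors(), so the parent map would not be needed.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question, inlined to keep the sketch self-contained.
SAMPLE = """<files batch="regular">
    <case id="case_10_some_description">
        <language id="english">
            <result id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
            </result>
        </language>
    </case>
    <case id="case_12_some_description">
        <language id="english">
            <result id="images">
                <file id="screenshot_1.png"/>
            </result>
        </language>
    </case>
</files>"""


def file_paths(root):
    """Return one slash-joined path per <file>, built from its ancestors."""
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    paths = []
    for node in root.iter('file'):
        parts = []
        while node is not None:
            # The root carries its name in "batch"; every other element uses "id".
            parts.append(node.get('id', node.get('batch')))
            node = parents.get(node)
        paths.append('/'.join(reversed(parts)))
    return paths


paths = file_paths(ET.fromstring(SAMPLE))
print('\n'.join(paths))
```

The loop only relies on the file tag, so renaming or adding intermediate levels such as case or language leaves the code unchanged.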
answered 2013-09-04T10:30:41.600