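I have an XML file that describes the directory structure of the files I want to put into a (flattened) tar.gz archive.
How should I parse the XML to extract the path of each file?
At the moment I'm using lxml and building the paths like this: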
paths = []
for case in root.iter('case'):
    for language in case.iter('language'):
        for result in language.iter('result'):
            for file in result.iter('file'):
                paths.append('/'.join([node.get('id') for node in [case, language, result, file]]))
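But this feels too hard-coded, and it will break if the structure changes.
I can find every file node with root.iter('file'), but how do I get all the parents/directories of each node/file? Or should I approach this in a (completely?) different way?
The XML looks like this: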
<?xml version="1.0" encoding="UTF-8"?>
<files batch="regular">
    <case id="case_10_some_description">
        <language id="english">
            <result id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
                <file id="screenshot_3.png"/>
                <file id="screenshot_4.png"/>
                <file id="screenshot_5.png"/>
                <file id="screenshot_6.png"/>
            </result>
        </language>
    </case>
    <case id="case_12_some_description">
        <language id="english">
            <result id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
                <file id="screenshot_3.png"/>
            </result>
        </language>
    </case>
</files>
These are the files it describes:
regular/case_10_some_description/english/images/screenshot_1.png
regular/case_10_some_description/english/images/screenshot_2.png
regular/case_10_some_description/english/images/screenshot_3.png
regular/case_10_some_description/english/images/screenshot_4.png
regular/case_10_some_description/english/images/screenshot_5.png
regular/case_10_some_description/english/images/screenshot_6.png
regular/case_12_some_description/english/images/screenshot_1.png
regular/case_12_some_description/english/images/screenshot_2.png
regular/case_12_some_description/english/images/screenshot_3.png
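One direction that avoids naming each level might be a recursive walk that carries the path down the tree, so the nesting depth and the intermediate tag names no longer matter. The sketch below is only an illustration under a few assumptions: the helper name iter_file_paths is made up, the root is assumed to name itself through its batch attribute while every other node uses id, and the sample XML is a shortened copy of the one above. It uses the standard-library xml.etree.ElementTree so it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question (illustrative only).
XML = """<files batch="regular">
    <case id="case_10_some_description">
        <language id="english">
            <result id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
            </result>
        </language>
    </case>
</files>"""


def iter_file_paths(node, prefix=()):
    """Yield one '/'-joined path for every <file> element at or below node."""
    # Assumption: the root carries its name in 'batch', all other nodes in 'id'.
    name = node.get('id') or node.get('batch')
    parts = prefix + (name,) if name else prefix
    if node.tag == 'file':
        yield '/'.join(parts)
        return
    # Recurse into the children, passing the path collected so far.
    for child in node:
        yield from iter_file_paths(child, parts)


root = ET.fromstring(XML)
paths = list(iter_file_paths(root))
for path in paths:
    print(path)
```

With lxml specifically, elements also expose getparent() and iterancestors(), so the same paths could instead be built by starting from root.iter('file'), collecting each file's ancestors, and reversing them. The recursive version is shown here only because it does not depend on parent pointers.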