python - iterparse fails to parse a field, while other similar ones are fine

Question

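I am using Python's iterparse to parse the XML results of a nessus scan (a .nessus file). Parsing fails on unexpected records, even though similar records have been parsed correctly.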

The general structure of the XML file is a large number of records like the following:

<ReportHost>
  <ReportItem>
    <foo>9.3</foo>
    <bar>hello</bar>
  </ReportItem>
  <ReportItem>
    <foo>10.0</foo>
    <bar>world</bar>
  </ReportItem>
</ReportHost>
<ReportHost>
   ...
</ReportHost>

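In other words, there are many hosts (ReportHost), each with many items to report (ReportItem), and each item has several characteristics (foo, bar). I want to generate one line per item, with its characteristics.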

Parsing fails on a line in the middle of the file (foo is cvss_base_score in this case):

<cvss_base_score>9.3</cvss_base_score>

while about 200 similar lines have been parsed without problems.

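The relevant piece of code is below. It sets context markers (inReportHost and inReportItem, which tell me where in the structure of the XML file I am) and assigns or prints a value depending on the context: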

import xml.etree.cElementTree as ET
inReportHost = False
inReportItem = False

for event, elem in ET.iterparse("test2.nessus", events=("start", "end")):
    if event == 'start' and elem.tag == "ReportHost":
        inReportHost = True
    if event == 'end' and elem.tag == "ReportHost":
        inReportHost = False
        elem.clear()
    if inReportHost:
        if event == 'start' and elem.tag == 'ReportItem':
            inReportItem = True
            cvss = ''
        if event == 'start' and inReportItem:
            if event == 'start' and elem.tag == 'cvss_base_score':
                cvss = elem.text
        if event == 'end' and elem.tag == 'ReportItem':
            print cvss
            inReportItem = False

cvss sometimes has the value None (after the cvss = elem.text assignment), even though identical entries were parsed correctly earlier in the file.

If I add something like the following below the assignment

if cvss is None: cvss = "0"

then many of the subsequent cvss assignments get their proper value (while some others are still None).

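When I take the <ReportHost>...</ReportHost> block that causes the wrong parsing and run it through the program on its own, it works fine (that is, cvss is assigned 9.3 as expected).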

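I am lost as to where the error in my code is: there are many similar records, and some are processed correctly while others are not (some records are identical yet still handled differently). I also cannot find anything special about the failing records; identical records earlier and later in the file are fine.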

score 4 · Accepted Answer

From the iterparse() documentation:

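Note: iterparse() only guarantees that it has seen the ">" character of a starting tag when it emits a "start" event, so the attributes are defined, but the contents of the text and tail attributes are undefined at that point. The same applies to the element children; they may or may not be present. If you need a fully populated element, look for "end" events instead.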

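Drop the inReport* variables and process ReportHost only on "end" events, when it has been fully parsed. Use the ElementTree API to get the necessary information, such as cvss_base_score, from the current ReportHost element.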
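
As a minimal sketch of that advice (not the asker's full program), the loop below listens only for "end" events and reads cvss_base_score once its enclosing ReportItem is complete. The SAMPLE document, the Report wrapper element, and the cvss_scores helper are assumptions made up for illustration; a real .nessus file contains more structure.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a .nessus file, kept tiny for illustration.
SAMPLE = b"""<Report>
  <ReportHost name="host-a">
    <ReportItem><cvss_base_score>9.3</cvss_base_score></ReportItem>
    <ReportItem><cvss_base_score>10.0</cvss_base_score></ReportItem>
  </ReportHost>
</Report>"""

def cvss_scores(source):
    """Collect one score per ReportItem, reading text only on 'end' events."""
    scores = []
    for event, elem in ET.iterparse(source, events=("end",)):
        # On 'end' the element and all of its children are fully populated.
        if elem.tag == "ReportItem":
            # findtext returns the default when the child element is missing.
            scores.append(elem.findtext("cvss_base_score", default=""))
    return scores

print(cvss_scores(io.BytesIO(SAMPLE)))
```

Because the text is read after the closing tag has been seen, the result no longer depends on where the parser's input buffer happens to be split, which is likely why identical records behaved differently in the original loop.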

To preserve memory, do the following:

import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9

def getelements(filename_or_file, tag):
    context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context) # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear() # preserve memory

for host in getelements("test2.nessus", "ReportHost"):
    for cvss_el in host.iter("cvss_base_score"):
        print(cvss_el.text)
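
The question asks for one line per ReportItem, so the same pattern could be extended roughly as follows. This is a sketch under assumptions: the report_rows name, the risk_factor field, the name attribute on ReportHost, and the SAMPLE document are illustrative choices, not something confirmed by the original post.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical stand-in for a .nessus file, kept tiny for illustration.
SAMPLE = b"""<Report>
  <ReportHost name="host-a">
    <ReportItem><cvss_base_score>9.3</cvss_base_score><risk_factor>High</risk_factor></ReportItem>
    <ReportItem><risk_factor>None</risk_factor></ReportItem>
  </ReportHost>
  <ReportHost name="host-b">
    <ReportItem><cvss_base_score>10.0</cvss_base_score><risk_factor>Critical</risk_factor></ReportItem>
  </ReportHost>
</Report>"""

def report_rows(source, fields=("cvss_base_score", "risk_factor")):
    """Yield one tuple per ReportItem: (host name, field values...)."""
    context = iter(etree.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "ReportHost":
            host = elem.get("name", "")  # assumed attribute; empty if absent
            for item in elem.iter("ReportItem"):
                # Missing fields become empty strings instead of None.
                yield (host,) + tuple(item.findtext(f, default="") for f in fields)
            root.clear()  # drop the finished host to keep memory bounded

for row in report_rows(io.BytesIO(SAMPLE)):
    print("\t".join(row))
```

Each host is processed only after its closing tag, so every ReportItem inside it is complete, and clearing the root after each host keeps memory use roughly proportional to the largest single host.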
