python - Avoiding loops when testing elements in lxml

Question

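I am using lxml to process some tables. The original source files are in MHTML format (they are Excel files). I need to find the rows that contain header cells, i.e. 'th' elements. I want to use the header elements, but I also need the rows they come from so that I can be sure I process everything in order.

So what I have been doing is finding all the th elements and then getting the row from each one with e.getparent() (since a th is a child of a row). But I end up having to pull out the th elements twice: once to find them and get the rows, and then again to take them out of the rows to parse the data I am looking for. This cannot be the best way to do it, so I am wondering whether I am missing something.

Here is my code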

from lxml import html
theString=unicode(open('c:\\secexcel\\1314054-R20110331-C20101231-F60-SEQ132.xls').read(),'UTF-8','replace')
theTree=html.fromstring(theString)
tables=[e for e in theTree.iter() if e.tag=='table']
for table in tables :
    headerCells=[e for e in table.iter() if e.tag=='th']
    headerRows=[]
    for headerCell in headerCells:
        if headerCell.getparent().tag=='tr':
            if headerCell.getparent() not in headerRows:
                headerRows.append(headerCell.getparent())
    for headerRow in headerRows:
        newHeaderCells=[e for e in headerRow.iter() if e.tag=='th']
        #Now I will extract some data and attributes from the th elements

score 1 · Accepted Answer

Iterate over all the tr tags, and simply move on to the next one when you find no th inside.

Edit. Here is how:

from lxml import html
theString=unicode(open('c:\\secexcel\\1314054-R20110331-C20101231-F60-SEQ132.xls').read(),'UTF-8','replace')
theTree=html.fromstring(theString)
for table in theTree.iter('table'):
    for row in table.findall('tr'):
        headerCells = list(row.findall('th'))
        if headerCells:
            # extract data from row and headerCells
            pass
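The row-first traversal above can be tried on a small inline table. This is only a sketch under a few assumptions: it targets Python 3, the sample markup and the helper name header_rows are made up for illustration, and the import falls back to the standard library's ElementTree (which offers the same iter and findall calls) when lxml is not installed. Because the sample is well-formed it is parsed as XML here; for the real MHTML-derived files you would presumably keep html.fromstring as in the code above. It uses iter('tr') rather than findall('tr') so that rows wrapped in thead or tbody are also visited.

```python
# Prefer lxml when it is installed; the calls used below also exist in the stdlib.
try:
    from lxml import etree as ET
except ImportError:
    import xml.etree.ElementTree as ET

# Hypothetical sample table, not taken from the original .xls file.
SAMPLE = """
<html><body><table>
  <thead><tr><th>Year</th><th>Revenue</th></tr></thead>
  <tbody>
    <tr><td>2010</td><td>100</td></tr>
    <tr><th>Total</th><td>100</td></tr>
  </tbody>
</table></body></html>
"""


def header_rows(root):
    """Yield (row, header_cells) for every row that has at least one
    direct <th> child, in document order."""
    for table in root.iter('table'):
        # iter() descends through <thead>/<tbody>; findall('tr') would not.
        for row in table.iter('tr'):
            cells = row.findall('th')
            if cells:  # rows without header cells are skipped
                yield row, cells


root = ET.fromstring(SAMPLE.strip())
result = [[th.text for th in cells] for _, cells in header_rows(root)]
print(result)
```

Each th element is fetched only once, together with the row it belongs to, so no getparent() call is needed. Note that with nested tables this sketch would visit the inner rows more than once; it assumes the tables are not nested.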

score 1 · Accepted Answer

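To avoid doing it twice, you could use a dictionary keyed by the row element and accumulate all the header cells of a given row into an associated list, which can be done in a single pass over the table's elements. To keep the rows in order of appearance, you could use an OrderedDict from the built-in collections module. That would allow writing something like the following: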

from lxml import html
from collections import OrderedDict
f='c:\\secexcel\\1314054-R20110331-C20101231-F60-SEQ132.xls'
theString=unicode(open(f).read(),'UTF-8','replace')
theTree=html.fromstring(theString)
tables=[e for e in theTree.iter() if e.tag=='table']
for table in tables:
    headerRowDict=OrderedDict()
    for e in table.iter():
        if e.tag=='th':
            headerRowDict.setdefault(e.getparent(), []).append(e)
    for headerRow, headerRowCells in headerRowDict.items():
        for headerRowCell in headerRowCells:
            # extract data and attributes from the <th> element from the row...
            pass
