python - BeautifulSoup table processing

Question

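I'm in a situation where I have to do something a bit hacky. The incoming data is not under my control, so the solution isn't "just store it more sensibly", much as I'd like it to be.

What I'm getting looks like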

<table>
  <tr>
    <td>Key 1</td>
    <td>Key 2</td>
    <td>Key 3</td>
    ...
  </tr>
  <tr>
    <td>Val 1</td>
    <td>Val 2</td>
    <td>Val 3</td>
    ...
  </tr>
  ...
</table>

What I want is to pick certain key/value pairs out of certain tables. So, something like

{ 'Key 4': 'Val 4', 'Key 32': 'Val 32' ... }

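I know the appropriate keys ahead of time, but I don't necessarily know their position, or which pairs of trs represent k/v pairs (tables are used for positioning as well as for data representation. No, I don't know why.), so the simplest solution seems to be "get me the contents of the nth cell of the next row, where n is this cell's index".

What I have is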

def findField(soup, fieldName):
    kTd = soup.find(text=fieldName).parent
    ix = len(kTd.findPreviousSiblings('td'))
    valTd = kTd.parent.findNext('tr').findAll('td')[ix]
    return (kTd, valTd)

def fieldsToDict(soup, fieldNames):
    return dict([findField(soup, k) for k in fieldNames])

fieldsToDict(soup, ['Key 4', 'Key 32' ..])

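But it seems like there must be a more elegant and/or more efficient way of expressing this.

Any ideas?

EDIT: I'll be more specific, though I may be overthinking this, and the question should probably be on codereview.se rather than SO. There are two specific things I'd like to hear from someone with more Python/BeautifulSoup experience than I have.

First,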

...
    ix = len(kTd.findPreviousSiblings('td'))
...

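looks like it could get relatively expensive for larger rows. It seems like the information I'm after could be captured during the initial parse of the HTML. Is there a built-in method/slot along the lines of indexAmongPeers?

Second,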

...
    return dict([findField(soup, k) for k in fieldNames])

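In that line, it seems dict has to make another traversal over the list that comes out of the comprehension. Is that optimized in this situation? Is there a way to do it in a single pass?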

score 0 · Accepted Answer

I think the lookup is a little bit hard to follow - I'd go for the following:

rows = [tr.strings for tr in soup('tr')]
lookup = {k:v for k,v in zip(*rows) if k in {'Key 1', 'Key 2'}}
# {u'Key 1': u'Val 1', u'Key 2': u'Val 2'}
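
The snippet above unpacks `zip(*rows)` into `k, v`, so it only works when the table has exactly two rows. If a table holds several key/value row pairs, a rough sketch along the following lines might do. The helper names `rows_from_soup` and `pick_fields` are made up for illustration; the sketch assumes `soup` is an already parsed BeautifulSoup 4 object, that a value always sits directly below its key, and that the first match for a key is the one you want. `enumerate` yields each cell's column index as the row is walked, so no sibling counting is needed, and the dict is filled in as matches are found rather than from an intermediate list.

```python
def rows_from_soup(soup):
    """Turn every <tr> into a list of its cells' stripped text.

    Assumes `soup` is a bs4 BeautifulSoup/Tag object. recursive=False keeps
    only the <td> elements that are direct children of the row, so cells of
    nested layout tables are not mixed into the outer row.
    """
    return [
        [td.get_text(strip=True) for td in tr.find_all("td", recursive=False)]
        for tr in soup.find_all("tr")
    ]


def pick_fields(rows, wanted):
    """Map each wanted key to the cell directly below it.

    `rows` is a list of rows, each a list of cell strings. Every row is
    treated as a candidate key row for the row that follows it.
    """
    wanted = set(wanted)
    found = {}
    for key_row, val_row in zip(rows, rows[1:]):
        # enumerate gives the column index while the row is being scanned
        for ix, key in enumerate(key_row):
            if key in wanted and key not in found and ix < len(val_row):
                found[key] = val_row[ix]
    return found
```

With a parsed document this would be called as `pick_fields(rows_from_soup(soup), ['Key 4', 'Key 32'])`. Keys that never appear are simply absent from the result, so the caller can compare the returned keys against the requested ones if a missing field should be an error.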
