python - Parsing tags with BeautifulSoup, unable to extract values

Question

I have some HTML that looks like this:

<tr>
  <td>some text</td>
  <td>some other text</td>
  <td>some <b>problematic</b> other <br /> text</td>
</tr>

and some Python that tries to get the tag values and print each inner value:

soup = BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES)
for row in soup.findAll('tr'):
    print repr(row) # this prints the whole 'tr' element text just fine.
    for col in row.contents:
        print col.string

So the full row prints the captured HTML correctly, but 'col' prints None for the last element:

some text
some other text
None

I'm not familiar with BeautifulSoup or Python, but it seems that the inner tags in the last element are causing the parsing problem?

Thanks

score 0 · Accepted Answer

You can upgrade to BeautifulSoup version 4 and use .stripped_strings:

from bs4 import BeautifulSoup  # BeautifulSoup 4 is imported from the bs4 package

soup = BeautifulSoup(data)
for row in soup.find_all('tr'):
    print '\n'.join(row.stripped_strings)

In BeautifulSoup 3, you need to search for all contained text:

for row in soup.findAll('tr'):
    # text=True matches every text node under the row; skip whitespace-only ones
    print '\n'.join(el.strip() for el in row.findAll(text=True) if el.strip())
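
For reference, the None in the question comes from how .string works: it only returns a value when a tag has exactly one child, so the last cell, which mixes text with <b> and <br /> children, has no single string to return. The following is a minimal sketch of that behaviour and of a per-cell fix. It assumes Python 3 with the bs4 package installed and the standard-library "html.parser" backend; the names `strings` and `texts` are illustrative and not from the original post.

```python
from bs4 import BeautifulSoup

# Same markup as in the question.
data = """
<tr>
  <td>some text</td>
  <td>some other text</td>
  <td>some <b>problematic</b> other <br /> text</td>
</tr>
"""

# Assumption: the built-in "html.parser" backend; other parsers may treat a bare <tr> differently.
soup = BeautifulSoup(data, "html.parser")
row = soup.find("tr")
cells = row.find_all("td")

# .string is None for any cell that has more than one child node.
strings = [cell.string for cell in cells]

# Joining stripped_strings collects every text fragment in the cell, nested tags included.
texts = [" ".join(cell.stripped_strings) for cell in cells]

for text in texts:
    print(text)
```

Iterating over the td cells instead of row.contents also skips the whitespace-only text nodes that sit between the cells.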
