python - BeautifulSoup：跳过 html 元素

Question

我有以下 html 结构：这只是其中的一部分，但我认为这个片段足以解释我的问题。

<tr>
<td> Color Digest </td>
<td> AgAkAZwCJgMZ </td>
</tr>
<tr>
<td> Color Digest </td>
<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>

我有以下代码来提取“Color Digest”标签的下一个兄弟

for td in soupPage.html.findAll('td'):
    if td.text == 'Color Digest':
        if td.nextSibling.text != " ":
            a = set()
            a = "[" + td.nextSibling.text.strip(",") + "]"
            print a

但我想跳过<td> AgAkAZwCJgMZ </td> 并获得价值<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>

我可以遵循的最佳beautifulsoup 机制是什么？

score 0 · Accepted Answer

更多的 html（例如整个表）将有助于编写更健壮的东西。

如果您知道要排除的字符串。使用 bs4 4.7.1

from bs4 import BeautifulSoup as bs

html = '''
<tr>
<td> Color Digest </td>
<td> AgAkAZwCJgMZ </td>
</tr>
<tr>
<td> Color Digest </td>
<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
'''
soup = bs(html, 'lxml')
elems = [item.text for item in soup.select('td:contains("Color Digest") + td:not(:contains("AgAkAZwCJgMZ"))')]
print(elems)

如果您不使用返回列表上的索引

elems = [item.text for item in soup.select('td:contains("Color Digest") + td')][1]

score 0 · Accepted Answer

您可以通过以下方式实现：

import re
from bs4 import BeautifulSoup

html = """
<tr>
<td> Color Digest </td>
<td> AgAkAZwCJgMZ </td>
</tr>
<tr>
<td> Color Digest </td>
<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
"""

soup = BeautifulSoup(html)
output = soup.find_all('td', text = re.compile(" Color Digest "))[1].find_next('td').text


print(output)

python - BeautifulSoup：跳过 html 元素

2 回答 2

Related

Reference