python - How does join work in Python BeautifulSoup?

Question

I am learning Python and BeautifulSoup, and I came across this code online:

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile(r'Table\s+1',re.IGNORECASE)
foundtext = soup.find('p',text=searchtext) # Find the first <p> tag with the search text
table = foundtext.findNext('table') # Find the first <table> tag that follows it
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
    print

Everything else is clear, but I don't understand how join works here.

    text = ''.join(td.find(text=True))

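I tried searching the BS documentation for join, but I couldn't find anything, and I couldn't find any help online on how join is used in BS either.

Please let me know how that line works. Thanks!

PS: The code above comes from another Stack Overflow question, "How can I find a table after a text string using BeautifulSoup in Python?"; it is not my homework :)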

score 6 · Accepted Answer

''.join() is a Python string method, not anything specific to BS. It lets you join a sequence of strings, using the string it is called on as the separator:

>>> '-'.join(map(str, range(3)))
'0-1-2'
>>> ' and '.join(('bangers', 'mash'))
'bangers and mash'

'' is just an empty string, which makes it easy to concatenate a whole set of strings into one big string:

>>> ''.join(('5', '4', 'apple', 'pie'))
'54applepie'
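
As a minimal sketch of the same idea (written for Python 3, unlike the Python 2 code in the question; the variable names are illustrative only): the separator is whatever string join is called on, the argument can be any iterable of strings, and non-string items have to be converted first.

```python
# str.join is called on the separator and takes an iterable of strings.
words = ['5', '4', 'apple', 'pie']

no_separator = ''.join(words)        # empty separator: plain concatenation
with_separator = ', '.join(words)    # separator goes between items only

# Items must already be strings; joining ints raises TypeError.
try:
    '-'.join([0, 1, 2])
    join_failed = False
except TypeError:
    join_failed = True

# Converting each item first works, as in the map(str, ...) example above.
converted = '-'.join(str(n) for n in range(3))

print(no_separator)
print(with_separator)
print(join_failed)
print(converted)
```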

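In the specific case of your example, the statement finds all the text contained in the <td> element, including text inside any nested HTML elements such as <b>, <i>, or <a href="">, and puts it all together in one long string. So td.find(text=True) finds a sequence of Python strings, and ''.join() then joins them into one long string.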

score 0 · Accepted Answer

join is not part of BeautifulSoup; it is a built-in Python string method. It joins a sequence of elements with the given string; for example, '+'.join(['a', 'b', 'c']) is 'a+b+c'. See the documentation.

score 0 · Accepted Answer

The code is incorrect. This line:

text = ''.join(td.find(text=True))

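uses find, which returns only the first string inside the td tag, and then tries to use join on it. It appears to work because ''.join() just iterates over that first string character by character, creating a copy of it.

So for this: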

<td>foo<b>bar</b></td>

it just runs ''.join("foo").
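
Why this still appears to work can be checked without BeautifulSoup at all. A minimal sketch in Python 3, assuming the value returned by find behaves like an ordinary string (the name first_string is just a stand-in for it): a string is itself an iterable of one-character strings, so joining it with an empty separator rebuilds the same text.

```python
first_string = "foo"   # stand-in for the single string that find() returns

# Iterating a string yields its characters one at a time.
chars = list(first_string)

# With an empty separator the characters are glued back together unchanged.
rejoined = ''.join(first_string)

# A visible separator shows that join really did walk over the characters.
dashed = '-'.join(first_string)

print(chars)
print(rejoined)
print(dashed)
```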

Instead, use the td.text attribute. It automatically finds all the strings inside the td and joins them.

text = td.text
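
The difference can be seen side by side. This is a sketch only: it assumes the newer bs4 package (BeautifulSoup 4) on Python 3 rather than the BeautifulSoup 3 import used in the question, so the method names (find_all, get_text, the string= argument) are the bs4 spellings, and the parser choice 'html.parser' is an assumption as well.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<td>foo<b>bar</b></td>', 'html.parser')
td = soup.find('td')

# find() stops at the first matching string, so only "foo" is seen.
first_only = ''.join(td.find(string=True))

# find_all() returns every string inside the tag, in document order.
all_strings = ''.join(td.find_all(string=True))

# get_text() (also available as td.text) does the collecting and joining itself.
via_get_text = td.get_text()

print(first_only)
print(all_strings)
print(via_get_text)
```

If the goal is the full cell text, the last two forms give the same result here; the first silently drops everything after the first string.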
