python - Python BeautifulSoup: iterating over a table

Question

I'm trying to scrape table data into a CSV file. Unfortunately, I've hit a roadblock: the code below just repeats the TDs from the first TR for every subsequent TR.

import urllib.request
from bs4 import BeautifulSoup

f = open('out.txt','w')

url = "http://www.international.gc.ca/about-a_propos/atip-aiprp/reports-rapports/2012/02-atip_aiprp.aspx"
page = urllib.request.urlopen(url)

soup = BeautifulSoup(page)

soup.unicode

table1 = soup.find("table", border=1)
table2 = soup.find('tbody')
table3 = soup.find_all('tr')

for td in table3:
    rn = soup.find_all("td")[0].get_text()
    sr = soup.find_all("td")[1].get_text()
    d = soup.find_all("td")[2].get_text()
    n = soup.find_all("td")[3].get_text()

    print(rn + "," + sr + "," + d + ",", file=f)

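This is my first Python script, so any help would be appreciated! I've looked at the answers to other questions but can't figure out what I'm doing wrong here.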

score 56 · Accepted Answer

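Every time you call soup.find() or soup.find_all(), you're starting from the top level of the document. So when you ask for, say, all the "td" tags, you get every "td" tag in the document, not just those in the table and row you searched for. As your code is written, you might as well not search for the table and rows at all, because the results are never used.

I think you want to do something like this: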

table1 = soup.find("table", border=1)
table2 = table1.find('tbody')
table3 = table2.find_all('tr')

Or, you know, something more like this, with more descriptive variable names to boot:

rows = soup.find("table", border=1).find("tbody").find_all("tr")

for row in rows:
    cells = row.find_all("td")
    rn = cells[0].get_text()
    # and so on
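
To see the difference in isolation, here is a minimal sketch run against a made-up two-row table rather than the real page. The HTML and the cell values are assumptions for illustration only, and the parser is named explicitly as "html.parser" so the result doesn't depend on which parsers are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table standing in for the real page.
html = """
<table>
  <tr><td>A-1</td><td>first summary</td></tr>
  <tr><td>A-2</td><td>second summary</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")

# Searching from soup ignores the current row: index 0 is always the
# first cell of the whole document, so every iteration gives the same text.
from_document = [soup.find_all("td")[0].get_text() for row in rows]

# Searching from row only sees the cells inside that row.
from_row = [row.find_all("td")[0].get_text() for row in rows]

print(from_document)  # ['A-1', 'A-1']
print(from_row)       # ['A-1', 'A-2']
```

The first list is the symptom from the question (the first row repeated); the second is what scoping the search to each row gives you.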

score 10 · Accepted Answer

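The problem is that every time you try to narrow down your search (get the first td in this tr, and so on), you're just calling back to soup. soup is the top-level object: it represents the entire document. You only need to call soup once, then use the result in place of soup for the next step.

For example (with variable names changed to be clearer),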

table = soup.find('table', border=1)
rows = table.find_all('tr')

for row in rows:
    data = row.find_all("td")
    rn = data[0].get_text()
    sr = data[1].get_text()
    d = data[2].get_text()
    n = data[3].get_text()

    print(rn + "," + sr + "," + d + ",", file=f)

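I'm not sure the print statement is the best way to do what you're trying to do here (at the very least, you should use string formatting rather than addition), but I'm leaving it as is because it's not the core issue.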
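
If you do want to tidy up the output step, one option (my substitution, not something the question asked for) is the standard library's csv module, which also takes care of quoting cells that contain commas. A rough sketch, where the column names and sample records are made up and stand in for the text pulled out of each row:

```python
import csv
import io

# Hypothetical records, standing in for the cell text of each <tr>.
records = [
    ("R-1", "Summary with, a comma", "All disclosed", "12"),
    ("R-2", "Plain summary", "Disclosed in part", "3"),
]

# An in-memory buffer keeps the sketch self-contained; a real script
# would pass an open file object to csv.writer instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["request_number", "summary", "disposition", "pages"])  # assumed headers
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file, open it with newline='' (for example open('out.csv', 'w', newline='')) so the csv module controls the line endings itself.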

Also, for completeness: soup.unicode won't do anything. You're not calling a method there, and there's no assignment. I don't remember BeautifulSoup having a method named unicode in the first place, but I'm used to BS 3.0, so it may be new in 4.
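
For what it's worth, in BeautifulSoup 4 attribute access on a soup or tag is shorthand for find(), so soup.unicode just looks for a <unicode> tag and throws the result away. A small sketch with a made-up one-line document:

```python
from bs4 import BeautifulSoup

# Hypothetical one-paragraph document, just for illustration.
soup = BeautifulSoup("<p>hello</p>", "html.parser")

# soup.p is shorthand for soup.find("p").
first_p = soup.p

# There is no <unicode> tag here, so this is soup.find("unicode"): None.
missing = soup.unicode

print(first_p.get_text())  # hello
print(missing)             # None
```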
