python - How to sort an HTML table using Python and BeautifulSoup

Question

I need to sort this HTML page, http://gnats.netbsd.org/summary/year/2012-perf.html, and list the most significant issues from the big table. Here is my code in Python. I would be very grateful for any advice.

import urllib.request
from bs4 import BeautifulSoup

# overall input
inputpage = urllib.request.urlopen("http://gnats.netbsd.org/summary/year/2012-perf.html")
page = inputpage.read()
soup = BeautifulSoup(page)

# checking tables
table = soup.findAll('table')
rows = soup.findAll('tr')
colomns = soup.findAll('td')

# inputing the lists
name = []
first = []
second = []
sum = []

# the main part
for tr in rows:
    if (tr==1):
        element = tr.split("<td>")
        name.append(element)
    elif (tr==2):
        element = tr.split("<td>")
        first.append(element)
    elif (tr==3):
        element = tr.split("<td>")
        second.append(element)


# combining the open and closed issue lists
length = len(first)
for i in range(length):
    sum = first[i] + second [i]

# printing the lists
length = len(sum)
for i in range(length):
    print (name[i] + '|' + sum[i])

score 0 · Accepted Answer

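BeautifulSoup has some nice methods for accessing child nodes and so on. For example, you can use tables = soup.findAll('table'). Assuming you want to combine the data from the second table on the page you linked (tables[1]), you can do the following: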

names = []
cdict = {0: [], 1: []}  # maps each <td> position to the list of its cell texts

tables = soup.findAll('table')
for tt in tables[1].find_all('tr')[1:]:  # skip the first <tr> since it is the header
    # the 1st column is a <th> holding the name; keep its text rather than the Tag
    names.append(tt.find_all('th')[0].get_text(strip=True))
    for k, v in cdict.items():
        # append the text of the k-th <td> to the corresponding list
        v.append(tt.find_all('td')[k].text)

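So you end up with a dictionary of column -> list, where each list contains the td text elements. The main reason for using a dictionary is that you might want to take columns 1, 2 and 4 instead, in which case you only need to change the keys in cdict.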
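
To make the "only change cdict" point concrete, here is a minimal sketch that builds the dictionary from a list of wanted column indices. The inline HTML and the helper name collect_columns are assumptions made up for illustration; the real page's layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: a header row, then rows with
# one <th> name cell followed by <td> count cells.
SAMPLE_HTML = """
<table>
  <tr><th>Category</th><th>Open</th><th>Closed</th><th>Other</th></tr>
  <tr><th>kern</th><td>12</td><td>30</td><td>1</td></tr>
  <tr><th>bin</th><td>4</td><td>9</td><td>0</td></tr>
</table>
"""

def collect_columns(table, wanted):
    """Return (names, {td index: [cell text, ...]}) for the wanted <td> columns."""
    names = []
    columns = {k: [] for k in wanted}
    for row in table.find_all("tr")[1:]:  # skip the header row
        names.append(row.find("th").get_text(strip=True))
        cells = row.find_all("td")
        for k in wanted:
            columns[k].append(cells[k].get_text(strip=True))
    return names, columns

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# Picking other columns only means changing this list.
names, columns = collect_columns(soup.find("table"), wanted=[0, 2])
print(names)
print(columns)
```

The indices count <td> cells only, so index 0 is the first cell after the <th> name.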

To compute the sum, you can do the following:

# Python 3: range() and print() replace Python 2's xrange and print statement
for i in range(len(names)):
    print(names[i], int(cdict[0][i]) + int(cdict[1][i]))
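
Since the goal in the question is to list the most significant issues, the sums can also be sorted before printing. This is only a sketch: the sample values are invented, and it assumes names holds plain strings and cdict holds numeric strings in the same shape as above.

```python
# Hypothetical sample data in the same shape as names and cdict.
names = ["kern", "bin", "port-amd64"]
cdict = {0: ["12", "4", "7"], 1: ["30", "9", "41"]}

# Pair each name with the integer sum of its two columns.
totals = [(name, int(a) + int(b)) for name, a, b in zip(names, cdict[0], cdict[1])]

# Sort by the total, largest first.
ranked = sorted(totals, key=lambda pair: pair[1], reverse=True)

for name, total in ranked:
    print(name + " | " + str(total))
```

Using sorted with a key function leaves the original lists untouched, so the unsorted data is still available afterwards.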

If you look at the methods available on each element, you'll find some very nice functionality that you can use to simplify your task.
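
As a rough illustration of the kind of element methods meant here, the sketch below uses a few that exist on BeautifulSoup Tag objects: find, find_all, get_text and get. The one-row HTML snippet is made up for the example and is not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical single table row, just to have something to query.
html = '<tr class="row"><th><a href="/cat/kern">kern</a></th><td> 12 </td><td>30</td></tr>'
soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")

link = row.find("a")                  # first matching descendant, or None
name = link.get_text(strip=True)      # text content with whitespace trimmed
href = link.get("href")               # attribute lookup; None if missing
counts = [int(td.get_text(strip=True)) for td in row.find_all("td")]
row_class = row["class"]              # class is multi-valued, so this is a list

print(name, href, counts, row_class)
```

Calling dir() on a Tag, or reading the BeautifulSoup documentation, shows the full set of such methods.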
