python - Extracting selected columns from a table using BeautifulSoup

Question

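I am trying to extract the first and third columns of this data table using BeautifulSoup. Looking at the HTML, the first column uses <th> tags, and the other column of interest uses <td> tags. In any case, all I have been able to get is a list of the column's cells with the tags still attached. However, I only want the text.

table is already a list, so I can't call findAll(text=True) on it. I'm not sure how to get the first column as a list in some other form.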

from BeautifulSoup import BeautifulSoup
from sys import argv
import re

filename = argv[1] #get HTML file as a string
html_doc = ''.join(open(filename,'r').readlines())
soup = BeautifulSoup(html_doc)
table = soup.findAll('table')[0].tbody.th.findAll('th') #The relevant table is the first one

print table

score 38 · Accepted Answer

You can try this code:

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm"
soup = BeautifulSoup(urllib2.urlopen(url).read())

for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print first_column, third_column

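As you can see, the code simply connects to the URL and fetches the HTML. BeautifulSoup then finds the first table, iterates over all of its 'tr' rows, and selects the first column, which is a 'th', and the third column, which is a 'td'.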
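
For readers on Python 3 with the bs4 package, the same row-by-row approach could be sketched as below. Note that .contents returns a list of child nodes rather than a plain string, so this sketch uses get_text(strip=True) to get only the text. The sample HTML, its values, and the function name extract_columns are assumptions made for illustration; they are not taken from the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the real page; the values are made up.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>Alabama</th><td>10</td><td>20</td><td>30</td></tr>
    <tr><th>Alaska</th><td>11</td><td>21</td><td>31</td></tr>
  </tbody>
</table>
"""


def extract_columns(html):
    """Return (<th> text, third <td> text) for each row of the first table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = []
    for tr in table.find_all("tr"):
        header = tr.find("th")
        cells = tr.find_all("td")
        # Skip rows that lack a <th> or have fewer than three <td> cells.
        if header is None or len(cells) < 3:
            continue
        rows.append((header.get_text(strip=True), cells[2].get_text(strip=True)))
    return rows


result = extract_columns(SAMPLE_HTML)
print(result)
```

Skipping short rows is a design choice for this sketch: it avoids an IndexError on header or separator rows, at the cost of silently dropping them.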

score 3 · Accepted Answer

In addition to @jonhkr's answer, I thought I'd post an alternative solution I came up with.

 #!/usr/bin/python

 from BeautifulSoup import BeautifulSoup
 from sys import argv

 filename = argv[1]
 #get HTML file as a string
 html_doc = ''.join(open(filename,'r').readlines())
 soup = BeautifulSoup(html_doc)
 table = soup.findAll('table')[0].tbody

 data = map(lambda x: (x.findAll(text=True)[1],x.findAll(text=True)[5]),table.findAll('tr'))
 print data

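Unlike jonhkr's answer, which fetches the web page over the network, mine assumes you have saved the page on your computer and pass its filename as a command-line argument. For example: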

python file.py table.html
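
One thing to keep in mind with positional indexes such as [1] and [5]: the list returned by findAll(text=True) also counts whitespace-only text nodes between tags, so the right indexes depend on how the HTML is laid out. Here is a minimal sketch, assuming Python 3 with bs4 and a made-up row (not the real page's markup), that compares the raw text nodes with stripped_strings:

```python
from bs4 import BeautifulSoup

# Hypothetical row with a newline between cells; not the real page's markup.
SAMPLE_ROW = "<table><tr>\n<th>Alabama</th>\n<td>1</td>\n<td>2</td>\n</tr></table>"

row = BeautifulSoup(SAMPLE_ROW, "html.parser").find("tr")

# Every text node in the row, including the "\n" strings between tags.
raw = row.find_all(string=True)

# Only the non-empty strings, with surrounding whitespace removed.
clean = list(row.stripped_strings)

print(raw)
print(clean)
```

With this layout the cell texts sit at the odd positions of raw, whereas clean can be indexed by cell position directly.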

score 0 · Accepted Answer

You can also try this code:

import requests
from bs4 import BeautifulSoup
page = requests.get("http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm")
soup = BeautifulSoup(page.content, 'html.parser')
for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print(first_column, third_column)
