
I'm trying to parse information (an HTML table) from this site: http://www.511virginia.org/RoadConditions.aspx?j=All&r=1

Currently I'm using BeautifulSoup, and my code looks like this:

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()

url = "http://www.511virginia.org/RoadConditions.aspx?j=All&r=1"
page = mech.open(url)

html = page.read()
soup = BeautifulSoup(html)

table = soup.find("table")

rows = table.findAll('tr')[3]

cols = rows.findAll('td')

roadtype = cols[0].string
start = cols[1].string
end = cols[2].string
condition = cols[3].string
reason = cols[4].string
update = cols[5].string

entry = (roadtype, start, end, condition, reason, update)

print entry

The problem is with the start and end columns. They are just printed as None.

Output:

(u'Rt. 613N (Giles County)', None, None, u'Moderate', u'snow or ice', u'01/13/2010 10:50 AM')

I know they are stored in the cols list, but it seems the extra link tag is confusing the parsing. The original HTML looks like this:

<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td>
<td headers="start" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
<td headers="end" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)</a></td>
<td headers="condition" class="ConditionsCellText">Moderate</td>
<td headers="reason" class="ConditionsCellText">snow or ice</td>
<td headers="update" class="ConditionsCellText">01/13/2010 10:50 AM</td>

So what should be printed is:

(u'Rt. 613N (Giles County)', u'Big Stony Ck Rd; Rt. 635E/W (Giles County)', u'Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)', u'Moderate', u'snow or ice', u'01/13/2010 10:50 AM')

Any suggestions or help would be appreciated. Thank you in advance.

2 Answers

start = cols[1].find('a').string

or, more simply:

start = cols[1].a.string

or, better:

start = str(cols[1].find(text=True))

# cols is a list of <td> tags, so take the first text node of each cell
entry = [str(col.find(text=True)) for col in cols]
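
For readers who want to try the idea without BeautifulSoup installed, here is a minimal sketch using only Python 3's standard-library html.parser. It is not the approach from the question; it just illustrates the same principle as find(text=True): collect the text inside each <td> regardless of whether it sits directly in the cell or inside a nested <a>. The class name CellTextParser and the variable names are made up for this illustration, and the row is copied from the HTML in the question.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Hypothetical helper: collects the text of each <td>, nested tags included."""

    def __init__(self):
        super().__init__()
        self.cells = []      # text of every finished <td>
        self._parts = None   # fragments of the <td> being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "td" and self._parts is not None:
            self.cells.append("".join(self._parts).strip())
            self._parts = None

    def handle_data(self, data):
        # Called for text directly in the cell and for text inside a nested <a>.
        if self._parts is not None:
            self._parts.append(data)


# Row copied from the question's HTML sample.
row_html = (
    '<tr>'
    '<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td>'
    '<td headers="start" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>'
    '<td headers="end" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)</a></td>'
    '<td headers="condition" class="ConditionsCellText">Moderate</td>'
    '<td headers="reason" class="ConditionsCellText">snow or ice</td>'
    '<td headers="update" class="ConditionsCellText">01/13/2010 10:50 AM</td>'
    '</tr>'
)

parser = CellTextParser()
parser.feed(row_html)
parser.close()
cells = parser.cells
print(cells)
```

Because the text is gathered per cell rather than read from a single child node, the start and end columns come out as plain strings just like the other four.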
answered 2010-01-13T18:56:45.037

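I tried to reproduce your error, but the source HTML page has changed.

As for the error, I ran into a similar problem when trying to reproduce the example here, after swapping the suggested URL for a Wikipedia table.

I fixed it by moving to BeautifulSoup4: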

from bs4 import BeautifulSoup

and replacing .string with .get_text():

start = cols[1].get_text()
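
As a rough end-to-end sketch of this suggestion, the snippet below runs get_text() over the row quoted in the question. It assumes Python 3 with the bs4 package installed and uses the built-in "html.parser" backend; the HTML is pasted in as a string because the live page has changed, so this has not been checked against the real site.

```python
from bs4 import BeautifulSoup

# Row copied from the question, wrapped in a minimal table for parsing.
html = """
<table><tr>
<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td>
<td headers="start" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
<td headers="end" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)</a></td>
<td headers="condition" class="ConditionsCellText">Moderate</td>
<td headers="reason" class="ConditionsCellText">snow or ice</td>
<td headers="update" class="ConditionsCellText">01/13/2010 10:50 AM</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
cols = soup.find("tr").find_all("td")

# get_text() joins all text inside a tag, including text in nested tags;
# strip=True trims surrounding whitespace from each piece.
entry = tuple(col.get_text(strip=True) for col in cols)
print(entry)
```

Since get_text() does not care whether a cell wraps its text in a link, the same line works for all six columns.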

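I couldn't test this with your example (as I said, I could not reproduce the error), but I think it may be useful to people looking for a solution to this problem.

answered 2014-01-18T14:05:57.137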