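I'm learning how to write scrapers in Python on Scraperwiki. So far so good, but I've spent several days stuck on a problem I can't solve. I'm trying to grab all the links from a table. It works, but out of the list of links numbered 001 to 486, it only starts grabbing them at 045. The url/source is just a list of cities on the website; the source can be seen here:
http://www.tripadvisor.co.uk/pages/by_city.html
and the specific html starts here: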
</td></tr>
<tr><td class=dt1><a href="by_city_001.html">'s-Gravenzande, South Holland Province - Aberystwyth, Ceredigion, Wales</a></td>
<td class=dt1><a href="by_city_244.html">Los Corrales de Buelna, Cantabria - Lousada, Porto District, Northern Portugal</a></td>
</tr>
<tr><td class=dt1><a href="by_city_002.html">Abetone, Province of Pistoia, Tuscany - Adamstown, Lancaster County, Pennsylvania</a> </td>
<td class=dt1><a href="by_city_245.html">Louth, Lincolnshire, England - Lucciana, Haute-Corse, Corsica</a></td>
</tr>
<tr><td class=dt1><a href="by_city_003.html">Adamswiller, Bas-Rhin, Alsace - Aghir, Djerba Island, Medenine Governorate</a> </td>
<td class=dt1><a href="by_city_246.html">Luccianna, Haute-Corse, Corsica - Lumellogno, Novara, Province of Novara, Piedmont</a></td>
</tr>
What I'm after are the links from "by_city_001.html" to "by_city_486.html". Here is my code:
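import lxml.html
import scraperwiki

def scrapeCityList(pageUrl):
    # Download the page and parse it into an lxml element tree
    html = scraperwiki.scrape(pageUrl)
    root = lxml.html.fromstring(html)
    print(html)
    # Select every <a> inside a <td class="dt1"> cell
    links = root.cssselect('td.dt1 a')
    for link in links:
        # The href values are relative (e.g. by_city_001.html)
        url = 'http://www.tripadvisor.co.uk' + link.attrib['href']
        print(url)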
It is called in the code as follows:
scrapeCityList('http://www.tripadvisor.co.uk/pages/by_city.html')
Now, when I run it, it only ever returns the links starting from 045!
Output (045~486):
http://www.tripadvisor.co.ukby_city_045.html
http://www.tripadvisor.co.ukby_city_288.html
http://www.tripadvisor.co.ukby_city_046.html
http://www.tripadvisor.co.ukby_city_289.html
http://www.tripadvisor.co.ukby_city_047.html
http://www.tripadvisor.co.ukby_city_290.html
and so on...
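A side note on the output above: the printed URLs have no separator between the host and the file name, because the href values in the table are relative. Below is a minimal sketch of resolving them against the page URL with the standard library's `urljoin`. It assumes Python 3 (`urllib.parse`; on Python 2 the same function lives in the `urlparse` module) and that the page declares no `<base>` tag. `absolute_links` is a hypothetical helper name, not part of the scraper above.

```python
from urllib.parse import urljoin

# URL of the page the relative hrefs were found on
BASE_PAGE = 'http://www.tripadvisor.co.uk/pages/by_city.html'


def absolute_links(page_url, hrefs):
    """Resolve each relative href against the URL of the page it came from."""
    return [urljoin(page_url, href) for href in hrefs]


if __name__ == '__main__':
    for url in absolute_links(BASE_PAGE, ['by_city_045.html', 'by_city_288.html']):
        print(url)
```

Because `urljoin` resolves relative to the page's own directory, the results point under `/pages/` rather than at the site root.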
I tried changing the selector to:
links = root.cssselect('td.dt1')
and it grabs 487 "elements" like this:
<Element td at 0x13d75f0>
<Element td at 0x13d7650>
<Element td at 0x13d76b0>
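But I can't get the "href" value from those. I can't figure out why it loses the first 44 links when I select "a" in the cssselect line. I've looked at the code, but I can't see the cause.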
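For reference, here is a minimal sketch of pulling the href out of each `td` element, which also shows which cells contain no `a` at all. It runs on a small made-up sample rather than the real page, and it uses XPath instead of cssselect (for this sample, `//td[@class="dt1"]` selects the same cells as `td.dt1`). `hrefs_per_cell` is a hypothetical helper name. It does not explain why the first 44 links go missing; it is only a way to inspect the cells one by one.

```python
import lxml.html

# Made-up sample in the same shape as the table; the third cell has no link
SAMPLE_HTML = """<table>
<tr><td class=dt1><a href="by_city_001.html">A</a></td>
<td class=dt1><a href="by_city_244.html">B</a></td></tr>
<tr><td class=dt1>no link here</td>
<td class=dt1><a href="by_city_245.html">C</a></td></tr>
</table>"""


def hrefs_per_cell(html_text):
    """Return one entry per td.dt1 cell: its first href, or None if it has no <a>."""
    root = lxml.html.fromstring(html_text)
    results = []
    for td in root.xpath('//td[@class="dt1"]'):
        # Look for href attributes on any <a> nested inside this cell
        hrefs = td.xpath('.//a/@href')
        results.append(hrefs[0] if hrefs else None)
    return results


if __name__ == '__main__':
    for index, href in enumerate(hrefs_per_cell(SAMPLE_HTML)):
        print(index, href)
```

Run against the real page, any `None` entries would mark cells where the parser did not see an `a` element inside the `td`.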
Thanks in advance for any help!
Claire