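I've written a script in Python to scrape all the links under the table titled "England". When the script reaches each inner page through those links, it scrapes the next-page link from that page. I know that if I fix the XPath expressions used in the script, I would probably get unique next-page URLs.

However, the main goal here is to work out why my script still produces duplicate results even though I'm using set().

My script: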
import requests
from lxml.html import fromstring
from urllib.parse import urljoin

link = "http://tennishub.co.uk/"

processed_links = set()
processed_nextpage_links = set()

def get_links(url):
    response = requests.get(url)
    tree = fromstring(response.text)

    unprocessed_links = [urljoin(link, item.xpath('.//a/@href')[0]) for item in tree.xpath('//*[@class="countylist"]')]
    for nlink in unprocessed_links:
        if nlink not in processed_links:
            processed_links.add(nlink)
        get_nextpage_links(processed_links)

def get_nextpage_links(itemlinks):
    for ilink in itemlinks:
        response = requests.get(ilink)
        tree = fromstring(response.text)
        titles = [title.xpath('.//a/@href')[0] for title in tree.xpath('//div[@class="pagination"]') if title.xpath('.//a/@href')]
        for ititle in titles:
            if ititle not in processed_nextpage_links:
                processed_nextpage_links.add(ititle)

        for rlink in processed_nextpage_links:
            print(rlink)

if __name__ == '__main__':
    get_links(link)

The result I'm getting:
/tennis-clubs-by-county/Durham/2
/tennis-clubs-by-county/Durham/2
/tennis-clubs-by-county/Durham/2
/tennis-clubs-by-county/Cheshire/2
/tennis-clubs-by-county/Derbyshire/2
/tennis-clubs-by-county/Durham/2
/tennis-clubs-by-county/Cheshire/2
/tennis-clubs-by-county/Derbyshire/2
/tennis-clubs-by-county/Durham/2
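
For what it's worth, output like this is consistent with the set being printed in full on every pass of a loop, rather than with the set itself holding duplicates (a set cannot). The sketch below reproduces that pattern without any network access. It assumes the print loop sits inside the loop over inner pages, which may not match the real script. `FAKE_PAGES`, the "Kent" entry, and all function names are hypothetical stand-ins, not part of the original code.

```python
import io
from contextlib import redirect_stdout

# Hypothetical stand-in for the scraped pages: inner-page URL -> next-page href.
# None means the page has no pagination link.
FAKE_PAGES = {
    "county/Durham": "/tennis-clubs-by-county/Durham/2",
    "county/Kent": None,
    "county/Cheshire": "/tennis-clubs-by-county/Cheshire/2",
}


def print_inside_loop(pages):
    """Mimic the suspected structure: print the whole set on every pass."""
    seen = set()
    for href in pages.values():
        if href:
            seen.add(href)
        for item in sorted(seen):  # runs once per page, so lines repeat
            print(item)
    return seen


def print_after_loop(pages):
    """Collect everything first, then print the set exactly once."""
    seen = set()
    for href in pages.values():
        if href:
            seen.add(href)
    for item in sorted(seen):
        print(item)
    return seen


def captured_lines(func, pages):
    """Run func(pages) and return the lines it printed."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func(pages)
    return buffer.getvalue().splitlines()


if __name__ == "__main__":
    repeated = captured_lines(print_inside_loop, FAKE_PAGES)
    once = captured_lines(print_after_loop, FAKE_PAGES)
    print(len(repeated), "lines when printing inside the loop")
    print(len(once), "lines when printing after the loop")
```

Both functions end up with the same two-element set; only the number of printed lines differs. If the real script is structured the way this sketch assumes, moving the print loop so it runs once, after all pages have been processed, should remove the repeated lines. Calling `get_nextpage_links` on the growing `processed_links` set for every new link would multiply the repeats (and the requests) further.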