python - Unable to kick out duplicate results using set

Question

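I've written a script in Python to scrape all the links under the table titled England. When the script reaches an inner page through one of those links, it scrapes that page's next-page link. I know that if I fix the xpath used in the script, I will probably get unique next-page URLs.

However, the main goal here is to determine why my script produces duplicates even though I'm using set().

My script: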

import requests
from lxml.html import fromstring
from urllib.parse import urljoin

link = "http://tennishub.co.uk/"

processed_links = set()
processed_nextpage_links = set()

def get_links(url):
    response = requests.get(url)
    tree = fromstring(response.text)

    unprocessed_links = [urljoin(link,item.xpath('.//a/@href')[0]) for item in tree.xpath('//*[@class="countylist"]')]
    for nlink in unprocessed_links:
        if nlink not in processed_links:
            processed_links.add(nlink)
    get_nextpage_links(processed_links)

def get_nextpage_links(itemlinks):
    for ilink in itemlinks:
        response = requests.get(ilink)
        tree = fromstring(response.text)
        titles = [title.xpath('.//a/@href')[0] for title in tree.xpath('//div[@class="pagination"]') if title.xpath('.//a/@href')]
        for ititle in titles:
            if ititle not in processed_nextpage_links:
                processed_nextpage_links.add(ititle)

        for rlink in processed_nextpage_links:
            print(rlink)

if __name__ == '__main__':
    get_links(link)

The result I'm getting:

/tennis-clubs-by-county/Durham/2
/tennis-clubs-by-county/Durham/2
/tennis-clubs-by-county/Durham/2
/tennis-clubs-by-county/Cheshire/2
/tennis-clubs-by-county/Derbyshire/2
/tennis-clubs-by-county/Durham/2
/tennis-clubs-by-county/Cheshire/2
/tennis-clubs-by-county/Derbyshire/2
/tennis-clubs-by-county/Durham/2

score 2 · Accepted Answer

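You are printing all the links collected so far every time you call get_nextpage_links.

I guess you will want to remove the print entirely and print the collection when you are done, preferably outside any def (this makes your functions reusable and defers any external side effects to the calling code).

A better solution without global variables would probably be to have get_links collect a set and return it, pass a reference to that set to get_nextpage_links when calling it, and (obviously) have that function add any new links to it.

Because you are using a set, there is no need to check whether a link is already in the set before adding it. It is not possible to add a duplicate to this data type.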
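The advice above can be sketched without touching the network. This is a minimal illustration, not the asker's scraper: collect_next_pages and fake_pages are hypothetical names, and the nested lists stand in for the hrefs that the xpath queries would return for each county page.

```python
def collect_next_pages(hrefs_per_page, seen=None):
    """Add every href to `seen` and return it; no membership check is needed."""
    if seen is None:
        seen = set()
    for hrefs in hrefs_per_page:
        for href in hrefs:
            seen.add(href)  # adding an element that is already present is a no-op
    return seen


# Hypothetical stand-in for the hrefs parsed from three county pages.
fake_pages = [
    ["/tennis-clubs-by-county/Durham/2", "/tennis-clubs-by-county/Durham/2"],
    ["/tennis-clubs-by-county/Cheshire/2"],
    ["/tennis-clubs-by-county/Durham/2"],
]

unique_links = collect_next_pages(fake_pages)

# Print once, in the calling code, after collection has finished.
for link in sorted(unique_links):
    print(link)
```

Because the function only returns the set, the caller decides when (and whether) to print, so each link shows up exactly once.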

score 2 · Accepted Answer

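Try the script below. It turns out that your xpath had some flaws that made it parse only one particular block out of several, as @tripleee (supposedly) already mentioned in his comment. I've used set() in a slightly different way in the following script. It should now produce unique links.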

import requests
from lxml.html import fromstring
from urllib.parse import urljoin

link = "http://tennishub.co.uk/"

def get_links(url):
    response = requests.get(url)
    tree = fromstring(response.text)
    crude_links = set([urljoin(link,item) for item in tree.xpath('//*[@class="countylist"]//a/@href') if item])
    return crude_links

def get_nextpage(link):
    response = requests.get(link)
    tree = fromstring(response.text)
    titles = set([title for title in tree.xpath('//div[@class="pagination"]//a/@href') if title])
    return titles

if __name__ == '__main__':
    for next_page in get_links(link):
        for unique_link in get_nextpage(next_page):
            print(unique_link)

score 1 · Accepted Answer

Every time you run

        for rlink in processed_nextpage_links:
            print(rlink)

you print the whole set again, because this for loop sits inside the for loop that adds links to your set.
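The effect can be reproduced with a small simulation. This is only a sketch under assumptions: printed_lines and the pages list are hypothetical, and the hrefs are hard-coded rather than scraped. It counts how many lines would be printed when the print loop is nested inside the collecting loop versus placed after it.

```python
def printed_lines(hrefs_per_page, print_inside_loop):
    """Return the lines that would be printed for a given placement of the print loop."""
    seen = set()
    lines = []
    for hrefs in hrefs_per_page:
        seen.update(hrefs)
        if print_inside_loop:
            # Mirrors the question: the whole set is dumped on every iteration.
            lines.extend(sorted(seen))
    if not print_inside_loop:
        # Print loop moved out of the collecting loop: the set is dumped once.
        lines.extend(sorted(seen))
    return lines


# Hypothetical hrefs found on two county pages.
pages = [
    ["/tennis-clubs-by-county/Cheshire/2"],
    ["/tennis-clubs-by-county/Derbyshire/2"],
]

nested = printed_lines(pages, print_inside_loop=True)
after = printed_lines(pages, print_inside_loop=False)
print(len(nested), len(after))
```

The set itself never holds a duplicate in either case; the repeated lines come only from printing the same set more than once.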
