python - How to make a Python web crawler run indefinitely and record each link only once

Question

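With help from thenewboston, I was able to build a nice little web crawler in Python. After watching his videos, I played around with it and added a few things. I tried to make it infinite, in that it would record every link found on every link, but I haven't managed to. I also have the problem of the same link being recorded multiple times. How would I fix this?

Here is my code.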

import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = ''
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"):
            href = link.get("href")
            title = link.get("title")
            links = []
            #print(href)
            #print(title)
            try:
                get_single_user_data(href)
            except:
                pass
        page += 1

def get_single_user_data(user_url):
    source_code = requests.get(user_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    #for item_name in soup.findAll('span', {'id':'mm-saleDscPrc'}):
    #   print(item_name.string)
    for link in soup.findAll("a"):
        href = link.get("href")
        print(href)


spider(1)

score -1 · Accepted Answer

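I tried to make it infinite, so that it would follow every link on every link it records

That isn't going to happen unless you own a decent-sized data center. But for the sake of it: you just need a larger starting pool of sites to crawl for links to other sites, and you will get far enough. Start with all of the outbound links from something like Reddit.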
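As a rough illustration of the "starting pool" idea, here is a minimal sketch of a crawl frontier: a queue seeded with several URLs that keeps growing as pages are processed, capped by a page budget so it always stops. The names `crawl_from_seeds`, `get_links`, and `max_pages` are assumptions made for this example and are not part of the question's code; `get_links` stands in for whatever function fetches a page and returns its links. De-duplication is deliberately left out here, so a page linked from several places is fetched again.

```python
from collections import deque


def crawl_from_seeds(seeds, get_links, max_pages=100):
    """Breadth-first crawl starting from a pool of seed URLs.

    get_links is a hypothetical callable: url -> list of URLs found there.
    max_pages is a budget; an unbounded crawl would never finish.
    """
    frontier = deque(seeds)   # URLs waiting to be fetched
    fetched = []              # URLs fetched so far, in fetch order
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        fetched.append(url)
        # Every link found on this page joins the back of the queue.
        frontier.extend(get_links(url))
    return fetched
```

In practice `get_links` could wrap the `requests.get` and `soup.findAll("a")` calls from the question; a larger seed list simply means the queue starts out with more places to go.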

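I also have the problem of recording the same link multiple times?

I suggest keeping a hash table of the links you have already visited, and checking whether a link is in it before you visit it.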
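A minimal sketch of that suggestion, assuming Python's built-in `set` (which is hash-based) as the hash table. The helper names `normalize` and `record_new_links` are made up for this example. Resolving relative links and dropping `#fragment` suffixes before the lookup is an extra assumption on my part, so that two spellings of the same page count as one link.

```python
from urllib.parse import urljoin, urldefrag


def normalize(base_url, href):
    """Resolve href against base_url and drop any #fragment."""
    absolute = urljoin(base_url, href)
    return urldefrag(absolute).url


def record_new_links(base_url, hrefs, visited):
    """Return the links in hrefs that were not seen before.

    visited is a set shared across the whole crawl; it is updated in place.
    """
    new_links = []
    for href in hrefs:
        if not href:
            # link.get("href") gives None for <a> tags without an href
            continue
        url = normalize(base_url, href)
        if url not in visited:      # average O(1) hash lookup
            visited.add(url)
            new_links.append(url)
    return new_links
```

With the question's code, this would mean creating one `visited = set()` outside the loops, collecting the `href` values from `soup.findAll("a")`, and only calling `get_single_user_data` on what `record_new_links` returns.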
