python - Recursion in a Python web crawler

Question

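I'm trying to write a small web crawler in Python. What seems to be tripping me up is the recursive part and the depth. Given a URL and a maximum depth of links I want to follow, I add the URL to the set of crawled sites and download all the text and links from that page. For every link found on the page, I want to crawl that link and collect its words and links as well. The problem is that by the time I recursively call the next URL, the depth is already maxDepth, so it stops after just one more page. Hopefully I explained that decently. Basically, my question is: how do I make all the recursive calls and only then set self._depth += 1?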

def crawl(self,url,maxDepth):        

    self._listOfCrawled.add(url)

    text = crawler_util.textFromURL(url).split()

    for each in text:
        self._index[each] = url

    links = crawler_util.linksFromURL(url)

    if self._depth < maxDepth:
        self._depth = self._depth + 1
        for i in links:
            if i not in self._listOfCrawled:
                self.crawl(i,maxDepth)

score 3 · Accepted Answer

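The problem with your code is that self._depth is incremented every time the function follows links, and since it is an instance variable, it stays incremented in all subsequent calls. Suppose maxDepth is 3, and you have a URL A that links to pages B and C, where B links to D and C links to E. Your call hierarchy then looks like this (assuming self._depth starts at 0):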

crawl(self, A, 3)          # self._depth set to 1, following links to B and C
    crawl(self, B, 3)      # self._depth set to 2, following link to D
        crawl(self, D, 3)  # self._depth set to 3, no links to follow
    crawl(self, C, 3)      # self._depth >= maxDepth, skipping link to E

In other words, you are tracking the cumulative number of calls to crawl, not the depth of the current call.
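To see this behaviour without touching the network, here is a minimal sketch that replays the A/B/C/D/E example. The LINKS dictionary is a made-up stand-in for crawler_util.linksFromURL, and visited_order is a hypothetical attribute added only to record the order of visits; neither is part of the original code.

```python
# Hypothetical in-memory link graph standing in for crawler_util.linksFromURL
LINKS = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}


class BuggyCrawler:
    def __init__(self):
        self._depth = 0              # one counter shared by every recursive call
        self._listOfCrawled = set()
        self.visited_order = []      # illustration only: records visit order

    def crawl(self, url, maxDepth):
        self._listOfCrawled.add(url)
        self.visited_order.append(url)
        if self._depth < maxDepth:
            # Incremented on the way down, never decremented on the way back up
            self._depth = self._depth + 1
            for link in LINKS[url]:
                if link not in self._listOfCrawled:
                    self.crawl(link, maxDepth)


crawler = BuggyCrawler()
crawler.crawl("A", 3)
print(crawler.visited_order)  # ['A', 'B', 'D', 'C']
```

E is never reached even though it is only two links away from A, because the counter is already at 3 by the time C is crawled.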

Instead, try something like this:

def crawl(self,url,depthToGo):
    # call this method with depthToGo set to maxDepth
    self._listOfCrawled.add(url)
    text = crawler_util.textFromURL(url).split()
    for each in text:
        # if word not in index, create a new set, then add URL to set
        if each not in self._index:
            self._index[each] = set()
        self._index[each].add(url)
    links = crawler_util.linksFromURL(url)
    # check if we can go deeper
    if depthToGo > 0:
        for i in links:
            if i not in self._listOfCrawled:
                # decrease depthToGo for next level of recursion
                self.crawl(i, depthToGo - 1)
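Here is a self-contained sketch of the same idea that can be run offline. The LINKS and TEXT dictionaries are invented stand-ins for crawler_util.linksFromURL and crawler_util.textFromURL, and dict.setdefault is used as a shorthand for the "create a set if missing" step; treat it as an illustration under those assumptions, not as the only way to write it.

```python
# Hypothetical stand-ins for crawler_util.linksFromURL / textFromURL
LINKS = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
TEXT = {"A": "alpha shared", "B": "beta", "C": "gamma shared",
        "D": "delta", "E": "epsilon"}


class Crawler:
    def __init__(self):
        self._listOfCrawled = set()
        self._index = {}             # word -> set of URLs containing it

    def crawl(self, url, depthToGo):
        self._listOfCrawled.add(url)
        for word in TEXT[url].split():
            # Create the set on first sight of the word, then add the URL
            self._index.setdefault(word, set()).add(url)
        # The remaining depth is a parameter, so each call has its own value
        if depthToGo > 0:
            for link in LINKS[url]:
                if link not in self._listOfCrawled:
                    self.crawl(link, depthToGo - 1)


deep = Crawler()
deep.crawl("A", 3)
print(sorted(deep._listOfCrawled))     # ['A', 'B', 'C', 'D', 'E']

shallow = Crawler()
shallow.crawl("A", 1)
print(sorted(shallow._listOfCrawled))  # ['A', 'B', 'C']
```

Because depthToGo lives on the call stack rather than on the instance, returning from the B branch automatically restores the value that the C branch should see.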
