
I'm new to Python. In the code below I have a crawler that recurses on newly discovered links. After recursing from the root link, the program seems to stop after printing only a few links, even though it should keep running for quite a while. I am catching and printing exceptions, but the program terminates cleanly, so I'm not sure why it stops.

from urllib import urlopen
from bs4 import BeautifulSoup

def crawl(url, seen):
    try:
        if any(url in s for s in seen):
            return 0
        html = urlopen(url).read()

        soup = BeautifulSoup(html)
        for tag in soup.findAll('a', href=True):
            str = tag['href']
            if 'http' in str:
                print tag['href']
                seen.append(str)
                print "--------------"
                crawl(str, seen)
    except Exception, e:
        print e
        return 0

def main ():
    print "$ = " , crawl("http://news.google.ca", [])


if __name__ == "__main__":
    main()

2 Answers

    for tag in soup.findAll('a', href=True):
        str = tag['href']
        if 'http' in str:
            print tag['href']
        seen.append(str)        # you put the newly found url into *seen*
            print "--------------"
            crawl(str, seen)        # then you try to crawl it

But at the beginning of crawl:

if any(url in s for s in seen): # you don't crawl urls that are already in *seen*
    return 0

You should add url to seen when you actually crawl it, not when you find it.
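A minimal sketch of that change, keeping the question's Python 2 code otherwise as-is: the seen.append moves out of the discovery loop and up to the top of crawl, right after the duplicate check.

def crawl(url, seen):
    try:
        if any(url in s for s in seen):
            return 0
        seen.append(url)            # record the url only when we actually crawl it
        html = urlopen(url).read()
        soup = BeautifulSoup(html)
        for tag in soup.findAll('a', href=True):
            str = tag['href']
            if 'http' in str:
                print tag['href']
                print "--------------"
                crawl(str, seen)    # the recursive call now does the bookkeeping
    except Exception, e:
        print e
        return 0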

answered 2012-07-28T09:31:44.307
try:
    if any(url in s for s in seen):
        return 0

and then

seen.append(str)
print "--------------"
crawl(str, seen)

You append str to seen, and then you call crawl with str and seen as arguments. That call returns immediately, because the URL you just appended is already in seen, so the recursion never goes any deeper and the program exits once it has gone through the first page's links. That is how you designed it.

A better approach is to crawl one page, add every link it finds to a list of pages still to crawl, and then keep crawling the links in that list.

Simply put, you should do a breadth-first crawl, not a depth-first crawl.

Something like this should work.

from urllib import urlopen
from bs4 import BeautifulSoup

def crawl(url, seen, to_crawl):
    html = urlopen(url).read()
    soup = BeautifulSoup(html)
    seen.append(url)                        # mark this page as crawled
    for tag in soup.findAll('a', href=True):
        str = tag['href']
        if 'http' in str:
            # queue the link only if it has not been crawled or queued already
            if str not in seen and str not in to_crawl:
                to_crawl.append(str)
                print tag['href']
                print "--------------"
    if to_crawl:                            # stop when there is nothing left to crawl
        # pop from the front of the list so the crawl is breadth-first
        crawl(to_crawl.pop(0), seen, to_crawl)

def main():
    print "$ = ", crawl("http://news.google.ca", [], [])


if __name__ == "__main__":
    main()

You will probably want to limit the maximum depth, or the maximum number of URLs, that it will crawl, though.
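For instance, here is a rough sketch of capping the total number of crawled pages, using a hypothetical max_urls parameter (not part of the answer above) added on top of the breadth-first version:

def crawl(url, seen, to_crawl, max_urls=100):
    # max_urls is a hypothetical limit; stop once that many pages have been crawled
    if len(seen) >= max_urls:
        return
    html = urlopen(url).read()
    soup = BeautifulSoup(html)
    seen.append(url)
    for tag in soup.findAll('a', href=True):
        str = tag['href']
        if 'http' in str:
            if str not in seen and str not in to_crawl:
                to_crawl.append(str)
    if to_crawl:
        crawl(to_crawl.pop(0), seen, to_crawl, max_urls)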

answered 2012-07-28T09:30:32.137