
I made a web crawler that collects the links and link text for all pages under a given address, like this:

import urllib
from bs4 import BeautifulSoup
import urlparse
import mechanize

url = ["http://adbnews.com/area51"]


for u in url:
    br = mechanize.Browser()
    urls = [u]       # queue of pages still to visit
    visited = [u]    # pages already queued, to avoid revisiting
    i = 0
    while i < len(urls):
        try:
            br.open(urls[0])
            urls.pop(0)

            for link in br.links():

                levelLinks = []
                linkText = [] 

                newurl = urlparse.urljoin(link.base_url, link.url)
                b1 = urlparse.urlparse(newurl).hostname
                b2 = urlparse.urlparse(newurl).path
                newurl = "http://"+b1+b2
                linkTxt = link.text
                linkText.append(linkTxt)
                levelLinks.append(newurl)


                if newurl not in visited and urlparse.urlparse(u).hostname in newurl:
                    urls.append(newurl)
                    visited.append(newurl)
                    #print newurl

                    #get Mechanize Links
                    for l,lt in zip(levelLinks,linkText):
                        print newurl,"\n",lt,"\n"


        except:
            # if the page cannot be opened, drop it and move on
            urls.pop(0)

It produces results like this:

http://www.adbnews.com/area51/contact.html 
CONTACT 

http://www.adbnews.com/area51/about.html 
ABOUT 

http://www.adbnews.com/area51/index.html 
INDEX 

http://www.adbnews.com/area51/1st/ 
FIRST LEVEL! 

http://www.adbnews.com/area51/1st/bling.html 
BLING 

http://www.adbnews.com/area51/1st/index.html 
INDEX 

http://adbnews.com/area51/2nd/ 
2ND LEVEL 

I would like to add a counter that limits how deep the crawler goes.

For example, I tried adding steps = 3 and changing while i < len(urls) to while i < steps:

But that only gets into the first level, even though the number says 3...

Any suggestions are welcome.


1 Answer


If you want to search to a certain "depth", consider using a recursive function rather than just appending to a list of URLs.

def crawl(url, depth):
  if depth <= 3:
    # scan the page here and collect its links and title
    for link in links:
      print crawl(link, depth + 1)
  return url + "\n" + title

That makes the recursive search easier to control, and it is faster and uses fewer resources :)
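
For what it's worth, a more complete sketch of that idea, wired into the mechanize/urlparse setup from the question, might look like the following. The function name crawl, the host/max_depth/visited parameters, and the limit of 3 are just illustrative choices, not a definitive implementation. The key point is that the depth is passed down explicitly with each recursive call, whereas the while i < steps change only caps how many queued pages get processed, not how far they are from the start page.

import mechanize
import urlparse

def crawl(url, host, depth, max_depth, visited):
    """Recursively follow links on `host`, going at most `max_depth` levels deep."""
    if depth > max_depth or url in visited:
        return
    visited.add(url)

    br = mechanize.Browser()
    try:
        br.open(url)
    except Exception:
        # skip pages that fail to open, like the try/except in the question
        return

    for link in br.links():
        newurl = urlparse.urljoin(link.base_url, link.url)
        # same "stay on this host" filter as the original crawler
        if newurl not in visited and host in newurl:
            print newurl, "\n", link.text, "\n"
            crawl(newurl, host, depth + 1, max_depth, visited)

start = "http://adbnews.com/area51"
crawl(start, urlparse.urlparse(start).hostname, 0, 3, set())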

answered 2013-08-06T08:38:59.327