I wrote a web crawler that prints the URL and link text of every link it finds on the pages under a given address. Here is the code:
import urllib
from bs4 import BeautifulSoup
import urlparse
import mechanize

url = ["http://adbnews.com/area51"]

for u in url:
    br = mechanize.Browser()
    urls = [u]       # queue of pages still to open
    visited = [u]    # every URL that has already been queued
    i = 0
    while i<len(urls):
        try:
            br.open(urls[0])
            urls.pop(0)
            for link in br.links():
                levelLinks = []
                linkText = []
                # resolve relative links, then keep only host + path
                newurl = urlparse.urljoin(link.base_url, link.url)
                b1 = urlparse.urlparse(newurl).hostname
                b2 = urlparse.urlparse(newurl).path
                newurl = "http://"+b1+b2
                linkTxt = link.text
                linkText.append(linkTxt)
                levelLinks.append(newurl)
                # queue the link only if it is new and contains the start host
                if newurl not in visited and urlparse.urlparse(u).hostname in newurl:
                    urls.append(newurl)
                    visited.append(newurl)
                    #print newurl
                #get Mechanize Links
                for l,lt in zip(levelLinks,linkText):
                    print newurl,"\n",lt,"\n"
        except:
            urls.pop(0)
It produces output like this:
http://www.adbnews.com/area51/contact.html
CONTACT
http://www.adbnews.com/area51/about.html
ABOUT
http://www.adbnews.com/area51/index.html
INDEX
http://www.adbnews.com/area51/1st/
FIRST LEVEL!
http://www.adbnews.com/area51/1st/bling.html
BLING
http://www.adbnews.com/area51/1st/index.html
INDEX
http://adbnews.com/area51/2nd/
2ND LEVEL
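I would like to add a counter that limits how deep the crawler goes.
For example, I tried adding steps = 3
and changing while i<len(urls) to
while i<steps:
but that only crawls the first level, even though the number is set to 3...
Any suggestions are welcome.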
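
One possible direction, offered as a sketch rather than a definitive fix: a counter on the while loop counts loop iterations, not link levels, so it cannot tell how far a page is from the start URL. A common alternative is to store the depth next to each URL in the queue and stop expanding pages once that depth reaches the limit. The example below runs without network access. The names crawl, get_links, fake_links, SITE and max_depth are made up for illustration, and the toy link graph stands in for real pages.

```python
from collections import deque


def crawl(start, get_links, max_depth):
    """Breadth-first crawl that follows links at most max_depth levels from start.

    get_links is a hypothetical callable: url -> list of (url, text) pairs.
    Returns a list of (url, text, depth) tuples in the order they were found.
    """
    visited = {start}
    queue = deque([(start, 0)])  # each entry carries its own depth
    found = []
    while queue:
        page, depth = queue.popleft()
        if depth >= max_depth:
            continue  # page sits at the limit: do not open it
        for link, text in get_links(page):
            found.append((link, text, depth + 1))
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return found


# Toy link graph (an assumption for this example, not real site data).
SITE = {
    "/": [("/1st/", "FIRST LEVEL!"), ("/about.html", "ABOUT")],
    "/1st/": [("/1st/bling.html", "BLING"), ("/2nd/", "2ND LEVEL")],
    "/2nd/": [("/2nd/deep.html", "DEEP")],
}


def fake_links(url):
    """Stand-in for fetching a page and listing its links."""
    return SITE.get(url, [])


if __name__ == "__main__":
    for link, text, depth in crawl("/", fake_links, max_depth=2):
        print(depth, link, text)
```

In the mechanize version above, get_links would presumably wrap br.open(url) and the loop over br.links(), returning the normalised newurl together with link.text. The same-host check from the original code would go next to the visited check.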