Can you tell me what I'm doing wrong in my Twisted approach? For quite a while I've been trying to build a fast web scraper. Building a traditional threaded scraper with Queue was a piece of cake and, so far, it is also faster. Still, I want to compare it against Twisted! The goal of the webscraper is to recursively find image links (<a>) in a gallery, follow those links to scrape the images (<img>) and/or collect more image links to parse later. The code is shown below. Most functions are passed a dictionary so that, conceptually, all the information about each link stays packaged together. I thread the code that would otherwise block (the parsePage function) and use "asynchronous code" (or so I believe) to retrieve the HTML pages, header information, and images.
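The overall pattern I'm aiming for looks roughly like this (a simplified sketch with placeholder names, using deferToThread instead of the callInThread/callFromThread pair in my actual code below, which I believe amounts to the same thing):

from twisted.internet import reactor, threads
from twisted.web import client

def fetch(info):
    # info is a plain dict carrying everything known about one link
    d = client.getPage(info['url'])
    # push the blocking lxml work onto the reactor's thread pool
    d.addCallback(lambda page: threads.deferToThread(parse_blocking, page, info))
    d.addErrback(lambda err: err.printTraceback())
    return d

def parse_blocking(page, info):
    # runs in a worker thread; any new links found here would be scheduled
    # back onto the reactor with reactor.callFromThread(fetch, child_info)
    return info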
My main problem so far has been tracking down a ton of "User timeout caused connection failure" errors coming from my getLinkHTML or getImgHeader errbacks. I have tried limiting the number of connections I make with a semaphore, and even putting some of my code to sleep, thinking I was flooding the connections, but to no avail. I also thought the problem might come from reactor.connectTCP, since the timeout errors show up about 30 seconds after the scraper starts and connectTCP has a 30-second timeout. However, after I modified the connectTCP code in the twisted module to 60 seconds, the timeout errors still appeared roughly 30 seconds after starting. Of course, scraping the same sites with my traditional threaded scraper works fine, and a lot faster.
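To make the semaphore idea concrete, what I mean is roughly the following (a simplified sketch, not my actual attempt; as far as I can tell, getPage also accepts a timeout keyword, which would avoid editing connectTCP in the Twisted source at all):

from twisted.internet import defer
from twisted.web import client

# allow at most 10 page fetches in flight at any one time
sem = defer.DeferredSemaphore(10)

def limited_fetch(url):
    # DeferredSemaphore.run acquires a token, calls getPage, and releases
    # the token when the returned Deferred fires
    return sem.run(client.getPage, url, timeout=60)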
So what am I doing wrong? Also, since I am self-taught, please feel free to criticize my code in general; I have also left a few random questions as comments throughout it. Any advice is much appreciated!
from twisted.internet import defer
from twisted.internet import reactor
from twisted.web import client
from lxml import html
from StringIO import StringIO
from os import path
import re
start_url = "http://www.thesupermodelsgallery.com/"
directory = "/home/z0e/Pictures/Pix/Twisted"
min_img_size = 100000
#maximum <a> links to get from main gallery
max_gallery_links = 500
#maximum <a> links to get from subsequent gallery/pages
max_picture_links = 35
def parsePage(info):
def linkFilter(link):
#filter unwanted <a> links
if link is not None:
trade_match = re.search(r'&trade=', link)
href_split = link.split('=')
for i in range(len(href_split)):
if 'www' in href_split[i] and i > 0:
link = href_split[i]
end_pattern = r'\.(com|com/|net|net/|pro|pro/)$'
end_match = re.search(end_pattern, link)
p_pattern = r'(.*)&p'
p_match = re.search(p_pattern, link)
if end_match or trade_match:
return None
elif p_match:
link = p_match.group(1)
return link
else:
return link
else:
return None
# better to handle a link with 'None' value through TypeError
# exception or through if else statements? Compare linkFilter
# vs. imgFilter functions
def imgFilter(link):
#filter <img> links to retain only .jpg
try:
            jpg_match = re.search(r'\.jpg', link)
if jpg_match is not None:
return link
else:
return None
except TypeError:
return None
link_num = 0
gallery_flag = None
info['level'] += 1
    if info['page'] == '':
        return None
# use lxml to parse and get document root
tree = html.parse(StringIO(info['page']))
root = tree.getroot()
root.make_links_absolute(info['url'])
# info['level'] = 1 corresponds to first recursive layer (i.e. main gallery page)
# info['level'] > 1 will be all other <a> links from main gallery page
if info['level'] == 1:
link_cap = max_gallery_links
gallery_flag = True
else:
link_cap = max_picture_links
gallery_flag = False
if info['level'] > 4:
return None
else:
# get <img> links if page is not main gallery ('gallery_flag = False')
# put <img> links back into main event loop to extract header information
# to judge pictures by picture size (i.e. content-length)
if not gallery_flag:
for elem in root.iter('img'):
# create copy of info so that dictionary no longer points to
# previous dictionary, but new dictionary for each link
info = info.copy()
info['url'] = imgFilter(elem.get('src'))
if info['url'] is not None:
reactor.callFromThread(getImgHeader, info)
# get <a> link and put work back into main event loop (i.e. w/
# reactor.callFromThread...) to getPage and then parse, continuing the
# cycle of linking
for elem in root.iter('a'):
if link_num > link_cap:
break
else:
img = elem.find('img')
if img is not None:
link_num += 1
info = info.copy()
info['url'] = linkFilter(elem.get('href'))
if info['url'] is not None:
reactor.callFromThread(getLinkHTML, info)
def getLinkHTML(info):
# get html from <a> link and then send page to be parsed in a thread
d = client.getPage(info['url'])
d.addCallback(parseThread, info)
d.addErrback(failure, "getLink Failure: " + info['url'])
def parseThread(page, info):
print 'parsethread:', info['url']
info['page'] = page
reactor.callInThread(parsePage, info)
def getImgHeader(info):
# get <img> header information to filter images by image size
agent = client.Agent(reactor)
d = agent.request('HEAD', info['url'], None, None)
d.addCallback(getImg, info)
d.addErrback(failure, "getImgHeader Failure: " + info['url'])
def getImg(img_header, info):
# download image only if image is above a certain threshold size
img_size = img_header.headers.getRawHeaders('Content-Length')
    # check for a missing Content-Length header before indexing into it
    if img_size is not None and int(img_size[0]) > min_img_size:
img_name = ''.join(map(urlToName, info['url']))
client.downloadPage(info['url'], path.join(directory, img_name))
else:
img_header, link = None, None #Does this help garbage collecting?
def urlToName(char):
#convert all unwanted characters to '-' from url and use as file name
if char in '/\?|<>"':
return '-'
else:
return char
def failure(error, url):
print error
print url
def main():
info = dict()
info['url'] = start_url
info['level'] = 0
reactor.callWhenRunning(getLinkHTML, info)
reactor.suggestThreadPoolSize(2)
reactor.run()
if __name__ == "__main__":
main()