At the moment I can get an endless stream of crawled links from softpedia.com (including the installer links I want, e.g. http://hotdownloads.com/trialware/download/Download_a1keylogger.zip?item=33649-3&affiliate=22260).
spider.py is as follows:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    """Crawl through the web sites you specify."""
    name = "softpedia"

    # Stay within these domains when crawling
    allowed_domains = ["www.softpedia.com"]
    start_urls = ["http://win.softpedia.com/"]
    download_delay = 2

    # Follow every link found on the crawled pages
    rules = [
        Rule(SgmlLinkExtractor(), follow=True),
    ]
items.py, pipelines.py and settings.py are the defaults, except for one line added to settings.py:
FILES_STORE = '/home/test/softpedia/downloads'
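As far as I understand the docs, FILES_STORE only takes effect when the files pipeline is enabled as well, so I assume settings.py would also need something like this (a sketch; the pipeline path matches the scrapy.contrib layout my imports come from):

# settings.py (sketch): FILES_STORE alone does nothing unless the
# built-in files pipeline is enabled as well
ITEM_PIPELINES = {'scrapy.contrib.pipeline.files.FilesPipeline': 1}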
Using urllib2 I can tell whether a link points to an installer: in that case I get "application" in the content_type:
>>> import urllib2
>>> url = 'http://hotdownloads.com/trialware/download/Download_a1keylogger.zip?item=33649-3&affiliate=22260'
>>> response = urllib2.urlopen(url)
>>> content_type = response.info().get('Content-Type')
>>> print content_type
application/zip
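Based on that, this is the kind of helper I would reuse inside the spider (a sketch; is_installer is just my name for it):

import urllib2

def is_installer(url):
    """Return True when the URL serves an application/* payload."""
    response = urllib2.urlopen(url)
    # reading the headers is enough; the body is never consumed
    content_type = response.info().get('Content-Type', '')
    response.close()
    return content_type.startswith('application')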
My question is: how do I collect the installer links I want and download them into my target folder? Thanks in advance!
PS:
I have found two approaches so far, but I cannot get either of them to work:
1. https://stackoverflow.com/a/7169241/2092480, I followed this answer by adding the following code to the spider:
def parse_installer(self, response):
    # extract links
    lx = SgmlLinkExtractor()
    urls = lx.extract_links(response)
    for url in urls:
        yield Request(url, callback=self.save_installer)

def save_installer(self, response):
    path = self.get_path(response.url)
    with open(path, "wb") as f:  # or using wget
        f.write(response.body)
The spider just behaves as if this code weren't there, and no files get downloaded. Can anyone see where it went wrong?
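My current guess at what a working version of method 1 would look like (only a sketch of my understanding: the Rule has to register the callback or CrawlSpider never invokes parse_installer, extract_links() returns Link objects rather than strings, the installer hosts live outside allowed_domains so the requests need dont_filter=True to survive the offsite filter, and get_path is a helper I made up):

import os
import urlparse
from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    name = "softpedia"
    allowed_domains = ["www.softpedia.com"]
    start_urls = ["http://win.softpedia.com/"]
    download_delay = 2

    # without callback=... the two methods below are never invoked
    rules = [
        Rule(SgmlLinkExtractor(), callback="parse_installer", follow=True),
    ]

    def parse_installer(self, response):
        for link in SgmlLinkExtractor().extract_links(response):
            # extract_links() yields Link objects, so pass link.url;
            # dont_filter=True keeps OffsiteMiddleware from dropping
            # requests to hosts like hotdownloads.com
            yield Request(link.url, callback=self.save_installer,
                          dont_filter=True)

    def save_installer(self, response):
        # keep only binary payloads, mirroring the urllib2 check above
        if response.headers.get('Content-Type', '').startswith('application'):
            path = self.get_path(response.url)
            with open(path, "wb") as f:
                f.write(response.body)

    def get_path(self, url):
        # hypothetical helper: build a local filename from the URL path
        name = os.path.basename(urlparse.urlparse(url).path) or 'index'
        return os.path.join('/home/test/softpedia/downloads', name)

Is this roughly the right direction?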
2. https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ, this approach works by itself when I supply predefined links in ["file_urls"]. But how do I set up scrapy to collect all the installer links into ["file_urls"]? Besides, I would think the first method above should be enough for such a simple task.
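In case it helps, here is what I imagine the item-pipeline route would need (a sketch only: InstallerItem is a name I made up, is_installer is the urllib2 helper from above, and the FilesPipeline from the settings.py sketch would do the actual downloading into FILES_STORE):

# items.py (sketch)
from scrapy.item import Item, Field

class InstallerItem(Item):
    file_urls = Field()  # FilesPipeline reads the URLs to fetch from here
    files = Field()      # FilesPipeline fills this in after downloading

# in the spider (sketch): collect candidate links into file_urls
def parse_installer(self, response):
    for link in SgmlLinkExtractor().extract_links(response):
        if is_installer(link.url):  # hypothetical helper defined earlier
            item = InstallerItem()
            item["file_urls"] = [link.url]
            yield item

If I understand the pipeline correctly, it would also take care of fetching the off-site download hosts, since media requests issued by an item pipeline do not pass through the spider middlewares (and hence not through the offsite filter).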