I'm using Scrapy's CrawlSpider class to iterate over the list of start_urls and crawl each site's internal pages for email addresses. For every start_url I want to export one file containing a single (unique) item together with the list of matched emails. To do that I override the make_requests_from_url and parse methods so that each start_url's item can be passed along to the internal pages via the response's meta dict (see code). Running this code produces:
www.a.com,['webmaster@a.com']
www.a.com,['webmaster@a.com','info@a.com']
www.a.com,['webmaster@a.com','info@a.com','admin@a.com']
However, I only want the export file to contain the last entry from the output above:
(www.a.com,['admin@a.com', 'webmaster@a.com', 'info@a.com'])
Is that possible?
Code:
    from scrapy.spiders import CrawlSpider
    from scrapy import Request

    class MySpider(CrawlSpider):
        start_urls = [... urls list ...]

        def parse(self, response):
            # Propagate the per-start_url item to every internal-page request
            for request_or_item in CrawlSpider.parse(self, response):
                if isinstance(request_or_item, Request):
                    request_or_item.meta.update(dict(url_item=response.meta['url_item']))
                yield request_or_item

        def make_requests_from_url(self, url):
            # Create a unique item for each url. Append emails to this item from internal pages
            url_item = MyItem()
            url_item["url"] = url
            url_item["emails"] = []
            return Request(url, dont_filter=True, meta={'url_item': url_item})

        def parse_page(self, response):
            url_item = response.meta["url_item"]
            url_item["emails"].append(** some regex of emails from the response object **)
            return url_item
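One common way to get only the final row per start_url is to defer the export: let the spider yield its partial items as it does now, and have an item pipeline merge them and write a single row per URL in close_spider. The merge step can be sketched in plain Python; merge_items and the sample rows below are illustrative, not part of Scrapy's API:

```python
def merge_items(items):
    """Collapse partial (url, emails) items into one item per url,
    keeping the union of all emails seen for that url."""
    merged = {}
    for item in items:
        merged.setdefault(item["url"], set()).update(item["emails"])
    # Sort emails so the exported row is deterministic
    return [{"url": url, "emails": sorted(emails)}
            for url, emails in merged.items()]

# Partial items as the spider above would yield them (sample data)
partial_items = [
    {"url": "www.a.com", "emails": ["webmaster@a.com"]},
    {"url": "www.a.com", "emails": ["webmaster@a.com", "info@a.com"]},
    {"url": "www.a.com", "emails": ["webmaster@a.com", "info@a.com", "admin@a.com"]},
]
print(merge_items(partial_items))
# → [{'url': 'www.a.com', 'emails': ['admin@a.com', 'info@a.com', 'webmaster@a.com']}]
```

In a real pipeline you would accumulate items in process_item and run this merge once in close_spider before writing the file.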