I am a C/C++ programmer with limited Python experience, mostly from plotting and text processing. I am currently working on a personal data-analysis project in which I use Scrapy to crawl all of the threads and user information from a forum.
I have put together some initial code that is meant to log in first and then, starting from the index page of a sub-forum, do the following:
1) extract all thread links whose href contains "topic"
2) for now, save each page to a file (item extraction will be added once the whole pipeline works; a rough sketch of the item I have in mind follows this list)
3) find the next-page link inside the tag with class="next", go to the next page, and repeat 1) and 2)
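The item itself is not written yet; this is only a rough sketch of what I plan to fill in later, with placeholder field names:

from scrapy.item import Item, Field

class ThreadPostItem(Item):
    # placeholder fields; the final set depends on what the thread pages expose
    title = Field()
    url = Field()
    author = Field()
    content = Field()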
I know that for each thread I will still have to go through all of the pages that hold the reply posts, but I plan to add that once my current code works correctly.
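That part does not exist yet either; what I have in mind is roughly the following (untested, and it assumes the reply pages paginate through the same li with class="next" as the index pages):

    def parse_posts(self, response):
        # ... save the page as in the current code ...
        # then follow the thread's own "next" link if there are more reply pages
        sel = Selector(response)
        next_page = sel.xpath('//li[@class="next"]/a/@href').extract()
        if next_page:
            yield Request(next_page[0], callback=self.parse_posts, meta=response.meta)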
However, my current code only extracts the threads on the starting URL and then stops. I have searched for several hours without finding a solution, so I am asking here in the hope that someone with Scrapy experience can help. If you need any other information, such as the output, please let me know. Thanks!
Update regarding Paul's reply: I have updated my code. The problem is in my link extractor, which I still need to fix; apart from that, the rule now works. Thanks again to Paul for the help.
Here is my current spider code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request, FormRequest
from scrapy.selector import Selector
class ZhuaSpider(CrawlSpider):
    name = 'zhuaspider'
    allowed_domains = ['depressionforums.org']
    login_page = 'http://www.domain.com/forums/index.php?app=core&module=global&section=login'
    start_urls = ['http://www.depressionforums.org/forums/forum/12-depression-central/']

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="next"]',), unique=True),
             callback='parse_links',
             follow=True),
    )
    def start_requests(self):
        """Called before crawling starts. Try to log in."""
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True)
    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                                         formdata={'ips_username': 'myuid', 'ips_password': 'mypwd'},
                                         callback=self.check_login_response)
    def check_login_response(self, response):
        """Check the response returned by a login request to see
        if we are successfully logged in."""
        if "Username or password incorrect" in response.body:
            self.log("Login failed.")
        else:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin.
            for url in self.start_urls:
                # explicitly ask Scrapy to run the responses through the rules
                yield Request(url, callback=self.parse)
    def parse_links(self, response):
        hxs = Selector(response)
        links = hxs.xpath('//a[contains(@href, "topic")]')
        for link in links:
            title = ''.join(link.xpath('./@title').extract())
            url = ''.join(link.xpath('./@href').extract())
            meta = {'title': title}
            yield Request(url, callback=self.parse_posts, meta=meta)
    # If I add this line it will only crawl the starting url,
    # otherwise it still won't apply the rule and crawls nothing.
    parse_start_url = parse_links
    def parse_posts(self, response):
        # the download/ directory must already exist
        filename = 'download/' + response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
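For reference, I run the spider from the project directory with the usual "scrapy crawl zhuaspider" command; the download/ directory has to exist beforehand, otherwise parse_posts fails when opening the file.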