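Facebook's new Graph Search lets you search for a company's current employees using a query token, for example current Google employees.
I want to scrape the results page (http://www.facebook.com/search/104958162837/employees/present) with Scrapy.
The initial problem was that Facebook only lets logged-in Facebook users access this information, so I was redirected to login.php. So before scraping this URL, I log in through Scrapy and then request the results page. But even though the HTTP response for this page is 200, it does not scrape any data. The code is as follows: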
from scrapy import log
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest, Request
from scrapy.selector import HtmlXPathSelector

# Results page to request after logging in (assumed to be the URL given above).
query = 'http://www.facebook.com/search/104958162837/employees/present'


class DmozSpider(BaseSpider):
    name = "test"
    start_urls = ['https://www.facebook.com/login.php']
    task_urls = [query]

    def parse(self, response):
        # Fill in and submit the login form found on login.php.
        return [FormRequest.from_response(
            response,
            formname='login_form',
            formdata={'email': 'myemailid', 'pass': 'myfbpassword'},
            callback=self.after_login)]

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        # Login appears to have worked: now request the search results page.
        return Request(query, callback=self.page_parse)

    def page_parse(self, response):
        hxs = HtmlXPathSelector(response)
        print hxs
        # Select the result entries by their class attribute.
        items = hxs.select('//div[@class="_4_yl"]')
        count = 0
        print items
What might I be missing or doing wrong?
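One way to narrow this down is to check whether the `_4_yl` markup exists as real elements in the HTML that comes back, or only as raw text (for example inside an HTML comment or a script). That the results page behaves this way is an assumption, not something confirmed here. The sketch below uses only the Python 3 standard library; `ClassCounter`, `diagnose`, and the `sample` string are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count real elements whose class attribute contains a given token."""

    def __init__(self, token):
        super().__init__()
        self.token = token
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or valueless, hence the `or ""`.
        classes = (dict(attrs).get("class") or "").split()
        if self.token in classes:
            self.count += 1


def diagnose(html, token):
    """Return (elements_in_dom, raw_occurrences) for a class token."""
    counter = ClassCounter(token)
    counter.feed(html)
    counter.close()
    return counter.count, html.count(token)


# Hypothetical sample: the markup exists only inside an HTML comment,
# so a parser sees no matching element even though the text is present.
sample = '<html><body><code><!-- <div class="_4_yl">x</div> --></code></body></html>'
dom_hits, raw_hits = diagnose(sample, "_4_yl")
print(dom_hits, raw_hits)
```

If the first number is zero while the second is not, the selector is probably fine and the content is simply not present as parseable elements in the response. If both are zero, the results are likely loaded separately from the initial page. In the spider this could be called on the decoded response body inside `page_parse`.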