How do I follow each href with Scrapy? I only know how to dump everything that's on the page, but I want to actually go into each link. This is data from our intranet, so you won't be able to reach the links yourself. Also, how do I control the formatting of the date when the data is written out to a file? Do I need to add a list of URLs to start_urls? Do I need to change my InitSpider to a CrawlSpider?
<row>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">14256238845</cell>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100">353918053831794</cell>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100">310260548400764</cell>
<cell type="href" href="/dis/packages.jsp?view=timeline&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100&date=20130423T020032243">2013-04-23 02:00:32.243</cell>
<cell type="plain">2013-04-23 02:00:32.243</cell>
<cell type="plain">3 - PackageCreation</cell>
<cell type="href" href="/dis/profile_download?profileId=400006">400006</cell>
<cell type="href" href="/dis/sessions.jsp?view=list&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">view sessions</cell>
<cell type="href" href="/dis/errors_agg.jsp?view=list&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">view errors</cell>
</row>
This is what I have so far; it prints everything:
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import XmlXPathSelector
from carrier.items import CarrierItem

class CarrierSpider(InitSpider):
    name = 'dis'
    allowed_domains = ['qvpweb01.ciq.labs.att.com']
    login_page = 'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'
    start_urls = ["https://qvpweb01.ciq.labs.att.com:8080/dis/"]

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'txtUserName': 'myuser', 'txtPassword': 'xxxx'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in."""
        if "logout" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin...
            return self.initialized()
        else:
            self.log("\n\n\nLogin failed, bad password :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        xhs = XmlXPathSelector(response)
        columns = xhs.select('//table[3]/row/cell')
        for column in columns:
            item = CarrierItem()
            item['title'] = column.select('.//text()').extract()
            item['link'] = column.select('.//@href').extract()
            yield item
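To actually go into each link, this is roughly what I'm imagining (an untested sketch that stays with InitSpider; parse_detail is a made-up callback name, and I'm joining the relative hrefs against the response URL):

import urlparse

# replacement parse() inside CarrierSpider
def parse(self, response):
    xhs = XmlXPathSelector(response)
    for href in xhs.select('//table[3]/row/cell/@href').extract():
        # follow each cell's link instead of only yielding its text
        yield Request(urlparse.urljoin(response.url, href),
                      callback=self.parse_detail)

def parse_detail(self, response):
    # placeholder: extract whatever fields the linked page exposes
    pass

Is that the right approach, or is this exactly the case where I should switch to a CrawlSpider with Rule and SgmlLinkExtractor?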
This is the output I get in the CSV file:
14256238845
3.53918E+14
3.10261E+14
00:32.2
00:32.2
3 - PackageCreation
400006
view sessions
view errors
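I'm wondering whether the file itself might actually be fine and it's just the spreadsheet re-rendering the long numbers and the timestamp. A quick way I could check the raw values, assuming the feed was exported to a file named output.csv:

import csv

with open('output.csv', 'rb') as f:  # Python 2: open in binary mode for csv
    for row in csv.reader(f):
        print row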
This is the CSV output I would like instead:
14256238845
353918053831794
310260548400764
2013-04-23 02:00:32.243
2013-04-23 02:00:32.243
3 - PackageCreation
400006
view sessions
view errors
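For the date question specifically, would reformatting the cell text inside the spider before yielding the item be the right place? A minimal sketch of what I mean (the target format at the end is made up):

from datetime import datetime

raw = '2013-04-23 02:00:32.243'
# %f picks up the fractional seconds (.243)
dt = datetime.strptime(raw, '%Y-%m-%d %H:%M:%S.%f')
print dt.strftime('%Y-%m-%d %H:%M:%S')  # hypothetical target format, drops the millis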