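I am using Scrapy with plain lxml. How can I get my code to produce a clean CSV list of items, each with a link and a title? I need the code below to produce that list. I see output in the log, but nothing ends up in my CSV file.
This is the output I want: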
title                      link
16784478837                (no link)
351746052167577            (no link)
_CC1                       metriclog.jsp?PKG_GID=40BA3ADF929DF82FEAC4A69279D48BFD&view=list
2013-04-08 18:37:05.366    (no link)
2013-04-08 18:37:09.144    (no link)
SMS_PullRequest_CS
400006                     /dis/profile_download?profileId=400006
6141                       (no link)
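To make the target format concrete, here is a minimal sketch of how records with a `title` and a `link` could be turned into the CSV layout above using only the standard `csv` module. The `records` list and the `records_to_csv` helper are hypothetical names introduced for illustration, the sample values are copied from the desired output, and the sketch is written for Python 3 (the spider in this question targets Python 2). It is not tied to Scrapy's own export machinery.

```python
import csv
import io

# Hypothetical records shaped like the items the spider is meant to yield.
records = [
    {"title": "16784478837", "link": None},
    {"title": "_CC1",
     "link": "metriclog.jsp?PKG_GID=40BA3ADF929DF82FEAC4A69279D48BFD&view=list"},
    {"title": "400006", "link": "/dis/profile_download?profileId=400006"},
]


def records_to_csv(records, no_link="(no link)"):
    """Render title/link records as CSV text, one row per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "link"],
                            lineterminator="\n")
    writer.writeheader()
    for record in records:
        # A missing link (None) is written as a readable placeholder.
        writer.writerow({"title": record["title"],
                         "link": record["link"] or no_link})
    return buffer.getvalue()


csv_text = records_to_csv(records)
print(csv_text)
```

Rows whose `title` is `None`, like the ones in the log further down, would come out with an empty first column, so the titles have to be populated before any CSV export can look like the table above.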
Code
class CarrierSpider(CrawlSpider):
    name = 'dis'
    allowed_domains = ['qvpweb01.ciq.labs.att.com']
    login_page = 'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'
    start_urls = ["https://qvpweb01.ciq.labs.att.com:8080/dis/"]
    Rule(SgmlLinkExtractor(), follow=True)

    def start_requests(self):
        ...

    def login(self, response):
        ...

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are successfully logged in."""
        if "logout" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            self.log('this the login response! %s' % response.url)
            return Request(url="https://qvpweb01.ciq.labs.att.com:8080/dis/", callback=self.parse_device_list, dont_filter=True)
        else:
            self.log("\n\n\nFailed, Bad password :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_device_list(self, response):
        self.log("\n\n\n List of devices \n\n\n")
        self.log('Hi, this is the parse_device_list page! %s' % response.url)
        root = lxml.etree.fromstring(response.body)
        for row in root.xpath('//row'):
            allcells = row.xpath('./cell')
            # first cell contains the link to follow
            detail_page_link = allcells[0].get("href")
            yield Request(urlparse.urljoin(response.url, detail_page_link), callback=self.parse_page)

    def parse_page(self, response):
        self.log("\n\n\n Page for one device \n\n\n")
        self.log('Hi, this is the parse_page page! %s' % response.url)
        root = lxml.etree.fromstring(response.body)
        for row in root.xpath('//row'):
            allcells = row.xpath('./cell')
            # ... populate Items
            for cells in allcells:
                item = CiqdisItem()
                item['title'] = cells.get('.//text()')
                item['link'] = cells.get("href")
                yield item
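One detail in `parse_page` that may explain the `None` titles: on an lxml element, `.get(name)` looks up an attribute, so `cells.get('.//text()')` searches for an attribute literally named `.//text()` and returns `None`. The sketch below shows the difference on a trimmed, hypothetical sample row. It uses the standard library's `xml.etree.ElementTree` as a stand-in so that it runs without lxml installed; `.get()` and `.itertext()` are also available on lxml elements, where `cell.xpath('.//text()')` would return the text nodes as a list instead. The `SAMPLE` string and the `cell_to_record` helper are names assumed for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample modelled on the <row> shown in this question.
SAMPLE = """<row>
  <cell type="plain">16784478837</cell>
  <cell type="href" href="metriclog.jsp?PKG_GID=40BA&amp;view=list">
    _CC1
    <input type="hidden" value="40BA"/>
  </cell>
  <cell type="href" href="/dis/profile_download?profileId=400006">400006</cell>
</row>"""


def cell_to_record(cell):
    """Build a title/link record from one <cell> element."""
    # itertext() walks every text node inside the cell, including text
    # that surrounds child elements such as <input>.
    title = "".join(cell.itertext()).strip()
    # get() reads an attribute; it returns None when the attribute is absent.
    link = cell.get("href")
    return {"title": title, "link": link}


root = ET.fromstring(SAMPLE)

# Reproduces the symptom: there is no attribute called './/text()'.
attribute_lookup = root[0].get(".//text()")

records = [cell_to_record(cell) for cell in root.iter("cell")]
for record in records:
    print(record)
```

In the spider itself, the equivalent change would be to build `item['title']` from the cell's text nodes rather than from `.get()`, while `item['link'] = cells.get("href")` can stay as it is.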
Output
https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp> (referer: None)
2013-07-23 11:13:49-0500 [dis] DEBUG: Redirecting (302) to <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> from <POST https://qvpweb01.ciq.labs.att.com:8080/dis/login>
2013-07-23 11:13:50-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp)
2013-07-23 11:13:50-0500 [dis] DEBUG:
Successfully logged in. Let's start crawling!
2013-07-23 11:13:50-0500 [dis] DEBUG: this the login response! https://qvpweb01.ciq.labs.att.com:8080/dis/
....
....
2013-07-23 11:13:50-0500 [dis] DEBUG:
List of devices
2013-07-23 11:13:50-0500 [dis] DEBUG: Hi, this is the parse_device_list page! https://qvpweb01.ciq.labs.att.com:8080/dis/
2013-07-23 11:13:52-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=454D4F5864424F37575938666C4678522B56583947673D3D6A6139567256533863306C4D457269355A6239434A673D3D&hwdid=012615000163791&mdn=&subscrbid=310410394400380&maxlength=100> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/)
2013-07-23 11:13:52-0500 [dis] DEBUG:
Page for one device
2013-07-23 11:13:52-0500 [dis] DEBUG: Hi, this is the parse_page page! https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>
{'link': None, 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>
{'link': None, 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>
{'link': None, 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>
{'link': None, 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>
{'link': 'metriclog.jsp?PKG_GID=ECC023219F087F9335A6547374DCF7AC&view=list',
'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>
{'link': '/dis/packages.jsp?show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&triggerfilter=&maxlength=100&view=timeline&date=20130408T182947843',
'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>
{'link': None, 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>
{'link': None, 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>
{'link': None, 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>
{'link': None, 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>
{'link': '/dis/profile_download?profileId=400006', 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>
{'link': None, 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/)
2013-07-23 11:13:52-0500 [dis] DEBUG:
XML of the sub-page handled by the parse_page() method
<row>
<cell type="html">
<input type="checkbox" name="40BA3ADF929DF82FEAC4A69279D48BFD" value="40BA3ADF929DF82FEAC4A69279D48BFD" onclick="if(typeof(selectPkg)=='function')selectPkg(this);">
</cell>
<cell type="plain" style="width: 50px; white-space: nowrap;" visible="false">http://qvpweb01.ciq.labs.att.com:8080/dis/metriclog.jsp?PKG_GID=40BA3ADF929DF82FEAC4A69279D48BFD&view=list</cell>
<cell type="plain">16784478837</cell>
<cell type="plain">351746052167577</cell>
<cell type="href" style="width: 50px; white-space: nowrap;" href="metriclog.jsp?PKG_GID=40BA3ADF929DF82FEAC4A69279D48BFD&view=list">
_CC1
<input id="savePage_40BA3ADF929DF82FEAC4A69279D48BFD" type="hidden" value="40BA3ADF929DF82FEAC4A69279D48BFD">
</cell>
<cell type="href" href="/dis/packages.jsp?show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&triggerfilter=&maxlength=100&view=timeline&date=20130408T183705366" style="white-space: nowrap;">2013-04-08 18:37:05.366</cell>
<cell type="plain" style="width: 50px; white-space: nowrap;"></cell>
<cell type="plain" style="white-space: nowrap;"></cell>
<cell type="plain" style="white-space: nowrap;">2013-04-08 18:37:09.144</cell>
<cell type="plain" style="width: 70px; white-space: nowrap;">1 - SMS_PullRequest_CS</cell>
<cell type="href" style="width: 50px; white-space: nowrap;" href="/dis/profile_download?profileId=400006">400006</cell>
<cell type="plain" style="width: 50px; white-space: nowrap;">6141</cell>
</row>