
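I am using Scrapy with plain lxml. How can I get my spider to write a clean CSV list of items, each with a link and a title? I need the output to look like the table below. I see various output in the log, but the title and link values never end up in my CSV file.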

This is the output I want:

title                       link

16784478837                 (no link)
351746052167577             (no link)
_CC1                        metriclog.jsp?PKG_GID=40BA3ADF929DF82FEAC4A69279D48BFD&view=list
2013-04-08 18:37:05.366     (no link)
2013-04-08 18:37:09.144     (no link)
SMS_PullRequest_CS
400006                      /dis/profile_download?profileId=400006
6141                        (no link)
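
For reference, the target layout above could be produced from a list of title/link pairs with the standard `csv` module. This is only a sketch: the helper name `write_items_csv` and the `missing` placeholder argument are made up for illustration, and plain dicts stand in for the spider's items. Scrapy can also export items to CSV itself through the `-o` option of `scrapy crawl`, in which case no hand-written writer is needed.

```python
import csv
import io


def write_items_csv(items, fileobj, missing="(no link)"):
    """Write one title/link row per item (hypothetical helper).

    `missing` is the text written when an item has no link.
    """
    writer = csv.writer(fileobj)
    writer.writerow(["title", "link"])
    for item in items:
        writer.writerow([item.get("title") or "", item.get("link") or missing])


# Plain dicts standing in for scraped items.
items = [
    {"title": "16784478837", "link": None},
    {"title": "400006", "link": "/dis/profile_download?profileId=400006"},
]

buf = io.StringIO()  # an open file object would work the same way
write_items_csv(items, buf)
csv_text = buf.getvalue()
```

The writer only reproduces what the items contain, so the titles and links have to be extracted correctly first.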

Code

class CarrierSpider(CrawlSpider):
    name = 'dis'
    allowed_domains = ['qvpweb01.ciq.labs.att.com']
    login_page = 'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'
    start_urls = ["https://qvpweb01.ciq.labs.att.com:8080/dis/"]

    Rule(SgmlLinkExtractor(), follow=True)

    def start_requests(self):
        ...


    def login(self, response):
        ...

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are successfully logged in."""
        if "logout" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            self.log('this the login response! %s' % response.url)

            return Request(url="https://qvpweb01.ciq.labs.att.com:8080/dis/", callback=self.parse_device_list, dont_filter=True)

        else:
            self.log("\n\n\nFailed, Bad password :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.


    def parse_device_list(self, response):
        self.log("\n\n\n List of devices \n\n\n")
        self.log('Hi, this is the parse_device_list page! %s' % response.url)
        root = lxml.etree.fromstring(response.body)
        for row in root.xpath('//row'):
            allcells = row.xpath('./cell')
            # the first cell contains the link to follow
            detail_page_link = allcells[0].get("href")
            yield Request(urlparse.urljoin(response.url, detail_page_link ), callback=self.parse_page)

    def parse_page(self, response):
        self.log("\n\n\n Page for one device \n\n\n")
        self.log('Hi, this is the parse_page page! %s' % response.url)
        root = lxml.etree.fromstring(response.body)
        for row in root.xpath('//row'):
            allcells = row.xpath('./cell')
            #... populate Items
        for cells in allcells:
            item = CiqdisItem()
            item['title'] = cells.get('.//text()')
            item['link'] = cells.get("href")
            yield item
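
Two things in `parse_page()` probably explain the `'title': None` entries in the log. First, an element's `get()` reads an attribute by name, so `cells.get('.//text()')` looks for an attribute literally called `.//text()` and returns None. Second, the `for cells in allcells` loop sits outside the `for row` loop, so it only sees the cells of the last row. The sketch below shows one way to read both values. The sample XML is a shortened, well-formed stand-in for the real page, and `extract_cells` is a hypothetical helper. It falls back to the standard library when lxml is not installed, since both expose the same element API for the calls used here.

```python
try:
    from lxml import etree  # the parser the spider uses
except ImportError:
    import xml.etree.ElementTree as etree  # same element API for this sketch

# Shortened, well-formed stand-in for the sub-page XML (assumption).
SAMPLE = b"""<rows>
  <row>
    <cell type="plain">16784478837</cell>
    <cell type="href" href="metriclog.jsp?PKG_GID=40BA&amp;view=list">
      _CC1
      <input type="hidden" value="40BA"/>
    </cell>
    <cell type="href" href="/dis/profile_download?profileId=400006">400006</cell>
  </row>
</rows>"""


def extract_cells(xml_bytes):
    """Return one {'title', 'link'} dict per cell of every row."""
    root = etree.fromstring(xml_bytes)
    items = []
    for row in root.iter("row"):
        # The cell loop is nested, so every row is processed.
        for cell in row.findall("cell"):
            # itertext() walks the cell's text nodes; get() reads attributes only.
            title = "".join(cell.itertext()).strip()
            items.append({"title": title or None, "link": cell.get("href")})
    return items


cells = extract_cells(SAMPLE)
first_cell = etree.fromstring(SAMPLE).find(".//cell")
```

Inside the spider, the same nested loop would fill a `CiqdisItem` and `yield` it instead of appending to a list.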

Output

https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp> (referer: None)
2013-07-23 11:13:49-0500 [dis] DEBUG: Redirecting (302) to <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> from <POST https://qvpweb01.ciq.labs.att.com:8080/dis/login>
2013-07-23 11:13:50-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp)
2013-07-23 11:13:50-0500 [dis] DEBUG: 


    Successfully logged in. Let's start crawling!



2013-07-23 11:13:50-0500 [dis] DEBUG: this the login response! https://qvpweb01.ciq.labs.att.com:8080/dis/
....
....
2013-07-23 11:13:50-0500 [dis] DEBUG: 


     List of devices 



2013-07-23 11:13:50-0500 [dis] DEBUG: Hi, this is the parse_device_list page! https://qvpweb01.ciq.labs.att.com:8080/dis/
2013-07-23 11:13:52-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=454D4F5864424F37575938666C4678522B56583947673D3D6A6139567256533863306C4D457269355A6239434A673D3D&hwdid=012615000163791&mdn=&subscrbid=310410394400380&maxlength=100> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/)
2013-07-23 11:13:52-0500 [dis] DEBUG: 


     Page for one device 



2013-07-23 11:13:52-0500 [dis] DEBUG: Hi, this is the parse_page page! https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>

    {'link': None, 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>

    {'link': None, 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>

    {'link': None, 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>

    {'link': None, 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>

    {'link': 'metriclog.jsp?PKG_GID=ECC023219F087F9335A6547374DCF7AC&view=list',
     'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>

    {'link': '/dis/packages.jsp?show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&triggerfilter=&maxlength=100&view=timeline&date=20130408T182947843',
     'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>

    {'link': None, 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>

    {'link': None, 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>

    {'link': None, 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>

    {'link': None, 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>

    {'link': '/dis/profile_download?profileId=400006', 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Scraped from <200 https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&maxlength=100>

    {'link': None, 'title': None}
2013-07-23 11:13:52-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/)
2013-07-23 11:13:52-0500 [dis] DEBUG: 

XML of the sub-page handled by parse_page()

<row>
<cell type="html">
<input type="checkbox" name="40BA3ADF929DF82FEAC4A69279D48BFD" value="40BA3ADF929DF82FEAC4A69279D48BFD" onclick="if(typeof(selectPkg)=='function')selectPkg(this);">
</cell>
<cell type="plain" style="width: 50px; white-space: nowrap;" visible="false">http://qvpweb01.ciq.labs.att.com:8080/dis/metriclog.jsp?PKG_GID=40BA3ADF929DF82FEAC4A69279D48BFD&amp;view=list</cell>
<cell type="plain">16784478837</cell>
<cell type="plain">351746052167577</cell>
<cell type="href" style="width: 50px; white-space: nowrap;" href="metriclog.jsp?PKG_GID=40BA3ADF929DF82FEAC4A69279D48BFD&view=list">
_CC1
<input id="savePage_40BA3ADF929DF82FEAC4A69279D48BFD" type="hidden" value="40BA3ADF929DF82FEAC4A69279D48BFD">
</cell>
<cell type="href" href="/dis/packages.jsp?show=perdevice&device_gid=4C416E6335324E5758426D54587849646677435078773D3D38566A774A6A72787869754D432F4B55315A30466D773D3D&hwdid=351746052167577&mdn=16784478837&subscrbid=310410364765360&triggerfilter=&maxlength=100&view=timeline&date=20130408T183705366" style="white-space: nowrap;">2013-04-08 18:37:05.366</cell>
<cell type="plain" style="width: 50px; white-space: nowrap;"></cell>
<cell type="plain" style="white-space: nowrap;"></cell>
<cell type="plain" style="white-space: nowrap;">2013-04-08 18:37:09.144</cell>
<cell type="plain" style="width: 70px; white-space: nowrap;">1 - SMS_PullRequest_CS</cell>
<cell type="href" style="width: 50px; white-space: nowrap;" href="/dis/profile_download?profileId=400006">400006</cell>
<cell type="plain" style="width: 50px; white-space: nowrap;">6141</cell>
</row>  
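
The `href` values in this XML are relative: some are relative to the current page (`metriclog.jsp?...`) and some are rooted at the host (`/dis/profile_download?...`). If absolute URLs are wanted in the CSV, they can be resolved against the page URL the same way `parse_device_list()` already does with `urljoin`. A small sketch follows; `absolute_link` is a hypothetical helper and `PAGE_URL` is a shortened version of the page URL from the log.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as in the spider

# Shortened stand-in for the response URL seen in the log (assumption).
PAGE_URL = "https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list"


def absolute_link(page_url, href):
    """Resolve href against page_url; keep None for cells without a link."""
    return urljoin(page_url, href) if href else None


relative = absolute_link(
    PAGE_URL, "metriclog.jsp?PKG_GID=40BA3ADF929DF82FEAC4A69279D48BFD&view=list"
)
rooted = absolute_link(PAGE_URL, "/dis/profile_download?profileId=400006")
missing = absolute_link(PAGE_URL, None)
```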