0

您好,我正在尝试获取类 listCell 的标题和文本的 xpath。我相信我做得对,因为我没有收到任何错误,但是当我在 csv 文件中显示它时,我在输出文件中什么也没有。我还在亚马逊等其他网站上测试了我的scrapy,它运行良好,但不适用于这个网站。请帮忙!!

    def parse(self, response):
    self.log("\n\n\n We got data! \n\n\n")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//form[@id=\'listForm\']/table/tbody/tr')
    items = []
    for site in sites:
        item = CarrierItem()
        item['title'] = site.select('.//td[@class\'listCell\']/a/text()').extract()
        item['link'] = site.select('.//td[@class\'listCell\']/a/@href').extract()
        items.append(item)
    return items

这是我的html。有没有可能它不能工作,因为它在 html 中有 javascript?

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title> Carrier IQ DIS 2.4 :: All Devices</title>
<script type="text/javascript" src="/dis/js/main.js">
<script type="text/javascript" src="/dis/js/validate.js">
<link rel="stylesheet" type="text/css" href="/dis/css/portal.css">
<link rel="stylesheet" type="text/css" href="/dis/css/style.css">
<script type="text/javascript">

    ....

<form id="listForm" name="listForm" method="POST" action="">
<table>
<thead>
<tbody>
<tr>
<td class="crt">1</td>
<td class="listCell" align="center">
<a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&mdn=6505550000&subscrbid=6505550000&maxlength=100">6505550000</a>
</td>
<td class="listCell" align="center">
<a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&subscrbid=6505550000&mdn=6505550000&maxlength=100">probe0</a>
</td>
<td class="listCell" align="center">
<td class="listCell" align="center">
<td class="cell" align="center">2013-07-01 13:39:38.820</td>
<td class="cell" align="left">1 - SMS_PullRequest_CS</td>
<td class="listCell" align="right">
<td class="listCell" align="center">
<td class="listCell" align="center">
</tr>
</tbody>
</table>
</form>

输出

    C:\Users\ye831c\Documents\Big Data\Scrapy\carrier>scrapy crawl dis -o iqDis.csv
-t csv
2013-07-01 10:50:18-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: carrier)
2013-07-01 10:50:18-0500 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogSt
ats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAut
hMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, De
faultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMi
ddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMi
ddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddle
ware
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-07-01 10:50:19-0500 [dis] INFO: Spider opened
2013-07-01 10:50:19-0500 [dis] INFO: Crawled 0 pages (at 0 pages/min), scraped 0
 items (at 0 items/min)
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602
3
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-01 10:50:19-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.la
bs.att.com:8080/dis/login.jsp> (referer: None)
2013-07-01 10:50:19-0500 [dis] DEBUG: Redirecting (302) to <GET https://qvpweb01
.ciq.labs.att.com:8080/dis/> from <POST https://qvpweb01.ciq.labs.att.com:8080/d
is/login>
2013-07-01 10:50:20-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.la
bs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/login
.jsp)
2013-07-01 10:50:20-0500 [dis] DEBUG:


    Successfully logged in. Let's start crawling!



2013-07-01 10:50:21-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.la
bs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/)
2013-07-01 10:50:21-0500 [dis] DEBUG:


     We got data!



2013-07-01 10:50:21-0500 [dis] INFO: Closing spider (finished)
2013-07-01 10:50:21-0500 [dis] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 1382,
     'downloader/request_count': 4,
     'downloader/request_method_count/GET': 3,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 147888,
     'downloader/response_count': 4,
     'downloader/response_status_count/200': 3,
     'downloader/response_status_count/302': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 7, 1, 15, 50, 21, 221000),
     'log_count/DEBUG': 12,
     'log_count/INFO': 4,
     'request_depth_max': 2,
     'response_received_count': 3,
     'scheduler/dequeued': 4,
     'scheduler/dequeued/memory': 4,
     'scheduler/enqueued': 4,
     'scheduler/enqueued/memory': 4,
     'start_time': datetime.datetime(2013, 7, 1, 15, 50, 19, 42000)}
2013-07-01 10:50:21-0500 [dis] INFO: Spider closed (finished)
4

1 回答 1

0

尝试简化您的 XPath:

sites = hxs.select('//form[@id="listForm"]//tr')

由于tbody元素(在某些情况下)不存在于 HTML 中,而是由您的浏览器生成。

于 2013-07-02T08:35:06.270 回答