I'm new to Python and I'm trying to scrape data from the Yellow Pages. The scrape runs, but the output I get is messy.
This is the result I get:
2013-03-24 20:26:47+0800 [scrapy] INFO: Scrapy 0.14.4 started (bot: eyp)
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware,DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled item pipelines:
2013-03-24 20:26:47+0800 [eyp] INFO: Spider opened
2013-03-24 20:26:47+0800 [eyp] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
How can I get a clean result? I only want the name, address, phone number, and link.
By the way, this is the code I'm using:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from eyp.items import EypItem


class EypSpider(BaseSpider):
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//ol[@class="result"]/li')
        items = []
        for title in titles:
            item = EypItem()
            item['title'] = title.select(".//p/text()").extract()
            item['link'] = title.select(".//a/@href").extract()
            items.append(item)
        return items
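
To make "clean" concrete: `.extract()` returns a list of strings, so each field above ends up as a list that may contain stray whitespace or empty entries. A minimal sketch of one way to collapse such a list into a single tidy string is below. The helper name `clean_field` and the sample values are hypothetical and are not part of the spider above; this assumes the mess comes from whitespace in the extracted text rather than from the log lines themselves.

```python
def clean_field(values):
    """Collapse a list of extracted strings into one tidy string.

    Hypothetical helper: `values` stands in for the list returned by
    a selector's .extract() call.
    """
    # Normalise internal whitespace (newlines, tabs, repeated spaces).
    parts = [" ".join(value.split()) for value in values]
    # Drop fragments that were only whitespace, then join the rest.
    return " ".join(part for part in parts if part)


# Made-up sample resembling raw extracted text nodes.
raw = ["\n   Acme  Hardware ", "\t", " 123 Main St.\n"]
print(clean_field(raw))
```

In the spider this might be applied as `item['title'] = clean_field(title.select(".//p/text()").extract())`, though whether that yields the name, address, and phone number separately depends on the page markup, which is not shown here.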