I am trying to use Scrapy to scrape a website whose listings are spread across several pages.
My code is:
import scrapy
from scrapy.selector import Selector
from tcgplayer1.items import Tcgplayer1Item


# BaseSpider was renamed to scrapy.Spider in later Scrapy releases
class MySpider(scrapy.Spider):
    name = "tcg"
    # allowed_domains expects bare domain names, not full URLs
    allowed_domains = ["tcgplayer.com"]
    start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]

    def parse(self, response):
        hxs = Selector(response)
        # One div per card on the current page
        titles = hxs.xpath("//div[@class='magicCard']")
        for title in titles:
            item = Tcgplayer1Item()
            item["cardname"] = title.xpath(".//li[@class='cardName']/a/text()").extract()[0]
            # Every vendor row contributes one entry to each list below
            vendor = title.xpath(".//tr[@class='vendor ']")
            item["price"] = vendor.xpath("normalize-space(.//td[@class='price']/text())").extract()
            item["quantity"] = vendor.xpath("normalize-space(.//td[@class='quantity']/text())").extract()
            item["shipping"] = vendor.xpath("normalize-space(.//span[@class='shippingAmount']/text())").extract()
            item["condition"] = vendor.xpath("normalize-space(.//td[@class='condition']/a/text())").extract()
            item["vendors"] = vendor.xpath("normalize-space(.//td[@class='seller']/a/text())").extract()
            yield item
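I want to scrape every page until the spider reaches the last one. The number of pages is not fixed (sometimes there are more pages than at other times), so I cannot tell in advance where the page numbers end.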
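One possible direction, given only as a rough sketch: since the start URL already carries a `PageNumber` query parameter, the spider could keep requesting the next number until a page comes back with no cards. The helper below (`next_page_url` is a hypothetical name, not part of Scrapy) only builds the next URL using the standard library. The stopping rule shown in the trailing comment assumes that a page past the end contains no `div.magicCard` elements, which I have not verified against the site, and it needs `import scrapy` at the top of the spider module.

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse


def next_page_url(url, param="PageNumber"):
    """Return `url` with its page-number query parameter increased by one."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    # A missing parameter is treated as page 1 (an assumption about the site).
    current = int(query.get(param, ["1"])[0])
    query[param] = [str(current + 1)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


# Inside parse(), after the loop over `titles`, one could then write:
#
#     if titles:  # an empty page is taken to mean we ran past the last one
#         yield scrapy.Request(next_page_url(response.url), callback=self.parse)
```

If the site instead repeats the last page or redirects when the number is too high, this check would never stop, so comparing the card names against the previous page or following the page's own "next" link may be a safer stopping rule.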