I am trying to build a recursive spider to extract content from a site with a specific link structure (e.g. web.com). For example:
http://web.com/location/profile/12345678?qid=1403226397.5971&source=location&rank=21
http://web.com/location/profile/98765432?qid=1403366850.3991&source=locaton&rank=1
As you can see, only the numeric parts of the URL change. I need to crawl every link that follows this URL structure and extract itemX, itemY, and itemZ.
I translated the link structure into a regular expression: '\d+?qid=\d+.\d+&source=location&rank=\d+'. The Scrapy (Python) code is below, but after I run the spider, it extracts nothing:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from web.items import webItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy import log
import re
import urllib

class web_RecursiveSpider(CrawlSpider):
    name = "web_RecursiveSpider"
    allowed_domains = ["web.com"]
    start_urls = ["http://web.com/location/profile", ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\d+?qid=\d+.\d+&source=location&rank=\d+', )),
             callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//*')
        items = []
        for site in sites:
            item = webItem()
            item["itemX"] = site.select("//span[@itemprop='X']/text()").extract()
            item["itemY"] = site.select("//span[@itemprop='Y']/text()").extract()
            item["itemZ"] = site.select("//span[@itemprop='Z']/text()").extract()
            items.append(item)
        return items
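One way to narrow this down, outside Scrapy entirely, is to test the allow pattern against one of the sample URLs with Python's re module. A minimal sketch (pattern and URL copied verbatim from above):

    import re

    # The pattern passed to SgmlLinkExtractor's allow= argument, as written.
    pattern = r'\d+?qid=\d+.\d+&source=location&rank=\d+'

    # One of the sample profile URLs from the question.
    url = 'http://web.com/location/profile/12345678?qid=1403226397.5971&source=location&rank=21'

    # '\d+?' is a lazily-quantified digit class, not "digits then a literal ?",
    # so nothing in the pattern consumes the '?' that sits between the profile
    # id and 'qid=' in the URL.
    print(re.search(pattern, url))  # prints None

If the pattern does not even match the example URLs by hand, the link extractor will never follow them, which would explain why parse_item is never called.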