I am trying to scrape a website with Scrapy, and the url of every page I want to scrape is written with this kind of relative path:
<!-- on page https://www.domain-name.com/en/somelist.html (no <base> in the <head>) -->
<a href="../../en/item-to-scrap.html">Link</a>
Now, in my browser these links work fine and take you to urls like https://www.domain-name.com/en/item-to-scrap.html (even though the relative path goes back up two levels in the hierarchy instead of one).
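For reference, this is the resolution I would expect, illustrated with urllib.parse.urljoin (this assumes Python 3.5+, where urljoin follows RFC 3986; it is only an illustration of the browser behaviour, not what my spider actually runs):

from urllib.parse import urljoin

base = "https://www.domain-name.com/en/somelist.html"
href = "../../en/item-to-scrap.html"

# The ".." segment that would climb above the site root is discarded,
# just like a browser does, so the result stays under the domain root:
print(urljoin(base, href))
# -> https://www.domain-name.com/en/item-to-scrap.html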
But my CrawlSpider does not manage to translate these urls into the "correct" ones, and all I get are errors of this kind:
2013-10-13 09:30:41-0500 [domain-name.com] DEBUG: Retrying <GET https://www.domain-name.com/../en/item-to-scrap.html> (failed 1 times): 400 Bad Request
Is there a way to fix this, or am I missing something?
Here is my spider code, which is fairly basic (based on item urls matching "/en/item-*-scrap.html"):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
class Product(Item):
    name = Field()

class siteSpider(CrawlSpider):
    name = "domain-name.com"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('')), follow=True),
    )

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        product = Product()
        product['name'] = ''
        name = x.select('//title/text()').extract()
        if type(name) is list:
            for s in name:
                if s != ' ' and s != '':
                    product['name'] = s
                    break
        return product