
I am trying to scrape a website with Scrapy, and the url of every page I want to scrape is written with this kind of relative path:

<!-- on page https://www.domain-name.com/en/somelist.html (no <base> in the <head>) -->
<a href="../../en/item-to-scrap.html">Link</a>

Now, in my browser these links work fine and you reach urls like https://www.domain-name.com/en/item-to-scrap.html (even though the relative path goes up two levels in the hierarchy instead of one).

But my CrawlSpider does not manage to translate these urls into a "correct" one, and all I get is errors of this kind:

2013-10-13 09:30:41-0500 [domain-name.com] DEBUG: Retrying <GET https://www.domain-name.com/../en/item-to-scrap.html> (failed 1 times): 400 Bad Request

Is there a way to fix this, or am I missing something?

Here is my spider's code, fairly basic (based on item urls matching "/en/item-*-scrap.html"):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Product(Item):
    name = Field()

class siteSpider(CrawlSpider):
    name = "domain-name.com"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('')), follow=True),
    )

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        product = Product()
        product['name'] = ''
        name = x.select('//title/text()').extract()
        if type(name) is list:
            for s in name:
                if s != ' ' and s != '':
                    product['name'] = s
                    break
        return product

2 Answers


Basically, scrapy uses urljoin (http://docs.python.org/2/library/urlparse.html#urlparse.urljoin) to get the next url, joining the current url with the scraped link url. If you join the urls you mentioned as an example,

<!-- on page https://www.domain-name.com/en/somelist.html -->
<a href="../../en/item-to-scrap.html">Link</a>

the returned url is the same as the one mentioned in the scrapy error. Try this in a python shell:

import urlparse
urlparse.urljoin("https://www.domain-name.com/en/somelist.html", "../../en/item-to-scrap.html")
# returns 'https://www.domain-name.com/../en/item-to-scrap.html', matching the url in the error

The urljoin behaviour seems to be valid. See: https://www.rfc-editor.org/rfc/rfc1808.html#section-5.2
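(As a side note, urllib.parse.urljoin in recent Python 3 releases follows RFC 3986 and discards the excess "..", so the same join resolves cleanly there; a quick check, assuming a Python 3 interpreter:)

from urllib.parse import urljoin
urljoin("https://www.domain-name.com/en/somelist.html", "../../en/item-to-scrap.html")
# 'https://www.domain-name.com/en/item-to-scrap.html'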

If possible, could you share the site you are crawling?

With this understanding, the solutions could be:

  1. Manipulate the urls (remove the two dots and the slash) generated in the crawl spider. Basically override parse or _requests_to_follow.

Crawl spider source: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/crawl.py

  2. Manipulate the url in a downloader middleware; this might be cleaner. You remove the ../ in the process_request of your downloader middleware (a sketch follows after this list).

Downloader middleware documentation: http://scrapy.readthedocs.org/en/0.16/topics/downloader-middleware.html

  3. Use a BaseSpider and return the manipulated url requests you want to crawl further.

BaseSpider documentation: http://scrapy.readthedocs.org/en/0.16/topics/spiders.html#basespider
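For option 2, a minimal sketch of such a middleware could look like this (the module and class names are placeholders, not part of Scrapy; you would enable it through the DOWNLOADER_MIDDLEWARES setting):

# middlewares.py -- hypothetical module and class names
class FixRelativeUrlMiddleware(object):
    """Strip the stray '../' segments that urljoin leaves in request urls."""

    def process_request(self, request, spider):
        if '/../' in request.url:
            # Request objects are immutable; return a cleaned copy and
            # Scrapy will reschedule it through the middleware chain.
            return request.replace(url=request.url.replace('../', ''))
        # returning None lets the original request continue unchanged
        return None

Then register it in settings.py, e.g. DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.FixRelativeUrlMiddleware': 543} (path and priority are just examples).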

Please let me know if you have any questions.

Answered on 2013-11-04T15:24:44.270

Thanks to this answer, I finally found a solution. I used process_links as follows:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Product(Item):
    name = Field()

class siteSpider(CrawlSpider):
    name = "domain-name.com"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), process_links='process_links', callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('')), process_links='process_links', follow=True),
    )

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        product = Product()
        product['name'] = ''
        name = x.select('//title/text()').extract()
        if type(name) is list:
            for s in name:
                if s != ' ' and s != '':
                    product['name'] = s
                    break
        return product

    def process_links(self, links):
        # strip the stray "../" segments so the extracted links resolve correctly
        for i, w in enumerate(links):
            w.url = w.url.replace("../", "")
            links[i] = w
        return links
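A slightly more defensive variant (just a sketch; it only touches links whose path actually contains "/../", and note that posixpath.normpath also drops trailing slashes) would normalise the path instead of deleting every "../" outright:

    import posixpath
    import urlparse  # urllib.parse in Python 3

    def process_links(self, links):
        for link in links:
            parts = urlparse.urlparse(link.url)
            if '/../' in parts.path:
                # collapse the excess ".." segments instead of removing them blindly
                link.url = urlparse.urlunparse(
                    parts._replace(path=posixpath.normpath(parts.path)))
        return links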
Answered on 2013-11-04T17:37:37.087