scrapy - How can I prevent a Scrapy spider from converting encoded characters?

Translated from: https://stackoverflow.com/questions/17125149 2013-06-15T15:25:49.887

Viewed 294 times

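I am new to Scrapy and have run into a problem I can't find an answer to.

One of the sites I am trying to crawl uses percent-encoded characters in its URLs, specifically %2F. Scrapy converts %2F to "/", and the resulting GET request returns a 404 error page.

Oddly, %3D also appears in the URL, but Scrapy does not convert it to "=".

Here is an example URL as it appears in the page source: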

/example/Product-SKU-512MCTR-T%2FA-Detail/444172?h=5&rr=0.14&hitprm=h%3D

Here is what Scrapy actually tries to fetch:

/example/Product-SKU-512MCTR-T/A-Detail/444172?h=5&rr=0.14&hitprm=h%3D
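The asymmetry between %2F and %3D can be reproduced with the Python standard library alone. The following is a minimal sketch, assuming (this is not confirmed by the post) that some normalization step decodes the URL and then re-encodes it with "/" treated as a safe character. The helper names renormalize_path and renormalize_value are hypothetical, and the code uses Python 3's urllib.parse rather than anything from Scrapy itself.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def renormalize_path(url):
    """Decode the path, then re-encode it with '/' treated as safe.

    Hypothetical helper for illustration only; this is not Scrapy's code.
    """
    parts = urlsplit(url)
    # unquote() turns %2F into a literal '/'. quote() leaves '/' untouched
    # by default, so the escape is lost on the round trip.
    new_path = quote(unquote(parts.path))
    return urlunsplit(
        (parts.scheme, parts.netloc, new_path, parts.query, parts.fragment)
    )


def renormalize_value(value):
    """Apply the same round trip to a single value.

    '=' is not in quote()'s default safe set, so it is escaped again.
    """
    return quote(unquote(value))


# The example URL from the page source, exactly as given in the question.
source_url = (
    "/example/Product-SKU-512MCTR-T%2FA-Detail/444172"
    "?h=5&rr=0.14&hitprm=h%3D"
)

print(renormalize_path(source_url))
print(renormalize_value("h%3D"))
```

Under that assumption, the path loses its %2F while h%3D survives the round trip, which matches the two URLs shown above. Whether Scrapy's link extraction actually behaves this way depends on the version in use.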

Here is a snippet of the spider code:


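# Import paths are an assumption (not in the original post); they match 2013-era Scrapy releases.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):
    name = "test"
    # Kept from the original snippet; Scrapy normally reads this from the project settings.
    RANDOMIZE_DOWNLOAD_DELAY = True
    allowed_domains = ["test.com"]
    start_urls = [
        "http://www.test.com/jsp/results?h=#prows=100&sm=0",
    ]

    # Follow every link whose URL matches "example"; parse_auctions is not shown in this snippet.
    rules = (
        Rule(SgmlLinkExtractor(allow=('example',)), callback="parse_auctions", follow=True),
    )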
