python-2.7 - Python Scrapy: LinkExtractor does not work for certain redirected URLs

Question

I am new to both the web and Scrapy, so please bear with me if my question is a silly one.

Here is what I want. Page (A), http://www.seoultech.ac.kr/, contains a link to URL (B), ctl.seoultech.ac.kr. The domain of (B) is a subdomain of (A).

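My start_urls is (A). When I use a LinkExtractor with allow_domains=(B), the crawler extracts only one page, (B) itself.

Page (B) also contains URLs on its own domain, so I expected the crawler to extract the URLs contained in (B) as well. That does not happen; it only scrapes (B).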

URL (B) is redirected to http://ctl.seoultech.ac.kr/web/index.php, but as far as I know Scrapy handles redirects by itself, so I don't think that is the problem.

Below is my simple code.

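from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# `items` is the project's own items module; its import is not shown in the question.

class SeoulTech(CrawlSpider):
    name = 'seoulTech'
    start_urls = ['http://www.seoultech.ac.kr/']
    allowed_domains = ['seoultech.ac.kr']
    rules = (
        # Extract and follow only links that point to the ctl subdomain
        Rule(LinkExtractor(allow_domains=("ctl.seoultech.ac.kr",)), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        itemObj = items.SeoulTechItem()
        itemObj['url'] = response.url
        yield itemObj  # the pipeline just stores the URL in JSON format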
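
For reference, the domain filters in the code above can be sanity-checked outside of Scrapy. The sketch below is not Scrapy's own code; `host_in_domains` is a hypothetical helper that assumes the usual rule "the host equals the domain or ends with a dot plus the domain". It suggests that both (A) and the redirect target of (B) pass the filters used in the spider, so the domain settings alone are unlikely to explain the missing pages.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


def host_in_domains(url, domains):
    """Return True if the URL's host equals, or is a subdomain of, any domain.

    Hypothetical helper for illustration; it only approximates the kind of
    suffix check a domain filter performs.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(host == d.lower() or host.endswith("." + d.lower()) for d in domains)


# URLs taken from the question
url_a = "http://www.seoultech.ac.kr/"
url_b_redirected = "http://ctl.seoultech.ac.kr/web/index.php"

# Spider-level allowed_domains accepts (A)
print(host_in_domains(url_a, ["seoultech.ac.kr"]))                 # True
# Rule-level allow_domains accepts the redirect target of (B)
print(host_in_domains(url_b_redirected, ["ctl.seoultech.ac.kr"]))  # True
# (A) itself is outside the ctl subdomain
print(host_in_domains(url_a, ["ctl.seoultech.ac.kr"]))             # False
```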

score 0 · Accepted Answer

As you said, URL (B) is redirected to http://ctl.seoultech.ac.kr/web/index.php, so the LinkExtractor will definitely not process the page at URL (B).
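
One way to check what actually happened to the request for (B) is to look at the redirect history from inside the callback. Scrapy's RedirectMiddleware records the URLs a request passed through under the `redirect_urls` key of the request meta. The sketch below assumes the redirect is one that this middleware followed; `redirect_chain` is a hypothetical helper, and the `meta` dict is simulated so the snippet runs without Scrapy.

```python
def redirect_chain(meta, final_url):
    """Return every URL visited for one response, ending with the final URL.

    `meta` is expected to look like Scrapy's response.meta, where the
    'redirect_urls' key (if present) lists the URLs that were redirected.
    """
    return list(meta.get("redirect_urls", [])) + [final_url]


# Simulated values for page (B). Inside parse_item one would pass
# response.meta and response.url instead of these literals.
meta = {"redirect_urls": ["http://ctl.seoultech.ac.kr/"]}
chain = redirect_chain(meta, "http://ctl.seoultech.ac.kr/web/index.php")
print(" -> ".join(chain))
```

If the chain logged for (B) contains only the final URL, the page was probably not redirected at the HTTP level, and the cause should be looked for elsewhere, for example in how the links on that page are generated.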
