python - Python Scrapy allowed_domains attribute

Question

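I am learning to write a simple scraper that collects information about questions posted on Stack Overflow.

I set allowed_domains = ["http://stackoverflow.com/questions/"] in a basic spider. Its parse() method only returns a Request whose URL has the form "http://stackoverflow.com/questions/%d/" % no.

I thought it would work, but maybe I have misunderstood allowed_domains. All the requests returned by parse() seem to be filtered out by allowed_domains; the spider only works when I remove allowed_domains. Could you explain why? Sorry for the trivial question.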

class StackOverFlowPost(scrapy.Spider):
    startNo = 26200877
    endNo = 26200880
    curNo = 26200877
    name = "stackOverFlowPost"
    start_urls = ["http://stackoverflow.com/questions/%d/" % startNo ]
    allowed_domains = ["http://stackoverflow.com/questions"]
    baseUrl = "http://stackoverflow.com/questions/%d/"

    def parse(self, response):
        itemObj = items.StackOverFlowItem()

        # getting items information from the page
        ...
        yield itemObj

        StackOverFlowPost.curNo += 1
        nextPost = StackOverFlowPost.baseUrl % StackOverFlowPost.curNo  

        yield scrapy.Request(nextPost, callback = self.parse)

score 1 · Accepted Answer

In your spider, allowed_domains should be a list of domains (not URLs):

allowed_domains = ["stackoverflow.com"]
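To see why the URL form filters everything out, here is a minimal sketch of host-based matching. It is only an approximation of the idea behind Scrapy's offsite filtering, not Scrapy's actual code; the helper name is_allowed is hypothetical and uses only the standard library.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host is an allowed domain
    or a subdomain of one. Approximates domain-based filtering."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


url = "http://stackoverflow.com/questions/26200878/"

# A bare domain matches the request's host.
print(is_allowed(url, ["stackoverflow.com"]))  # True

# A full URL is never equal to a host name, so nothing matches.
print(is_allowed(url, ["http://stackoverflow.com/questions"]))  # False
```

The comparison is made against the host part of each request URL only, so an entry containing a scheme or a path can never match.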

Note that you can also set start_urls to a list of URLs, like this:

start_urls = ["http://stackoverflow.com/questions/%d/" % i for i in range(startNo, endNo+1)]

This makes parse() easier to write, because it no longer has to track curNo and yield the next Request itself.
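As a quick check of what that list comprehension produces, the snippet below builds the same list outside of a spider. The variable names start_no and end_no are illustrative stand-ins for the startNo and endNo class attributes in the question.

```python
# Illustrative stand-ins for the spider's startNo and endNo attributes.
start_no = 26200877
end_no = 26200880

# range() excludes its upper bound, hence end_no + 1 to include the last post.
start_urls = [
    "http://stackoverflow.com/questions/%d/" % i
    for i in range(start_no, end_no + 1)
]

for url in start_urls:
    print(url)
```

With the values from the question this yields four URLs, one per question number from 26200877 through 26200880.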
