python - How to keep Scrapy from ignoring the hash fragment of a URL

Question

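I am working with Scrapy.

I have a website whose URLs contain hash fragments, but when I run my spider, the response is downloaded with the hash fragment ignored.

For example, this is a URL with a hash fragment: url="www.example.com/hash-tag.php#user_id-654". The response to that request is only for www.example.com/hash-tag.php, but I want to scrape the URL with the hash fragment included.

My code is as follows: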

    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        domain_name = "www.example.com"

        def start_requests(self):
            # start_requests must return an iterable of Request objects
            return [scrapy.Request("www.example.com/hash-tag.php#user_id-654")]

        def parse(self, response):
            print(response)

Result:

<GET www.example.com/hash-tag.php>

What should I do? Thanks in advance.

score 0 · Accepted Answer

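What you are trying to do is not easy. The fragment (the part after the #) is never sent to the server; it is interpreted on the client side, so any content that depends on it is produced by JavaScript running in the page. To get what you want, you need a full DOM and a JavaScript engine, i.e. a (possibly headless) browser.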
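To see why the server cannot return fragment-specific content, it may help to look at how a URL is taken apart. The sketch below uses only Python's standard `urllib.parse`; the `http://` scheme is an assumption added for illustration, since the URL in the question has none.

```python
from urllib.parse import urlsplit

# The question's URL, with an "http://" scheme added (an assumption)
url = "http://www.example.com/hash-tag.php#user_id-654"

parts = urlsplit(url)

# The request target sent to the server is built from the path (and query) only
request_target = parts.path + ("?" + parts.query if parts.query else "")

print(parts.netloc)     # host the client connects to
print(request_target)   # what the server is asked for
print(parts.fragment)   # kept on the client side, never transmitted
```

Whatever the page shows for `user_id-654` is therefore decided after the download, by code running in the browser.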

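If you really need it, take a look at PhantomJS. It is the WebKit engine, but completely headless. I am not sure whether Scrapy can be extended easily, but if you really want to execute JavaScript (which you need in this case), PhantomJS is probably the way to go.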

score 0 · Accepted Answer

Well, if you really need that information, you can split the string before creating the Request and send the fragment along as metadata.

Something like:

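    from scrapy import Request

    # Inside a spider method such as start_requests():
    url = "www.example.com/hash-tag.php#user_id-654"
    # Everything after the first "#" is the fragment
    fragment = url.split("#", 1)[1]

    # Carry the fragment along with the request as metadata
    request = Request(url, callback=self.parse_something)
    request.meta['after_hash'] = fragment
    yield request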

Then read it back in the parse callback and use it:

    def parse_something(self, response):
        # Read back the fragment that was stored on the request
        fragment = response.meta['after_hash']

That is, if all you need is the information after the hash sign.
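Indexing the result of `split` with `[1]`, as above, raises an `IndexError` when a URL contains no `#`. A slightly more defensive sketch, using only the standard library, might look like this; `split_fragment` is a hypothetical helper name and not part of Scrapy.

```python
def split_fragment(url):
    """Return (url_without_fragment, fragment); fragment is '' if there is no '#'."""
    # partition() splits on the first "#" and never raises if it is absent
    base, _, fragment = url.partition("#")
    return base, fragment


base, fragment = split_fragment("www.example.com/hash-tag.php#user_id-654")
print(base)      # www.example.com/hash-tag.php
print(fragment)  # user_id-654
```

The fragment could then be attached when the request is built, for example with `Request(url, callback=self.parse_something, meta={'after_hash': fragment})`.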
