python - Scraping an HTML file saved on the local system

Question

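For example, I have a site, "www.example.com". What I actually want is to scrape that site's HTML from a copy saved on my local system. So, for testing, I saved the page to my desktop as example.html.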

I then wrote the following spider code for it:

class ExampleSpider(BaseSpider):
   name = "example"
   start_urls = ["example.html"]

   def parse(self, response):
       print response
       hxs = HtmlXPathSelector(response)

But when I run the code above, I get the following error:

ValueError: Missing scheme in request url: example.html

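Ultimately, my intent is to scrape the example.html file, which contains the HTML of www.example.com saved on my local system.

Can anyone suggest how to assign the example.html file in start_urls?

Thanks in advance.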

score 33 · Accepted Answer


You can scrape a local file using a URL of the following form:

 file:///path/to/file.html
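
As a minimal sketch of why the bare filename fails and how a valid URL can be produced, the snippet below uses only the Python standard library. The helper name to_file_url is hypothetical and not part of Scrapy; it assumes the saved page is named example.html and lives in the current working directory.

```python
from pathlib import Path
from urllib.parse import urlparse


def to_file_url(path):
    """Return an absolute file:// URL for a local path.

    Relative paths are resolved against the current working directory.
    """
    return Path(path).resolve().as_uri()


# A bare filename has an empty scheme, which is what the
# "Missing scheme in request url" error is complaining about.
print(repr(urlparse("example.html").scheme))

# The converted URL carries the "file" scheme.
url = to_file_url("example.html")
print(urlparse(url).scheme, url)
```

The resulting string can then be placed in start_urls in place of the bare filename.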

answered 2014-03-05T19:56:23.510

score 14 · Accepted Answer

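You can use HttpCacheMiddleware, which lets you run the spider from the cache. The HttpCacheMiddleware settings are described in the Scrapy documentation.

Basically, adding the following settings to your settings.py will make it work: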

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0 # Set to 0 to never expire

However, this requires an initial crawl over the network to populate the cache.
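
For replaying from the cache once it has been populated, a slightly fuller settings sketch might look like the following. HTTPCACHE_DIR and HTTPCACHE_IGNORE_MISSING are Scrapy settings that the answer does not mention, and the values shown are assumptions chosen for illustration.

```python
# settings.py (sketch; values are illustrative assumptions)

# Turn the HTTP cache on.
HTTPCACHE_ENABLED = True

# 0 means cached responses never expire.
HTTPCACHE_EXPIRATION_SECS = 0

# Where cached responses are stored; a relative path is taken
# relative to the project data directory.
HTTPCACHE_DIR = "httpcache"

# When True, requests that are not in the cache are ignored
# instead of being downloaded, so later runs stay offline.
HTTPCACHE_IGNORE_MISSING = True
```

HTTPCACHE_IGNORE_MISSING would typically be left at False for the first, network-backed run and switched to True afterwards.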

score 5 · Accepted Answer

In Scrapy, you can scrape a local file like this:

class ExampleSpider(BaseSpider):
   name = "example"
   start_urls = ["file:///path_of_directory/example.html"]

   def parse(self, response):
       print response
       hxs = HtmlXPathSelector(response)

I suggest checking it with scrapy shell 'file:///path_of_directory/example.html'.

score 2 · Accepted Answer

Just to share the way I like to do this kind of scraping with local files:

import scrapy
import os

LOCAL_FILENAME = 'example.html'
LOCAL_FOLDER = 'html_files'
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = [
        f"file://{BASE_DIR}/{LOCAL_FOLDER}/{LOCAL_FILENAME}"
    ]

I'm using f-strings (Python 3.6+, https://www.python.org/dev/peps/pep-0498/), but you can switch to %-formatting or str.format() if you prefer.
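
If the folder holds several saved pages, the same idea can be extended to build the whole start_urls list. This is only a sketch: local_start_urls is a hypothetical helper, not something from Scrapy or from the answer, and it assumes every saved page ends in .html.

```python
from pathlib import Path


def local_start_urls(folder, pattern="*.html"):
    """Return sorted file:// URLs for every file in `folder` matching `pattern`."""
    return [p.resolve().as_uri() for p in sorted(Path(folder).glob(pattern))]
```

In the spider this could be used as start_urls = local_start_urls(Path(BASE_DIR) / LOCAL_FOLDER). Using Path.as_uri() instead of string formatting also percent-encodes characters such as spaces in the path.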

score 1 · Accepted Answer

scrapy shell "file:E:\folder\to\your\script\Scrapy\teste1\teste1.html"

This worked for me today on Windows 10. I had to give the full path without the ////.
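
For comparison, the conventional three-slash file URL for the same Windows path can be derived with the standard library, as sketched below. The helper name windows_path_to_file_url is hypothetical, and whether a particular Scrapy version on Windows accepts this form rather than the one above is not verified here.

```python
from pathlib import PureWindowsPath
from urllib.parse import quote


def windows_path_to_file_url(path):
    """Convert an absolute Windows path with a drive letter to a file:/// URL."""
    # as_posix() turns backslashes into forward slashes: E:/folder/...
    posix_form = PureWindowsPath(path).as_posix()
    # Keep "/" and the drive colon, percent-encode everything else unsafe.
    return "file:///" + quote(posix_form, safe="/:")


print(windows_path_to_file_url(r"E:\folder\to\your\script\Scrapy\teste1\teste1.html"))
# file:///E:/folder/to/your/script/Scrapy/teste1/teste1.html
```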

score 0 · Accepted Answer

You can simply do:

def start_requests(self):
    yield Request(url='file:///path_of_directory/example.html')

score -6 · Accepted Answer

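If you look at the source code of Scrapy's Request, for example on GitHub, you can see that Scrapy sends a request to an HTTP server and gets the required page back from the server as the response. Your file system is not an HTTP server. To test with Scrapy, you have to set up an HTTP server. Then you can give Scrapy a URL such as: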

http://127.0.0.1/example.html
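
If serving the saved page over HTTP is the route taken, one possible sketch uses Python's built-in http.server module. The function name serve_directory is hypothetical, and the default port=0 (let the operating system pick a free port) is an assumption made here; the URL above implies port 80 instead.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def serve_directory(directory, port=0):
    """Serve `directory` over HTTP on 127.0.0.1 in a background thread.

    port=0 asks the OS for any free port; the chosen port is available
    afterwards as server.server_address[1].
    """
    handler = functools.partial(SimpleHTTPRequestHandler, directory=str(directory))
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    # Daemon thread so the server does not keep the process alive on exit.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

After server = serve_directory("html_files"), the page would be reachable at http://127.0.0.1:<port>/example.html, where <port> is server.server_address[1]. Calling server.shutdown() and server.server_close() stops it again.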
