python - Python Scrapy on offline (local) data

Question

I have a 270MB dataset (10,000 HTML files) on my computer. Can I use Scrapy to crawl this dataset locally? If so, how?

score 34 · Accepted Answer

SimpleHTTP Server Hosting

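If you really want to host it locally and use Scrapy, you can serve it by navigating to the directory where it is stored and running SimpleHTTPServer (on port 8000 in the example below):

python -m SimpleHTTPServer 8000   # Python 2
python3 -m http.server 8000       # Python 3 equivalent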

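Then point Scrapy at http://127.0.0.1:8000. Note that scrapy crawl takes a spider name rather than a URL, so the address belongs in the spider's start_urls (local_html below is a hypothetical spider name):

$ scrapy crawl local_html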

file://

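Another option is to point Scrapy directly at the set of files. Scrapy accepts file:// URLs in a spider's start_urls, so the files under a directory such as the following can be listed there:

file:///home/sagi/html_files # Assuming you're on a *nix system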
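With 10,000 files, the file:// URLs are easier to generate than to type. The sketch below uses only the standard library; the function name file_urls and the *.html pattern are assumptions for illustration.

```python
from pathlib import Path


def file_urls(directory, pattern="*.html"):
    """Return sorted file:// URLs for every matching file below `directory`."""
    # as_uri() only works on absolute paths, so resolve first
    root = Path(directory).resolve()
    return sorted(p.as_uri() for p in root.rglob(pattern) if p.is_file())
```

The returned list could be assigned to a spider's start_urls, for example start_urls = file_urls("/home/sagi/html_files").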

Wrapping Up

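Once you've set up your spider for Scrapy (see the example dirbot), just run the crawler by its spider name (local_html here is a hypothetical name for a spider whose start_urls contains http://127.0.0.1:8000):

$ scrapy crawl local_html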

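If the links in the HTML files are absolute rather than relative, they will still point at the original site and may not work against the local copy. You would need to adjust the files yourself.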
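One way to make that adjustment, sketched under assumptions: if every absolute link starts with the same site prefix, the prefix can be swapped for the local server address. The function name localize_links is hypothetical, and https://example.com in the usage note stands in for the original site, which the answer does not name. The sketch only touches href and src attributes and ignores protocol-relative links.

```python
import re


def localize_links(html, original_base, local_base="http://127.0.0.1:8000"):
    """Swap an absolute site prefix in href/src attributes for a local one."""
    original_base = original_base.rstrip("/")
    local_base = local_base.rstrip("/")
    pattern = re.compile(
        r"""((?:href|src)\s*=\s*["'])"""  # attribute name and opening quote
        + re.escape(original_base)
        + r"""(?=[/"'?#])""",  # prefix must end at a URL boundary
        re.IGNORECASE,
    )
    # A function replacement avoids escape handling in local_base
    return pattern.sub(lambda m: m.group(1) + local_base, html)
```

For example, localize_links(page, "https://example.com") would turn href="https://example.com/page2.html" into href="http://127.0.0.1:8000/page2.html" while leaving links to other hosts untouched.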

score 9 · Answer

Go to your dataset folder:

import os

# Run this from inside the dataset folder.
for name in os.listdir(os.getcwd()):
    if not os.path.isfile(name):
        continue  # skip subdirectories
    with open(name, "r") as f:
        page_content = f.read()
    # Do whatever you want with page_content here,
    # e.g. parse it with lxml or Beautiful Soup.

No need to go for Scrapy!
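As a sketch of what the parsing step might look like without any third-party packages, the following uses the standard library's html.parser in place of lxml or Beautiful Soup (a swapped-in technique, not the one the answer names). The class and function names are hypothetical, and pulling out each page's title is only an example task.

```python
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def titles_in(directory):
    """Map each .html file name in `directory` to its stripped title text."""
    result = {}
    for path in sorted(Path(directory).glob("*.html")):
        parser = TitleParser()
        # errors="replace" keeps going if a file is not valid UTF-8
        parser.feed(path.read_text(encoding="utf-8", errors="replace"))
        parser.close()
        result[path.name] = parser.title.strip()
    return result
```

Calling titles_in(".") from the dataset folder returns a dictionary keyed by file name; files without a title map to an empty string.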
