java - Using crawler4j to fetch html files on the file system

Question

I am trying to use the edu.uci.ics.crawler4j library to crawl pages from an html file in a local directory. Its path is C:/work/temp/test.html.

I found that crawler4j establishes an Http connection, but no Http connection is needed in this case. I also tried prefixing the file path with file://, as in "file:///C:/work/temp/test.html" (which is accessible).

From the code of the PageFetcher class:

    SchemeRegistry schemeRegistry = new SchemeRegistry();
    schemeRegistry.register(new Scheme("http", 80, PlainSocketFactory.getSocketFactory()));

    if (config.isIncludeHttpsPages()) {
        schemeRegistry.register(new Scheme("https", 443, SSLSocketFactory.getSocketFactory()));
    }

Is there a way to register the file:// protocol in the SchemeRegistry in crawler4j's PageFetcher, or is crawler4j always meant for files hosted on a server?

score 0 · Accepted Answer

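It should be a localhost URL. Serve the directory through a local web server so that, for example, localhost:80/ is the root of your directory. The URL should then look like http://localhost:80/.......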
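
One way to get such a localhost URL without installing a separate web server is the HTTP server that ships with the JDK. The following is only a minimal sketch under stated assumptions: the class name `LocalDirServer` and its `start` method are hypothetical and are not part of crawler4j, and every file is served as `text/html` for simplicity.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of crawler4j): exposes a local directory over HTTP.
public class LocalDirServer {

    // Starts a server on localhost:port whose root "/" maps to the given directory.
    // Passing port 0 lets the OS pick a free port.
    public static HttpServer start(Path root, int port) throws IOException {
        Path base = root.toAbsolutePath().normalize();
        HttpServer server = HttpServer.create(new InetSocketAddress("localhost", port), 0);
        server.createContext("/", exchange -> {
            // Drop the leading "/" so the request path resolves relative to the root.
            String relative = exchange.getRequestURI().getPath().substring(1);
            Path file = base.resolve(relative).normalize();
            // Reject paths that escape the root and anything that is not a regular file.
            if (!file.startsWith(base) || !Files.isRegularFile(file)) {
                exchange.sendResponseHeaders(404, -1);
                exchange.close();
                return;
            }
            byte[] body = Files.readAllBytes(file);
            // Assumption: all served files are HTML pages.
            exchange.getResponseHeaders().set("Content-Type", "text/html; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

With this sketch, calling `LocalDirServer.start(Paths.get("C:/work/temp"), 8080)` before starting the crawl should make the file reachable at `http://localhost:8080/test.html` (port 8080 is an arbitrary choice here), and that URL can be given to crawler4j as the seed instead of the `file://` path.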
