search-engine - How to save raw HTML files using Apache Nutch

Question

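I am new to search engines and web crawlers. I want to store all the raw pages of a particular website as HTML files, but with Apache Nutch I only get binary database files. How can I get the raw HTML files with Nutch?

Does Nutch support this? If not, what other tools can I use to achieve my goal? (Tools that support distributed crawling are preferred.)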

score 9 · Accepted Answer

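Well, Nutch writes the crawled data in binary form, so if you want it saved as HTML you will have to modify the code. (That will be painful if you are new to Nutch.)

If you want a quick and easy way to get the HTML pages:

If the list of pages/URLs you want is fairly small, it is better to do it with a script that invokes wget for each URL.
Or use the HTTrack tool.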
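
For readers who would rather stay in Java than script wget, the per-URL download loop described above could look roughly like the sketch below. It uses only the standard library and no Nutch code; the class name `SimplePageDownloader`, the method `downloadAll`, and the `page-NNNN.html` naming scheme are all hypothetical choices made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical stand-alone downloader; it does not use Nutch at all.
final class SimplePageDownloader {

    // Fetch each URL and store the response body unchanged as page-<index>.html.
    // Returns the number of pages saved; URLs that fail are reported and skipped.
    static int downloadAll(List<String> urls, Path outputDir) throws IOException {
        Files.createDirectories(outputDir);
        int saved = 0;
        for (int i = 0; i < urls.size(); i++) {
            Path target = outputDir.resolve(String.format("page-%04d.html", i));
            try (InputStream in = URI.create(urls.get(i)).toURL().openStream()) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                saved++;
            } catch (IOException | IllegalArgumentException e) {
                // Malformed URL or failed transfer: log it and move on.
                System.err.println("Skipping " + urls.get(i) + ": " + e);
            }
        }
        return saved;
    }
}
```

Calling `SimplePageDownloader.downloadAll(urls, Paths.get("savedsites"))` would fetch the pages one after another. Unlike wget or HTTrack, this sketch does not follow links, honor robots.txt, or throttle requests, so it only suits a short, known list of URLs.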

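Edit:

Writing your own Nutch plugin would be great. It would solve your problem, and you could contribute to Nutch by submitting your work! If you are new to Nutch (in terms of code and design), you will have to invest a lot of time in building a new plugin; otherwise it is easy to do.

A few pointers to get you started:

Here is a page that discusses writing your own Nutch plugin.

Start with Fetcher.java. See lines 647-648 (the exact line numbers depend on your Nutch version). That is where you can get the fetched content on a per-URL basis, for the pages that were fetched successfully.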

pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS);
updateStatus(content.getContent().length);

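You should add code right after this to invoke your plugin, passing it the content object. As you have probably guessed by now, content.getContent() is the content of the URL you want. In the plugin code, write it to a file. The file name should be based on the URL, otherwise it will be hard to work with. The URL can be obtained via fit.url.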
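
The file-writing step described above can be sketched with the standard library alone. The class `RawPageSaver` and its methods are hypothetical names, not part of Nutch; inside the fetcher you would presumably pass something like `fit.url.toString()` and `content.getContent()` as the `url` and `rawContent` arguments.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of Nutch.
final class RawPageSaver {

    // Turn a URL into a string that is safe to use as a single file name:
    // every character outside [A-Za-z0-9._-] becomes '_'.
    static String fileNameFor(String url) {
        String name = url.replaceAll("[^A-Za-z0-9._-]", "_");
        // Keep names reasonably short; many file systems cap names at 255 bytes.
        if (name.length() > 200) {
            name = name.substring(0, 200);
        }
        return name + ".html";
    }

    // Write the fetched bytes unchanged, so the page's original encoding is preserved.
    static Path save(Path outputDir, String url, byte[] rawContent) throws IOException {
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(fileNameFor(url));
        return Files.write(target, rawContent);
    }
}
```

Writing the bytes rather than a decoded string avoids guessing the page's charset. Note that the sanitizing and truncation can map two different URLs to the same file name, so a real plugin would probably append a hash of the full URL.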

score 6 · Accepted Answer

You have to run Nutch in Eclipse to make the modification.

Once you are able to run it, open Fetcher.java and add the lines between the "content saver" comment lines.

case ProtocolStatus.SUCCESS:        // got a page
            pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS, fit.outlinkDepth);
            updateStatus(content.getContent().length);

            //------------------------------------------- content saver ---------------------------------------------\\
            // Needs java.io.File, java.io.FileOutputStream and java.io.OutputStream imported in Fetcher.java.
            // Replace every character that is unsafe in a file name, not only '/'.
            String filename = "savedsites/" + content.getUrl().replaceAll("[^A-Za-z0-9._-]", "-");

            File file = new File(filename);
            file.getParentFile().mkdirs();
            if (!file.createNewFile()) {
                System.out.println("File exists.");
            } else {
                // Write the fetched bytes as-is: content.toString() also contains metadata,
                // and searching it for "<!DOCTYPE html" throws on pages without that doctype.
                try (OutputStream out = new FileOutputStream(file)) {
                    out.write(content.getContent());
                }
                System.out.println("File created successfully.");
            }
            //------------------------------------------- content saver ---------------------------------------------\\

score 6 · Accepted Answer

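An update to this answer:

It is possible to post-process the data in your crawl's segment folder and read the HTML directly (along with the other data Nutch stores).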

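    // Needs org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem,
    // org.apache.hadoop.fs.Path, org.apache.hadoop.io.SequenceFile, org.apache.hadoop.io.Text,
    // org.apache.nutch.protocol.Content and org.apache.nutch.util.NutchConfiguration.
    // "segment" is assumed to be the Path of one segment directory.
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");

    // try-with-resources closes the reader even if reading fails.
    try (SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf))
    {
            Text key = new Text();
            Content content = new Content();

            while (reader.next(key, content))
            {
                    // getContent() returns the raw bytes; this decodes them with the platform default charset.
                    System.out.println(new String(content.getContent()));
            }
    }
    catch (Exception e)
    {
            // Do not swallow errors silently.
            e.printStackTrace();
    }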

score 0 · Accepted Answer

The answers here are outdated. It is now easy to get the raw HTML files with nutch dump. See this answer.

answered 2018-03-12T08:49:52.860

score -1 · Accepted Answer

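In Apache Nutch 2.3.1, you can save the raw HTML by editing the Nutch code. First, follow https://wiki.apache.org/nutch/RunNutchInEclipse to run Nutch in Eclipse.

Once Nutch runs in Eclipse, edit the file FetcherReducer.java and add this code to the output method, then run ant eclipse again to rebuild the classes.

Finally, the raw HTML will be added to the reprUrl column in the database.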

if (content != null) {
    ByteBuffer raw = fit.page.getContent();
    if (raw != null) {
        ByteArrayInputStream arrayInputStream = new ByteArrayInputStream(raw.array(), raw.arrayOffset() + raw.position(), raw.remaining());
        Scanner scanner = new Scanner(arrayInputStream);
        scanner.useDelimiter("\\Z"); // read the whole stream as a single token
        String data = "";
        if (scanner.hasNext()) {
            data = scanner.next();
        }
        // Store the raw HTML in the page's reprUrl field.
        fit.page.setReprUrl(StringUtil.cleanField(data));
        scanner.close();
    }
}
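
The Scanner in the snippet above decodes the bytes with the platform default charset and requires an array-backed buffer. A possible alternative, shown here only as a sketch, is to decode the ByteBuffer directly with an explicit charset. The class `ByteBufferText` and its methods are hypothetical names, and UTF-8 is an assumption about the page encoding, not something Nutch guarantees.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Nutch.
final class ByteBufferText {

    // Decode the remaining bytes of the buffer without moving its position.
    static String decode(ByteBuffer raw, Charset charset) {
        // duplicate() shares the bytes but has its own position and limit,
        // so the caller's buffer is left untouched.
        return charset.decode(raw.duplicate()).toString();
    }

    // Convenience overload that assumes the page is UTF-8 encoded.
    static String decodeUtf8(ByteBuffer raw) {
        return decode(raw, StandardCharsets.UTF_8);
    }
}
```

With this helper, the body of the inner `if` could shrink to something like `fit.page.setReprUrl(StringUtil.cleanField(ByteBufferText.decodeUtf8(raw)))`, provided the charset assumption holds for the crawled site.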
