html-parsing - How to extract the main text from HTML using Tika

Question

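How can I use Tika to extract just the main body content of an HTML page as plain text?

One possible solution might be BoilerpipeContentHandler, but does anyone have sample or demo code that shows how to use it?

Thanks in advance.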

score 4 · Accepted Answer

The BodyContentHandler class does not use the Boilerpipe code, so you have to use BoilerpipeContentHandler explicitly. The following code worked for me:

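// Imports use the Tika 1.x package layout; package names may differ in other versions.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public String[] tika_autoParser() {
    String[] result = new String[2];
    // try-with-resources closes the stream even if parsing fails
    try (InputStream input = new FileInputStream("test.html")) {
        ContentHandler textHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // Wrap the body handler so Boilerpipe filters the text before it reaches textHandler
        parser.parse(input, new BoilerpipeContentHandler(textHandler), metadata, context);
        result[0] = "Title: " + metadata.get(Metadata.TITLE);
        result[1] = "Body: " + textHandler.toString();
    } catch (IOException | SAXException | TikaException e) {
        // FileNotFoundException is a subclass of IOException, so it is caught here too
        e.printStackTrace();
    }

    return result;
}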
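
For readers wondering why the wrapping matters: the parser simply pushes SAX events into whatever ContentHandler it is given, so a wrapper placed in front of the BodyContentHandler gets to decide which text reaches it. The sketch below illustrates that idea using only the JDK's SAX classes; it does not use Tika or Boilerpipe. The names HandlerWrappingSketch, TextCollector, and SkippingHandler are hypothetical, and the "drop anything inside nav or footer" rule is a crude stand-in assumed for illustration, not how Boilerpipe actually classifies content.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class HandlerWrappingSketch {

    /** Innermost handler: collects every character event it receives. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            // Collapse runs of whitespace so the result is easy to compare
            return text.toString().trim().replaceAll("\\s+", " ");
        }
    }

    /** Hypothetical wrapper: forwards text unless it sits inside nav or footer. */
    static class SkippingHandler extends DefaultHandler {
        private final ContentHandler delegate;
        private int skipDepth = 0; // > 0 while inside a skipped element

        SkippingHandler(ContentHandler delegate) {
            this.delegate = delegate;
        }

        private static boolean isBoilerplate(String name) {
            return name.equals("nav") || name.equals("footer");
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (isBoilerplate(qName)) skipDepth++;
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (isBoilerplate(qName)) skipDepth--;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            // Only text outside the skipped elements reaches the wrapped handler
            if (skipDepth == 0) delegate.characters(ch, start, length);
        }
    }

    /** Parses well-formed XHTML and returns the collected text, optionally filtered. */
    static String extract(String xhtml, boolean filter) throws Exception {
        TextCollector collector = new TextCollector();
        DefaultHandler handler = filter ? new SkippingHandler(collector) : collector;
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return collector.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><nav>Home | About</nav> "
                + "<p>The article text.</p> <footer>Copyright</footer></body></html>";
        System.out.println(extract(page, false)); // text from every element
        System.out.println(extract(page, true));  // text outside nav and footer only
    }
}
```

In the Tika code above, BoilerpipeContentHandler sits in the position that SkippingHandler occupies here, and textHandler plays the role of TextCollector.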

score 2 · Accepted Answer

Here is an example:

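import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public String[] tika_autoParser() {
    String[] result = new String[2];
    // try-with-resources closes the stream even if parsing fails
    try (InputStream input = new FileInputStream("/Users/nazanin/Books/Web crawler.pdf")) {
        // Note: the no-argument BodyContentHandler caps output at 100,000 characters;
        // new BodyContentHandler(-1) disables the limit for large documents.
        ContentHandler textHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // AutoDetectParser detects the file type and delegates to the matching parser
        parser.parse(input, textHandler, metadata, context);
        result[0] = "Title: " + metadata.get(Metadata.TITLE);
        result[1] = "Body: " + textHandler.toString();
    } catch (IOException | SAXException | TikaException e) {
        // FileNotFoundException is a subclass of IOException, so it is caught here too
        e.printStackTrace();
    }

    return result;
}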
