java - How do I write a custom ContentHandler with Apache Tika?

Question

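I want to use Apache Tika to extract the text inside certain tags of an HTML file, such as <dt> and <dd>.

So I am writing a custom ContentHandler that should extract the information from these tags.

My custom ContentHandler code is shown below. It is not finished yet, but it already fails to work as expected: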

import java.util.HashMap;
import java.util.Map;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// DefaultHandler supplies no-op versions of the ContentHandler callbacks
// that are not shown here, so the class compiles as written.
public class TableContentHandler extends DefaultHandler {

    // key = abbreviation
    // value = information / description for abbreviation
    private Map<String, String> abbreviations = new HashMap<String, String>();

    // current abbreviation
    private String abbreviation = null;

    // <dd> element contains abbreviation. So this boolean variable will be set when
    // <dd> element is found
    private boolean ddElementStarted = false;

    // this method is not giving contents within <dd> and </dd> tags
    @Override
    public void characters(char[] chars, int start, int length) throws SAXException {
        if (ddElementStarted) {
            System.out.println("chars found...");
        }
    }

    // set boolean ddElementStarted to true to indicate that content handler found
    // <dd> element
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        if (localName.equalsIgnoreCase("dd")) {
            ddElementStarted = true;
        }
    }
}

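My assumption is that once the content handler enters startElement() with the element name dd, I set ddElementStarted = true, and then the characters() method gives me the content between <dd> and </dd>.

In characters(), I check whether ddElementStarted is true and expect the chars array to contain the content between <dd> and </dd>, but it does not work :(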
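For reference, the flag-based approach described above could be sketched as follows. This is only a minimal sketch using plain SAX from the JDK, not Tika itself: the class name DefinitionListHandler, the helper parse(), and the sample input are hypothetical, and it assumes well-formed XHTML input in which <dt> and <dd> are not nested inside each other. The points it illustrates are that characters() has to read only the slice chars[start .. start + length), that text may arrive in several calls and therefore needs buffering, and that the flag has to be cleared again in endElement().

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects <dt>/<dd> pairs into a map.
public class DefinitionListHandler extends DefaultHandler {

    private final Map<String, String> definitions = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private String currentTerm = null;
    private boolean capturing = false;

    // Non-namespace-aware parsers may report an empty local name,
    // so fall back to the qualified name in that case.
    private static String nameOf(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = nameOf(localName, qName);
        if (name.equalsIgnoreCase("dt") || name.equalsIgnoreCase("dd")) {
            capturing = true;
            text.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only the slice [start, start + length) belongs to this event.
        if (capturing) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = nameOf(localName, qName);
        if (name.equalsIgnoreCase("dt")) {
            currentTerm = text.toString().trim();
            capturing = false;
        } else if (name.equalsIgnoreCase("dd")) {
            if (currentTerm != null) {
                definitions.put(currentTerm, text.toString().trim());
            }
            capturing = false;
        }
    }

    public Map<String, String> getDefinitions() {
        return definitions;
    }

    // Hypothetical helper: runs the handler over an XHTML string with the JDK SAX parser.
    public static Map<String, String> parse(String xhtml) throws Exception {
        DefinitionListHandler handler = new DefinitionListHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getDefinitions();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<dl><dt>HTML</dt><dd>HyperText Markup Language</dd></dl>";
        System.out.println(parse(xhtml)); // prints {HTML=HyperText Markup Language}
    }
}
```

With Tika, the same handler instance would be passed as the ContentHandler argument of Parser.parse(InputStream, ContentHandler, Metadata, ParseContext). Whether Tika's HTML parser forwards <dt> and <dd> events unchanged for a given document is an assumption that should be checked against the Tika version in use.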

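I would like to know:

1. Am I heading in the right direction?
2. Is this the right way to parse HTML with Tika, or is there another way?
3. Should I choose a different HTML parsing API such as JSoup? I only need information from a few tags, such as <dt> and <dd>; I am not interested in the rest of the HTML page.
4. Is there a way to specify XPath expressions in Apache Tika? I could not find this information in the book Tika in Action.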

score 1 · Accepted Answer

The simple solution is Jsoup. It makes it easy to get the value inside any tag, so instead of writing a new ContentHandler, just use JSoup to parse the HTML.
