java - Applying XPath with the Tika parser

Question

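I want to clean up irregular web content (which may be HTML, PDF, images, etc.), mostly HTML. I am using the Tika parser for this, but I don't know how to apply the XPath expressions I used with HtmlCleaner.

The code I am using is: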

BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
URL u = new URL("http://stackoverflow.com/questions/9128696/is-there-any-way-to-reach-drop-moment-in-drag-and-drop");
new HtmlParser().parse(u.openStream(), handler, metadata, context);
System.out.println(handler.toString());

In this case I get no output, but for the URL google.com I do get output.

Either way, I don't know how to apply XPath.

Any ideas, please?

I tried using my own XPath the same way BodyContentHandler does:

HttpClient client = new HttpClient();
GetMethod method = new GetMethod("http://stackoverflow.com/questions/9128696/is-there-any-way-to-reach-drop-moment-in-drag-and-drop");
int status = client.executeMethod(method);
HtmlParser parse = new HtmlParser();
XPathParser parser = new XPathParser("xhtml", "http://www.w3.org/1999/xhtml");
//Matcher matcher = parser.parse("/xhtml:html/xhtml:body/descendant:node()");
Matcher matcher = parser.parse("/html/body//h1");
ContentHandler textHandler = new MatchingContentHandler(new WriteOutContentHandler(), matcher);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
parse.parse(method.getResponseBodyAsStream(), textHandler, metadata, context);
System.out.println("content: " + textHandler.toString());

But I don't get the content selected by the given XPath.

score 2 · Accepted Answer

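I'd suggest you take a look at the source code of BodyContentHandler, which ships with Tika. BodyContentHandler returns only the XML inside the body tag, and it selects it with an XPath.

In general, though, you should use a MatchingContentHandler to wrap your chosen ContentHandler with an XPath, which is what BodyContentHandler does internally.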
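
One likely reason the expression in the question matches nothing is namespaces: Tika's HTML parser reports elements in the XHTML namespace, and BodyContentHandler's own expression is written with the xhtml: prefix. The sketch below illustrates that effect with the JDK's javax.xml.xpath API rather than Tika's XPathParser, so it runs without Tika on the classpath. The class name XhtmlXPathSketch and the SAMPLE markup are hypothetical, and the assumption that Tika's matcher treats unprefixed names the same way should be checked against the Tika source.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XhtmlXPathSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical stand-in for the namespaced XHTML an HTML parser would emit.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>t</title></head>"
        + "<body><div><h1>Drop moment</h1></div><p>other text</p></body></html>";

    // Evaluates an XPath 1.0 expression against SAMPLE and returns its string value.
    static String evaluate(String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // without this, namespaces are ignored
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(SAMPLE)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the "xhtml" prefix to the XHTML namespace URI.
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                return "xhtml".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            @Override
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "xhtml" : null;
            }
            @Override
            public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                    ? Collections.singletonList("xhtml").iterator()
                    : Collections.<String>emptyIterator();
            }
        });
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Unprefixed names only match elements in no namespace, so nothing is found.
        System.out.println("unprefixed: [" + evaluate("/html/body//h1") + "]");
        // Prefixed names match the XHTML elements.
        System.out.println("prefixed:   [" + evaluate("/xhtml:html/xhtml:body//xhtml:h1") + "]");
    }
}
```

If the same reasoning carries over to Tika, passing "/xhtml:html/xhtml:body//xhtml:h1" to the XPathParser built with the "xhtml" prefix in the question would be the first thing to try.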
