How can I retrieve the value of this <div> with data-postid 48227783 using Apache Tika?

<div class="postcolor post_text" data-postid="48227783">Ownage!<br /></div>

I am trying to retrieve the value 'Ownage!'. I tried using mapSafeElement and DefaultHtmlMapper, but I cannot find the element anywhere.

Thanks.

1 Answer

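I would override the mapSafeElement, mapSafeAttribute and isDiscardElement methods to get access to this element during parsing, because Tika may reject the non-standard (non-"safe") attribute "data-postid". The mapper class is shown further down.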

You would then use this class through the ParseContext object, as follows:

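InputStream input = <your Uri/file/string input stream>;
ParseContext parseContext = new ParseContext();
parseContext.set(HtmlMapper.class, new AllTagMapper());

HtmlParser parser = new HtmlParser();
// ContentHandler is an interface and cannot be instantiated directly.
// org.xml.sax.helpers.DefaultHandler is a no-op stand-in; replace it with your own handler.
parser.parse(input, new DefaultHandler(), new Metadata(), parseContext);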

// Implement HtmlMapper so that all tags and attributes are passed through.

class AllTagMapper implements HtmlMapper {

    @Override
    public String mapSafeElement(String name) {
        return name.toLowerCase();
    }

    @Override
    public boolean isDiscardElement(String name) {
        return false;
    }

    @Override
    public String mapSafeAttribute(String elementName, String attributeName) {
        return attributeName.toLowerCase();
    }

}
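
To actually read 'Ownage!', the handler passed to parser.parse has to look at the SAX events itself. Below is a minimal sketch of such a handler, assuming that Tika forwards the data-postid attribute once the mapper above lets it through (I have not verified this against a specific Tika version). The names PostDivHandler and extractFromXhtml are hypothetical and are not part of Tika. The helper method drives the handler with the JDK's own SAX parser only so that the sketch runs without Tika on the classpath; with Tika you would pass new PostDivHandler("48227783") to parser.parse instead.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text inside the first <div>
// whose data-postid attribute equals the requested id.
class PostDivHandler extends DefaultHandler {
    private final String postId;
    private final StringBuilder text = new StringBuilder();
    private int depth = 0;        // > 0 while inside the matching div
    private boolean done = false; // true once the matching div has closed

    PostDivHandler(String postId) {
        this.postId = postId;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0) {          // nested element inside the matching div
            depth++;
            return;
        }
        // localName is empty when the parser is not namespace-aware
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if (!done && "div".equalsIgnoreCase(name)
                && postId.equals(atts.getValue("data-postid"))) {
            depth = 1;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0 && --depth == 0) {
            done = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            text.append(ch, start, length);
        }
    }

    String getText() {
        return text.toString().trim();
    }

    // Demo helper (assumption: the input is well-formed XHTML, which the JDK parser requires).
    static String extractFromXhtml(String xhtml, String postId) {
        try {
            PostDivHandler handler = new PostDivHandler(postId);
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.getText();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }
}
```

Calling PostDivHandler.extractFromXhtml on the <div> from the question with the id "48227783" returns Ownage!. The depth counter is there so that nested tags such as <br /> do not end the capture early.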
Answered 2013-10-16T09:19:56.073