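I would override the mapSafeElement, mapSafeAttribute and isDiscardElement methods to get access to this element during parsing, since Tika may otherwise reject the non-standard, non-"safe" attribute "data-postid". The mapper class is shown below.
You would then use this class through the ParseContext object, like so: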
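InputStream input = <your Uri/file/string input stream>;
ParseContext parseContext = new ParseContext();
// Register the custom mapper so that HtmlParser uses it instead of the default one.
parseContext.set(HtmlMapper.class, new AllTagMapper());
HtmlParser parser = new HtmlParser();
// ContentHandler is an interface, so pass a concrete implementation;
// ToXMLContentHandler (org.apache.tika.sax) is one option that keeps tags and attributes.
ContentHandler handler = new ToXMLContentHandler();
parser.parse(input, handler, new Metadata(), parseContext);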
// Override HtmlMapper to process all tags and attributes.
class AllTagMapper implements HtmlMapper {
    @Override
    public String mapSafeElement(String name) {
        return name.toLowerCase();
    }

    @Override
    public boolean isDiscardElement(String name) {
        return false;
    }

    @Override
    public String mapSafeAttribute(String elementName, String attributeName) {
        return attributeName.toLowerCase();
    }
}
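To see why the override matters, here is a minimal sketch of the idea that runs without Tika on the classpath. It assumes the usual HtmlMapper contract, where a null return from mapSafeAttribute means "drop this attribute". Everything in it is hypothetical and for illustration only: AttributeMapper is a stand-in for the attribute part of HtmlMapper, and the allow-list in AllowListMapper is made up, not Tika's actual list of safe attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class MapperSketch {

    // Hypothetical stand-in for the attribute-mapping part of HtmlMapper; not a Tika type.
    public interface AttributeMapper {
        // Returns the output name, or null if the attribute should be dropped.
        String mapSafeAttribute(String elementName, String attributeName);
    }

    // Allow-list mapper. The attribute set is illustrative, not Tika's real list.
    public static class AllowListMapper implements AttributeMapper {
        private static final Set<String> SAFE = Set.of("id", "class", "href", "title");

        @Override
        public String mapSafeAttribute(String elementName, String attributeName) {
            String lower = attributeName.toLowerCase(Locale.ROOT);
            return SAFE.contains(lower) ? lower : null;
        }
    }

    // Pass-through mapper, same idea as AllTagMapper: nothing is rejected.
    public static class PassThroughMapper implements AttributeMapper {
        @Override
        public String mapSafeAttribute(String elementName, String attributeName) {
            return attributeName.toLowerCase(Locale.ROOT);
        }
    }

    // Keeps only the attributes the mapper does not reject (non-null result).
    public static List<String> keptAttributes(AttributeMapper mapper, String element,
                                              List<String> attributes) {
        List<String> kept = new ArrayList<>();
        for (String attribute : attributes) {
            String mapped = mapper.mapSafeAttribute(element, attribute);
            if (mapped != null) {
                kept.add(mapped);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> attributes = List.of("class", "data-postid");
        System.out.println(keptAttributes(new AllowListMapper(), "div", attributes));   // prints [class]
        System.out.println(keptAttributes(new PassThroughMapper(), "div", attributes)); // prints [class, data-postid]
    }
}
```

With the allow-list mapper, data-postid never reaches the handler. With the pass-through mapper it does, which is the effect AllTagMapper is meant to have inside Tika.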