java - html文件的lucene索引

Question

亲爱的用户我正在使用 apache lucene 进行索引和搜索。我必须索引存储在计算机本地磁盘上的 html 文件。我必须对 html 文件的文件名和内容进行索引。我能够将文件名存储在 lucene 索引中，但不能存储 html 文件内容，它不仅应该索引数据，还应该索引包含图像链接和 url 的整个页面，以及如何从这些索引文件中访问内容以进行索引我正在使用以下代码：

    File indexDir = new File(indexpath);
    File dataDir = new File(datapath);
    String suffix = ".htm";
    IndexWriter indexWriter = new IndexWriter(
            FSDirectory.open(indexDir),
            new SimpleAnalyzer(),
            true,
            IndexWriter.MaxFieldLength.LIMITED);
    indexWriter.setUseCompoundFile(false);
    indexDirectory(indexWriter, dataDir, suffix);

    numIndexed = indexWriter.maxDoc();
    indexWriter.optimize();
    indexWriter.close();


private void indexDirectory(IndexWriter indexWriter, File dataDir, String suffix) throws IOException {
    try {
        for (File f : dataDir.listFiles()) {
            if (f.isDirectory()) {
                indexDirectory(indexWriter, f, suffix);
            } else {
                indexFileWithIndexWriter(indexWriter, f, suffix);
            }
        }
    } catch (Exception ex) {
        System.out.println("exception 2 is" + ex);
    }
}

private void indexFileWithIndexWriter(IndexWriter indexWriter, File f,
    String suffix) throws IOException {
    try {
        if (f.isHidden() || f.isDirectory() || !f.canRead() || !f.exists()) {
            return;
        }
        if (suffix != null && !f.getName().endsWith(suffix)) {
            return;
        }
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(f)));
        doc.add(new Field("filename", f.getFileName(),
                Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(doc);
    } catch (Exception ex) {
        System.out.println("exception 4 is" + ex);
    }
}

提前致谢

score 12 · Accepted Answer

这行代码是您的内容未被存储的原因：

doc.add(new Field("contents", new FileReader(f)));

此方法不存储被索引的内容。

如果您尝试索引 HTML 文件，请尝试使用JTidy。这将使该过程变得更加容易。

示例代码：

public class JTidyHTMLHandler {

    public org.apache.lucene.document.Document getDocument(InputStream is) throws DocumentHandlerException {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        org.w3c.dom.Document root = tidy.parseDOM(is, null);
        Element rawDoc = root.getDocumentElement();

        org.apache.lucene.document.Document doc =
                new org.apache.lucene.document.Document();

        String body = getBody(rawDoc);

        if ((body != null) && (!body.equals(""))) {
            doc.add(new Field("contents", body, Field.Store.NO, Field.Index.ANALYZED));
        }

        return doc;
    }

    protected String getTitle(Element rawDoc) {
        if (rawDoc == null) {
            return null;
        }

        String title = "";

        NodeList children = rawDoc.getElementsByTagName("title");
        if (children.getLength() > 0) {
            Element titleElement = ((Element) children.item(0));
            Text text = (Text) titleElement.getFirstChild();
            if (text != null) {
                title = text.getData();
            }
        }
        return title;
    }

    protected String getBody(Element rawDoc) {
        if (rawDoc == null) {
            return null;
        }

        String body = "";
        NodeList children = rawDoc.getElementsByTagName("body");
        if (children.getLength() > 0) {
            body = getText(children.item(0));
        }
        return body;
    }

    protected String getText(Node node) {
        NodeList children = node.getChildNodes();
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            switch (child.getNodeType()) {
                case Node.ELEMENT_NODE:
                    sb.append(getText(child));
                    sb.append(" ");
                    break;
                case Node.TEXT_NODE:
                    sb.append(((Text) child).getData());
                    break;
            }
        }
        return sb.toString();
    }
}

从 URL 获取 InputStream：

URL url = new URL(htmlURLlocation);
URLConnection connection = url.openConnection();
InputStream stream = connection.getInputStream();

要从文件中获取 InputStream：

InputStream stream = new FileInputStream(new File (htmlFile));

java - html文件的lucene索引

1 回答 1

Related

Reference