java - How to optimize an inverted file for full-text indexing?

Question

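I am building a simple program that creates a full-text index in my database from a sample of PDF files. The idea is that I read each PDF file, extract the words, and store them in a hash set.

Then each word, together with its file path, is added in a loop to a table in MySQL, so the words are stored one row at a time until the loop finishes. It works fine. However, for large PDF files containing thousands of words, building the index table can take a while. In other words, extracting the words is fast, but saving each word to the database takes a long time.

Code: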

public class IndexTest {

public static void main(String[] args) throws Exception {
    // write your code here
    //String path ="D:\\Full Text Indexing\\testIndex\\bell2009a.pdf";
    // HashSet<String> uniqueWords = new HashSet<>();
    /*StopWatch stopwatch = new StopWatch();
    stopwatch.start();*/
    File folder = new File("D:\\PDF1");
    File[] listOfFiles = folder.listFiles();

    for (File file : listOfFiles) {
        if (file.isFile()) {
            HashSet<String> uniqueWords = new HashSet<>();
            String path = "D:\\PDF1\\" + file.getName();
            try (PDDocument document = PDDocument.load(new File(path))) {

                if (!document.isEncrypted()) {

                    PDFTextStripper tStripper = new PDFTextStripper();
                    String pdfFileInText = tStripper.getText(document);
                    String lines[] = pdfFileInText.split("\\r?\\n");
                    for (String line : lines) {
                        String[] words = line.split(" ");

                        for (String word : words) {
                            uniqueWords.add(word);

                        }

                    }
                    // System.out.println(uniqueWords);

                }
            } catch (IOException e) {
                System.err.println("Exception while trying to read pdf document - " + e);
            }
            Object[] words = uniqueWords.toArray();
            String unique = uniqueWords.toString();
            //  System.out.println(words[1].toString());



            for(int i = 1 ; i <= words.length - 1 ; i++ ) {
                MysqlAccessIndex connection = new MysqlAccessIndex();
                connection.readDataBase(path, words[i].toString());

            }

            System.out.println("Completed");

        }
    }
}
}

SQL connection code:

 public class MysqlAccessIndex {

      public MysqlAccessIndex() throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        connect = DriverManager
                .getConnection("jdbc:mysql://126.32.3.178/fulltext_ltat?"
                        + "user=root&password=root123");
      //  statement = connect.createStatement();
        System.out.print("Connected");
    }


    public void readDataBase(String path,String word) throws Exception {
        try {




            statement = connect.createStatement();
            System.out.print("Connected");


            preparedStatement = connect
                    .prepareStatement("insert IGNORE into  fulltext_ltat.test_text values (?, ?) ");

            preparedStatement.setString(1, path);
            preparedStatement.setString(2, word);
            preparedStatement.executeUpdate();
            // resultSet = statement
            //.executeQuery("select * from fulltext_ltat.index_detail");



            //  writeResultSet(resultSet);
        } catch (Exception e) {
            throw e;
        } finally {
            close();
        }

    }
}

Any suggestions for improving or optimizing the performance?

score 1 · Accepted Answer

The problem lies in the following code:

// This will load the MySQL driver, each DB has its own driver
Class.forName("com.mysql.jdbc.Driver");
// Setup the connection with the DB
connect = DriverManager.getConnection(
        "jdbc:mysql://126.32.3.20/fulltext_ltat?" + "user=root&password=root");

You are recreating the connection for every word you insert into the database. A better approach would be something like this:

public MysqlAccess() throws SQLException {
    connect = DriverManager
                .getConnection("jdbc:mysql://126.32.3.20/fulltext_ltat?"
                        + "user=root&password=root");
}

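This way, connect is created only once, when the class is first instantiated. Inside your main method, you must create the MysqlAccess instance outside the for loop, so that it is created only once.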
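As a minimal sketch of that loop structure, assuming the unique words of each PDF have already been extracted into a map from file path to word set: the class name SingleConnectionSketch, the method insertAll and the wordsByPath parameter are hypothetical, and the two-column insert is the one from the question. The caller opens the connection once and passes it in, so neither the connection nor the statement is recreated per word.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;
import java.util.Set;

public class SingleConnectionSketch {

    // Hypothetical helper: wordsByPath maps each PDF path to its unique words.
    // The caller opens the connection once and passes it in.
    // Returns the number of insert statements executed.
    public static int insertAll(Connection connect, Map<String, Set<String>> wordsByPath)
            throws SQLException {
        int executed = 0;
        // Prepare the statement once, outside both loops
        try (PreparedStatement ps = connect.prepareStatement(
                "insert IGNORE into fulltext_ltat.test_text values (?, ?)")) {
            for (Map.Entry<String, Set<String>> entry : wordsByPath.entrySet()) {
                for (String word : entry.getValue()) {
                    ps.setString(1, entry.getKey()); // file path
                    ps.setString(2, word);
                    ps.executeUpdate();
                    executed++;
                }
            }
        }
        return executed;
    }
}
```

In the question's main method, the equivalent change would be to open the connection before for (File file : listOfFiles) and close it after the loop ends.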

MysqlAccess would look like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class MysqlAccess {

    private final Connection connect;

    public MysqlAccess() throws SQLException {
        // Set up the connection with the DB once; every insert reuses it
        connect = DriverManager.getConnection(
                "jdbc:mysql://126.32.3.20/fulltext_ltat?" + "user=root&password=root");
    }

    public void readDataBase(String path, String word) throws SQLException {
        // try-with-resources closes only the statement, so the connection
        // stays open for the next word
        try (PreparedStatement preparedStatement = connect.prepareStatement(
                "insert IGNORE into  fulltext_ltat.test_text values (default,?, ?) ")) {
            preparedStatement.setString(1, path);
            preparedStatement.setString(2, word);
            preparedStatement.executeUpdate();
        }
    }

    // Call once, after the last word has been inserted
    public void close() throws SQLException {
        connect.close();
    }

    private void writeResultSet(ResultSet resultSet) throws SQLException {
        // ResultSet is initially positioned before the first row
        while (resultSet.next()) {
            // Columns can be read by name or by column number,
            // which starts at 1, e.g. resultSet.getString(2);
            String path = resultSet.getString("path");
            String word = resultSet.getString("word");

            System.out.println();
            System.out.println("path: " + path);
            System.out.println("word: " + word);
        }
    }
}
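Once a single connection is reused, a further step that is sometimes taken is to send the inserts in batches instead of one round trip per word. This goes beyond what the answer proposes, so treat it as a hedged sketch: the class name BatchInsertSketch, the method insertWords and the batchSize parameter are hypothetical, and the two-column insert is the one from the question. It relies only on the standard JDBC calls addBatch, executeBatch, setAutoCommit and commit.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collection;

public class BatchInsertSketch {

    // Hypothetical helper: queues one insert per word and sends them in
    // groups of batchSize (must be > 0), all inside one transaction.
    // Returns the number of words queued.
    public static int insertWords(Connection connect, String path,
                                  Collection<String> words, int batchSize) throws SQLException {
        boolean previousAutoCommit = connect.getAutoCommit();
        connect.setAutoCommit(false); // commit once per file, not once per word
        int queued = 0;
        try (PreparedStatement ps = connect.prepareStatement(
                "insert IGNORE into fulltext_ltat.test_text values (?, ?)")) {
            for (String word : words) {
                ps.setString(1, path);
                ps.setString(2, word);
                ps.addBatch();
                queued++;
                if (queued % batchSize == 0) {
                    ps.executeBatch(); // flush a full batch
                }
            }
            if (queued % batchSize != 0) {
                ps.executeBatch(); // flush the remainder
            }
            connect.commit();
        } catch (SQLException e) {
            connect.rollback(); // discard the partially inserted file
            throw e;
        } finally {
            connect.setAutoCommit(previousAutoCommit);
        }
        return queued;
    }
}
```

How much this helps depends on the driver. With MySQL Connector/J, batches are generally only rewritten into multi-row inserts when the rewriteBatchedStatements=true connection property is set, so it is worth measuring both ways.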
