java - How to extract some text from a large file and write it to a smaller file

Question

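I'm fairly new to Java programming and am trying to build an application that will help some colleagues.

For background: I need to read the contents of a large file, possibly more than 400,000 lines, that contains XML but is not a valid XML document; it is more like a log.

What I want to build is an application where the user enters a unique ID and the file is scanned to see whether that ID exists. If it does (the unique ID usually appears several times in the generated XML), I want to walk backwards to the <documentRequestMessage> node, copy everything from that node up to its closing tag, and put it in its own document.

I know how to create the new document, but I'm struggling to work out how to essentially "find backwards" and copy everything up to the closing tag. Any help is much appreciated.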

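Edit

Unfortunately, so far I haven't been able to work out how to implement any of the three suggestions.

correlationId is the unique reference mentioned earlier.

My current code, which runs and prints the results to the console, is: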

String correlationId = correlationID.getText();
System.out.println("Looking for " + correlationId);
int lineCount = 0;
String line;

// try-with-resources closes the reader even if readLine() throws
try (BufferedReader bf = new BufferedReader(new FileReader(f))) {
    while ((line = bf.readLine()) != null) {
        lineCount++;
        int indexFound = line.indexOf(correlationId);

        if (indexFound > -1) {
            System.out.println("Found CorrelationID on line\t" + lineCount + "\t" + line);
        }
    }
}

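Any further help is much appreciated. I'm not asking anyone to write it for me, just for some very clear and basic instructions :) Please.

Edit 2

A copy of the file I'm trying to read and extract from can be found here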

score 1 · Accepted Answer

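As you read through the file looking for your unique ID, keep a reference to the most recent documentRequestMessage you have encountered. When you find the unique ID, you will already have the reference you need to extract the message.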

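"Reference" can mean a few things here. Since you are not traversing a DOM (the file is not valid XML), you will probably just store the position in the file where the documentRequestMessage starts. If you read through a stream that supports mark, such as a BufferedInputStream wrapped around a FileInputStream (a bare FileInputStream does not support it), you can simply mark/reset to store, and later return to, the position in the file where the message begins.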
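The mark/reset idea can be sketched roughly as follows. This is only an illustration under some assumptions: the class and method names (`MarkResetSketch`, `findBlock`, `copyBlock`) are made up for the example, each tag is assumed to sit on a line without a namespace prefix, and `readAheadLimit` is assumed to be at least as large as the longest block in characters (otherwise `reset()` can fail). It uses `BufferedReader`, which supports `mark` regardless of the underlying source.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

class MarkResetSketch {
    static final String OPEN = "<documentRequestMessage";
    static final String CLOSE = "</documentRequestMessage>";

    // Returns the first block whose lines contain id, or null if none does.
    static String findBlock(Reader source, String id, int readAheadLimit)
            throws IOException {
        BufferedReader reader = new BufferedReader(source);
        boolean inBlock = false;
        boolean idSeen = false;

        while (true) {
            if (!inBlock) {
                // Remember this position in case the next line opens a block.
                reader.mark(readAheadLimit);
            }
            String line = reader.readLine();
            if (line == null) {
                return null; // end of input, nothing matched
            }
            if (!inBlock && line.contains(OPEN)) {
                inBlock = true;
                idSeen = false;
            }
            if (inBlock) {
                if (line.contains(id)) {
                    idSeen = true;
                }
                if (line.contains(CLOSE)) {
                    if (idSeen) {
                        // Jump back to the line that opened this block.
                        reader.reset();
                        return copyBlock(reader);
                    }
                    inBlock = false;
                }
            }
        }
    }

    // Copies lines from the current position up to and including the close tag.
    static String copyBlock(BufferedReader reader) throws IOException {
        StringBuilder block = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            block.append(line).append('\n');
            if (line.contains(CLOSE)) {
                break;
            }
        }
        return block.toString();
    }
}
```

For a real log you would pass something like a `FileReader` for the file as `source`. Note that a large `readAheadLimit` makes the reader buffer that much text in memory, so simply collecting the lines in a `StringBuilder` as you go is often just as easy.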

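Here is an implementation of what I believe you are looking for. It makes a lot of assumptions based on the log file you linked, but it works for the sample file: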

private static void processMessages(File file, String correlationId)
{
    BufferedReader reader = null;

    try {
        boolean capture = false;
        StringBuilder buffer = new StringBuilder();
        String lastDRM = null;
        String line;

        reader = new BufferedReader(new FileReader(file));

        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();

            // Blank lines are boring
            if (trimmed.length() == 0) {
                continue;
            }

            // We only actively look for lines that start with an open
            // bracket (after trimming)
            if (trimmed.startsWith("[")) {
                // Do some house keeping - if we have data in our buffer, we
                // should check it to see if we are interested in it
                if (buffer.length() > 0) {
                    String message = buffer.toString();

                    // Something to note here... at this point you could
                    // create a legitimate DOM Document from 'message' if
                    // you wanted to

                    if (message.contains("documentRequestMessage")) {
                        // If the message contains 'documentRequestMessage'
                        // then we save it for later reference
                        lastDRM = message;
                    } else if (message.contains(correlationId)) {
                        // If the message contains the correlationId we are
                        // after, then print out the last message with the
                        // documentRequestMessage that we found, or an error
                        // if we never saw one.
                        if (lastDRM == null) {
                            System.out.println(
                                    "No documentRequestMessage found");
                        } else {
                            System.out.println(lastDRM);
                        }

                        // In either case, we're done here
                        break;
                    }

                    buffer.setLength(0);
                    capture = false;
                }

                // Based on the log file, the only interesting messages are
                // the ones that are DEBUG
                if (trimmed.contains("DEBUG")) {
                    // Some of the debug messages have the XML declaration
                    // on the same line, and some the line after, so let's
                    // figure out which is which...
                    if (trimmed.endsWith("?>")) {
                        buffer.append(
                                trimmed.substring(
                                    trimmed.indexOf("<?")));
                        buffer.append("\n");
                        capture = true;
                    } else if (trimmed.endsWith("Message:")) {
                        capture = true;
                    } else {
                        System.err.println("Can't handle line: " + trimmed);
                    }
                }
            } else {
                if (capture) {
                    buffer.append(line).append("\n");
                }
            }
        }
    } catch (IOException ex) {
        ex.printStackTrace(System.err);
    } finally {
        if (reader != null) {
            try {
                reader.close();
            } catch (IOException ex) {
                /* Ignore */
            }
        }
    }
}
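The comment in the code above notes that a captured message could be turned into a real DOM Document. A minimal sketch of that step, assuming the captured string is a well-formed XML document once trimmed, might look like this. The class name `MessageDomSketch`, the helper names, and the element name `correlationId` are hypothetical and chosen only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MessageDomSketch {

    // Parses one captured message into a DOM Document.
    // trim() matters: an XML declaration must be the very first thing in the input.
    static Document toDocument(String message) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(message.trim())));
    }

    // True if any element with the given tag name has exactly this text content.
    static boolean hasElementWithText(Document doc, String tagName, String text) {
        NodeList nodes = doc.getElementsByTagName(tagName);
        for (int i = 0; i < nodes.getLength(); i++) {
            if (text.equals(nodes.item(i).getTextContent().trim())) {
                return true;
            }
        }
        return false;
    }
}
```

Checking the ID through the DOM is stricter than `String.contains`, since it only matches the ID inside the element you name, but it will throw if a captured message turns out not to be well-formed.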

score 0 · Accepted Answer

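What you can do is read the contents of the file and look for <documentRequestMessage> elements. When you find one, keep reading until you reach </documentRequestMessage> and store the result in a list, so that every documentRequestMessage element ends up available in the list.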

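You can then iterate over the list, either at the end or as each element is added, to look for your unique ID. If you find it, write the element to an XML file; otherwise ignore it.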
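A rough sketch of this collect-then-filter approach is shown below. It assumes the opening and closing tags each appear on some line without a namespace prefix, and the names `CollectThenFilterSketch`, `collectBlocks` and `withId` are invented for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

class CollectThenFilterSketch {
    static final String OPEN = "<documentRequestMessage";
    static final String CLOSE = "</documentRequestMessage>";

    // Reads the whole source and returns every documentRequestMessage block.
    static List<String> collectBlocks(Reader source) throws IOException {
        List<String> blocks = new ArrayList<>();
        StringBuilder current = null; // null means "not inside a block"
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (current == null && line.contains(OPEN)) {
                    current = new StringBuilder();
                }
                if (current != null) {
                    current.append(line).append('\n');
                    if (line.contains(CLOSE)) {
                        blocks.add(current.toString());
                        current = null;
                    }
                }
            }
        }
        return blocks;
    }

    // Keeps only the blocks that mention the given ID.
    static List<String> withId(List<String> blocks, String id) {
        List<String> matches = new ArrayList<>();
        for (String block : blocks) {
            if (block.contains(id)) {
                matches.add(block);
            }
        }
        return matches;
    }
}
```

Keeping every block in memory is simple, but for a file of several hundred thousand lines it may be worth filtering as each block is completed instead of at the end.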

score 0 · Accepted Answer

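I'm assuming your log is a series of <documentRequestMessage> blocks.

Don't scan the log for the ID at all.

Read the log and, each time you encounter a <documentRequestMessage> header, start saving the contents of that <documentRequestMessage> block into a block area.

I'm not sure whether you have to parse the XML or whether you can just save it as a list of strings.

When you encounter the </documentRequestMessage> trailer, check whether the block's ID matches the ID you are looking for.

If the ID matches, write the <documentRequestMessage> block to your output file. If it doesn't, clear the block area and read on to the next <documentRequestMessage> header.

This way, your file reading involves no backtracking.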
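The steps above could be sketched as a single forward pass like the one below. It is only an outline under assumptions: the "block area" is a `StringBuilder`, the ID check is a plain substring test, tags are assumed to carry no namespace prefix, and the names `StreamingExtractSketch` and `extract` are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

class StreamingExtractSketch {
    static final String OPEN = "<documentRequestMessage";
    static final String CLOSE = "</documentRequestMessage>";

    // Writes every block that mentions id to out; returns how many were written.
    static int extract(Reader source, Writer out, String id) throws IOException {
        StringBuilder block = new StringBuilder(); // the "block area"
        boolean inBlock = false;
        int written = 0;

        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!inBlock && line.contains(OPEN)) {
                    inBlock = true; // header found, start saving
                }
                if (!inBlock) {
                    continue; // log noise between blocks
                }
                block.append(line).append('\n');
                if (line.contains(CLOSE)) {
                    // Trailer found: keep the block only if it has the ID.
                    if (block.indexOf(id) >= 0) {
                        out.write(block.toString());
                        written++;
                    }
                    block.setLength(0); // clear the block area
                    inBlock = false;
                }
            }
        }
        out.flush();
        return written;
    }
}
```

For real files, `source` and `out` could come from `Files.newBufferedReader` and `Files.newBufferedWriter`. Only one block is held in memory at a time, which is what makes this approach suitable for a very large log.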
