java - Split a very large text file by max rows

Question

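I want to split a huge file containing strings into a set of new (smaller) files, and I tried to use nio2.

I do not want to load the whole file into memory, so I tried it with BufferedReader.

The smaller text files should be limited by the number of text rows.

The solution works, but I want to ask whether someone knows a solution with better performance using Java 8 (maybe lambdas with the stream() API?) and nio2: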

public void splitTextFiles(Path bigFile, int maxRows) throws IOException{

        int i = 1;
        try(BufferedReader reader = Files.newBufferedReader(bigFile)){
            String line = null;
            int lineNum = 1;

            Path splitFile = Paths.get(i + "split.txt");
            BufferedWriter writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);

            while ((line = reader.readLine()) != null) {

                if(lineNum > maxRows){
                    writer.close();
                    lineNum = 1;
                    i++;
                    splitFile = Paths.get(i + "split.txt");
                    writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
                }

                writer.append(line);
                writer.newLine();
                lineNum++;
            }

            writer.close();
        }
}

score 4 · Accepted Answer

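Beware of the difference between the direct use of InputStreamReader/OutputStreamWriter and their subclasses and the Reader/Writer factory methods of Files. In the former case the system's default encoding is used when no explicit charset is given, while the latter always default to UTF-8. So I strongly recommend always specifying the desired charset, even if it is Charset.defaultCharset() or StandardCharsets.UTF_8, to document your intention and to avoid surprises if you switch between the various ways of creating a Reader or Writer.

If you want to split at line boundaries, there is no way around looking at the file's contents. So you can't optimize it the way you can when merging.

If you are willing to sacrifice portability, you could try some optimizations. If you know that the charset encoding unambiguously maps '\n' to (byte)'\n', as is the case for most single-byte encodings as well as for UTF-8, you can scan for line breaks at the byte level to get the file positions for the split and avoid any data transfer from your application to the I/O system.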
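As a minimal sketch of that recommendation (the class name CharsetDemo and the method roundTrip are made up for illustration), the following writes and reads a file with the charset passed explicitly to the Files factory methods, so the result does not depend on the platform default:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class CharsetDemo {
    // Hypothetical helper: writes 'text' as one line and reads it back,
    // using the same explicit charset in both directions.
    static String roundTrip(Path file, String text, Charset cs) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(file, cs)) {
            w.write(text);
            w.newLine();
        }
        try (BufferedReader r = Files.newBufferedReader(file, cs)) {
            return r.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("charset-demo", ".txt");
        try {
            // Non-ASCII sample text, written with unicode escapes to keep the source ASCII-only.
            String sample = "Gr\u00f6\u00dfe";
            System.out.println(sample.equals(roundTrip(tmp, sample, StandardCharsets.UTF_8)));
            System.out.println(sample.equals(roundTrip(tmp, sample, StandardCharsets.ISO_8859_1)));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Passing the same Charset to both sides is what makes the round trip safe; mixing a default-encoding Writer with a UTF-8 Reader is the kind of surprise the paragraph above warns about.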

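import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import static java.nio.file.StandardOpenOption.*;

public void splitTextFiles(Path bigFile, int maxRows) throws IOException {
    MappedByteBuffer bb;
    try(FileChannel in = FileChannel.open(bigFile, READ)) {
        // the mapping stays valid after the channel has been closed
        bb=in.map(FileChannel.MapMode.READ_ONLY, 0, in.size());
    }
    for(int start=0, pos=0, end=bb.remaining(), i=1, lineNum=1; pos<end; lineNum++) {
        // advance pos just past the next '\n' (or to the end of the file)
        while(pos<end && bb.get(pos++)!='\n');
        // keep scanning until maxRows lines are collected or the file ends
        if(lineNum < maxRows && pos<end) continue;
        Path splitFile = Paths.get(i++ + "split.txt");
        // if you want to overwrite existing files use CREATE, TRUNCATE_EXISTING
        try(FileChannel out = FileChannel.open(splitFile, CREATE_NEW, WRITE)) {
            // write the bytes in [start, pos) to the next part
            bb.position(start).limit(pos);
            while(bb.hasRemaining()) out.write(bb);
            bb.clear(); // resets position and limit only; the content is untouched
            start=pos;
            lineNum = 0;
        }
    }
}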

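The drawbacks are that it doesn't work with encodings like UTF-16 or EBCDIC and that, unlike BufferedReader.readLine(), it doesn't support a lone '\r' as a line terminator, as used in the old Mac OS 9.

Further, it only supports files smaller than 2 GB; the limit is likely even smaller on 32-bit JVMs due to the limited virtual address space. For files larger than the limit, it would be necessary to iterate over chunks of the source file and map them one after another.

These issues could be fixed, but that would raise the complexity of this approach. Given that the speed improvement is only about 15% on my machine (I didn't expect much more, as I/O dominates here) and would be even smaller as the complexity grows, I don't think it's worth it.

The bottom line is that for this task the Reader/Writer approach is sufficient, but you should take care about the Charset used for the operation.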
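To illustrate the encoding restriction, here is a small sketch (the class NewlineBytes and its method are hypothetical names) that checks whether a charset encodes "\n" as the single byte 0x0A. This is only a necessary condition for the byte-level scan: in an encoding such as UTF-16 the byte 0x0A can also occur inside other characters, so a failed check is a clear signal not to use the approach, while a passed check is not a full proof.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NewlineBytes {
    // True if "\n" encodes to exactly one byte with the value 0x0A in the given charset.
    static boolean newlineIsSingleLfByte(Charset cs) {
        byte[] encoded = "\n".getBytes(cs);
        return encoded.length == 1 && encoded[0] == (byte) '\n';
    }

    public static void main(String[] args) {
        Charset[] candidates = {
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_16LE,
            StandardCharsets.UTF_16BE
        };
        for (Charset cs : candidates) {
            System.out.println(cs + ": " + newlineIsSingleLfByte(cs));
        }
    }
}
```

UTF-8 and ISO-8859-1 pass the check, while both UTF-16 variants fail it because they encode the line feed as two bytes.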
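A rough sketch of what iterating over mapped chunks could look like follows. It is not part of the original answer's code: the class ChunkedScan, the method splitOffsets and the chunkSize parameter are assumptions made for illustration. It only computes the end offset of each part, under the same assumption that '\n' is a single byte in the file's encoding; because the scan is byte by byte, a line that straddles two chunks needs no special handling.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

class ChunkedScan {
    // Returns the end offset (exclusive) of each part when splitting after every maxRows lines.
    // The file is mapped in windows of at most chunkSize bytes, so no single mapping
    // has to cover the whole file.
    static List<Long> splitOffsets(Path file, int maxRows, int chunkSize) throws IOException {
        if (maxRows < 1 || chunkSize < 1) {
            throw new IllegalArgumentException("maxRows and chunkSize must be positive");
        }
        List<Long> ends = new ArrayList<>();
        try (FileChannel in = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = in.size();
            long lastEnd = 0;   // end offset of the most recently completed part
            int lineNum = 0;    // lines seen since the last completed part
            for (long base = 0; base < size; base += chunkSize) {
                int len = (int) Math.min(chunkSize, size - base);
                MappedByteBuffer bb = in.map(FileChannel.MapMode.READ_ONLY, base, len);
                for (int p = 0; p < len; p++) {
                    if (bb.get(p) == '\n' && ++lineNum == maxRows) {
                        lastEnd = base + p + 1;
                        ends.add(lastEnd);
                        lineNum = 0;
                    }
                }
            }
            if (lastEnd < size) {
                ends.add(size); // trailing part with fewer than maxRows lines
            }
        }
        return ends;
    }
}
```

Each resulting range could then be copied into its own file, for example with FileChannel.transferTo; that step is left out here to keep the sketch short.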
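For completeness, a minimal sketch of the Reader/Writer approach with an explicit Charset could look like the following. The class name LineSplitter, the outDir parameter and the returned part count are assumptions added for illustration; the part naming follows the question's "1split.txt", "2split.txt" scheme.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class LineSplitter {
    // Splits bigFile into "1split.txt", "2split.txt", ... inside outDir, each holding
    // at most maxRows lines, and returns the number of files written.
    static int split(Path bigFile, Path outDir, int maxRows, Charset cs) throws IOException {
        if (maxRows < 1) {
            throw new IllegalArgumentException("maxRows must be positive");
        }
        int parts = 0;
        try (BufferedReader reader = Files.newBufferedReader(bigFile, cs)) {
            String line = reader.readLine();
            while (line != null) {
                Path part = outDir.resolve(++parts + "split.txt");
                // A part is only opened once there is a line to put into it,
                // and try-with-resources closes it even if a write fails.
                try (BufferedWriter writer = Files.newBufferedWriter(part, cs)) {
                    for (int n = 0; n < maxRows && line != null; n++) {
                        writer.write(line);
                        writer.newLine();
                        line = reader.readLine();
                    }
                }
            }
        }
        return parts;
    }
}
```

Note that newLine() writes the platform's line separator, so line endings in the parts may differ from those in the source file.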

score 1 · Accepted Answer

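I made a slight modification to @nimo23's code, adding the option of a header and a footer for each split file. It also writes the files into a directory with the same name as the original file, with _split appended to it. The code is below: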

public static void splitTextFiles(String fileName, int maxRows, String header, String footer) throws IOException
    {
        Path bigFile = Paths.get(fileName).toAbsolutePath();
        String name = bigFile.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String ext = dot < 0 ? "" : name.substring(dot);
        String fileNoExt = dot < 0 ? name : name.substring(0, dot);

        // resolveSibling/resolve keep the path separator portable instead of hard-coding "\\"
        Path newDir = bigFile.resolveSibling(fileNoExt + "_split");
        Files.createDirectories(newDir);
        int i = 0;
        BufferedWriter writer = null;
        try (BufferedReader reader = Files.newBufferedReader(bigFile))
        {
            String line;
            int lineNum = 1;
            while ((line = reader.readLine()) != null)
            {
                if (lineNum == 1)
                {
                    // open the next part lazily, so that no footer-only file is created
                    // when the row count is an exact multiple of maxRows
                    i++;
                    Path splitFile = newDir.resolve(fileNoExt + "_" + String.format("%03d", i) + ext);
                    writer = Files.newBufferedWriter(splitFile); // CREATE and TRUNCATE_EXISTING by default
                    writer.append(header);
                    writer.newLine();
                }
                writer.append(line);
                writer.newLine();
                lineNum++;
                if (lineNum > maxRows)
                {
                    writer.append(footer);
                    writer.close();
                    writer = null;
                    lineNum = 1;
                }
            }
            if (writer != null) // last part holds fewer than maxRows rows
            {
                writer.append(footer);
            }
        }
        finally
        {
            if (writer != null) writer.close();
        }

        System.out.println("file '" + name + "' split into " + i + " files");
    }
