17

I have two files on my hard drive (2 GB each) and want to compare them with each other:

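  • Copying the original files with Windows Explorer takes approx. 2-4 minutes (that is, reading and writing on the same physical and logical disk).
  • Reading both files with java.io.FileInputStream and comparing the byte arrays byte by byte takes 20+ minutes.
  • The java.io.BufferedInputStream buffer is 64 KB; the files are read in chunks and then compared.
  • The comparison is done in a tight loop like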

    // numRead[] holds the byte counts of the two most recent reads
    int len = Math.min(numRead[0], numRead[1]);
    for (int k = 0; k < len; k++)
    {
       if (buffer[1][k] != buffer[0][k])
       {
          return buffer[0][k] - buffer[1][k];
       }
    }
    

What can I do to speed this up? Is NIO supposed to be faster than plain streams? Is Java unable to use DMA/SATA technologies, making slow OS API calls instead?

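EDIT:
Thanks for the answers. I did some experiments based on them. As Andreas showed:

  • Streams and the nio approach do not differ much.
  • More important is the correct buffer size.

My own experiments confirmed this. Since the files are read in big chunks, even an additional buffer (BufferedInputStream) does not gain anything. Optimizing the comparison is possible, and I got the best results with 32-fold unrolling, but the time spent comparing is small relative to the disk reads, so the speedup is small. It looks like there is nothing I can do ;-(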

4

10 Answers

16

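I tried out three different methods of comparing two identical 3.8 GB files, with buffer sizes between 8 kB and 1 MB. The first method used just two buffered input streams.

The second approach uses a thread pool that reads in two different threads and compares in a third one. This got slightly higher throughput at the expense of high CPU utilization. Managing the thread pool adds a lot of overhead for such short-running tasks.

The third approach uses nio, as posted by laginimaineb.

As you can see, the general approach does not make much difference. More important is the correct buffer size.

Strangely, I read 1 byte less when using threads. I could not spot the error.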

comparing just with two streams
I was equal, even after 3684070360 bytes and reading for 704813 ms (4,98MB/sec * 2) with a buffer size of 8 kB
I was equal, even after 3684070360 bytes and reading for 578563 ms (6,07MB/sec * 2) with a buffer size of 16 kB
I was equal, even after 3684070360 bytes and reading for 515422 ms (6,82MB/sec * 2) with a buffer size of 32 kB
I was equal, even after 3684070360 bytes and reading for 534532 ms (6,57MB/sec * 2) with a buffer size of 64 kB
I was equal, even after 3684070360 bytes and reading for 422953 ms (8,31MB/sec * 2) with a buffer size of 128 kB
I was equal, even after 3684070360 bytes and reading for 793359 ms (4,43MB/sec * 2) with a buffer size of 256 kB
I was equal, even after 3684070360 bytes and reading for 746344 ms (4,71MB/sec * 2) with a buffer size of 512 kB
I was equal, even after 3684070360 bytes and reading for 669969 ms (5,24MB/sec * 2) with a buffer size of 1024 kB
comparing with threads
I was equal, even after 3684070359 bytes and reading for 602391 ms (5,83MB/sec * 2) with a buffer size of 8 kB
I was equal, even after 3684070359 bytes and reading for 523156 ms (6,72MB/sec * 2) with a buffer size of 16 kB
I was equal, even after 3684070359 bytes and reading for 527547 ms (6,66MB/sec * 2) with a buffer size of 32 kB
I was equal, even after 3684070359 bytes and reading for 276750 ms (12,69MB/sec * 2) with a buffer size of 64 kB
I was equal, even after 3684070359 bytes and reading for 493172 ms (7,12MB/sec * 2) with a buffer size of 128 kB
I was equal, even after 3684070359 bytes and reading for 696781 ms (5,04MB/sec * 2) with a buffer size of 256 kB
I was equal, even after 3684070359 bytes and reading for 727953 ms (4,83MB/sec * 2) with a buffer size of 512 kB
I was equal, even after 3684070359 bytes and reading for 741000 ms (4,74MB/sec * 2) with a buffer size of 1024 kB
comparing with nio
I was equal, even after 3684070360 bytes and reading for 661313 ms (5,31MB/sec * 2) with a buffer size of 8 kB
I was equal, even after 3684070360 bytes and reading for 656156 ms (5,35MB/sec * 2) with a buffer size of 16 kB
I was equal, even after 3684070360 bytes and reading for 491781 ms (7,14MB/sec * 2) with a buffer size of 32 kB
I was equal, even after 3684070360 bytes and reading for 317360 ms (11,07MB/sec * 2) with a buffer size of 64 kB
I was equal, even after 3684070360 bytes and reading for 643078 ms (5,46MB/sec * 2) with a buffer size of 128 kB
I was equal, even after 3684070360 bytes and reading for 865016 ms (4,06MB/sec * 2) with a buffer size of 256 kB
I was equal, even after 3684070360 bytes and reading for 716796 ms (4,90MB/sec * 2) with a buffer size of 512 kB
I was equal, even after 3684070360 bytes and reading for 652016 ms (5,39MB/sec * 2) with a buffer size of 1024 kB

The code used:

import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.util.Arrays;
import java.util.concurrent.*;

public class FileCompare {

    private static final int MIN_BUFFER_SIZE = 1024 * 8;
    private static final int MAX_BUFFER_SIZE = 1024 * 1024;
    private String fileName1;
    private String fileName2;
    private long start;
    private long totalbytes;

    @Before
    public void createInputStream() {
        fileName1 = "bigFile.1";
        fileName2 = "bigFile.2";
    }

    @Test
    public void compareTwoFiles() throws IOException {
        System.out.println("comparing just with two streams");
        int currentBufferSize = MIN_BUFFER_SIZE;
        while (currentBufferSize <= MAX_BUFFER_SIZE) {
            compareWithBufferSize(currentBufferSize);
            currentBufferSize *= 2;
        }
    }

    @Test
    public void compareTwoFilesFutures() 
            throws IOException, ExecutionException, InterruptedException {
        System.out.println("comparing with threads");
        int myBufferSize = MIN_BUFFER_SIZE;
        while (myBufferSize <= MAX_BUFFER_SIZE) {
            start = System.currentTimeMillis();
            totalbytes = 0;
            compareWithBufferSizeFutures(myBufferSize);
            myBufferSize *= 2;
        }
    }

    @Test
    public void compareTwoFilesNio() throws IOException {
        System.out.println("comparing with nio");
        int myBufferSize = MIN_BUFFER_SIZE;
        while (myBufferSize <= MAX_BUFFER_SIZE) {
            start = System.currentTimeMillis();
            totalbytes = 0;
            boolean wasEqual = isEqualsNio(myBufferSize);

            if (wasEqual) {
                printAfterEquals(myBufferSize);
            } else {
                Assert.fail("files were not equal");
            }

            myBufferSize *= 2;
        }

    }

    private void compareWithBufferSize(int myBufferSize) throws IOException {
        final BufferedInputStream inputStream1 =
                new BufferedInputStream(
                        new FileInputStream(new File(fileName1)),
                        myBufferSize);
        byte[] buff1 = new byte[myBufferSize];
        final BufferedInputStream inputStream2 =
                new BufferedInputStream(
                        new FileInputStream(new File(fileName2)),
                        myBufferSize);
        byte[] buff2 = new byte[myBufferSize];
        int read1;

        start = System.currentTimeMillis();
        totalbytes = 0;
        while ((read1 = inputStream1.read(buff1)) != -1) {
            totalbytes += read1;
            int read2 = inputStream2.read(buff2);
            if (read1 != read2) {
                break;
            }
            if (!Arrays.equals(buff1, buff2)) {
                break;
            }
        }
        if (read1 == -1) {
            printAfterEquals(myBufferSize);
        } else {
            Assert.fail("files were not equal");
        }
        inputStream1.close();
        inputStream2.close();
    }

    private void compareWithBufferSizeFutures(int myBufferSize)
            throws ExecutionException, InterruptedException, IOException {
        final BufferedInputStream inputStream1 =
                new BufferedInputStream(
                        new FileInputStream(
                                new File(fileName1)),
                        myBufferSize);
        final BufferedInputStream inputStream2 =
                new BufferedInputStream(
                        new FileInputStream(
                                new File(fileName2)),
                        myBufferSize);

        final boolean wasEqual = isEqualsParallel(myBufferSize, inputStream1, inputStream2);

        if (wasEqual) {
            printAfterEquals(myBufferSize);
        } else {
            Assert.fail("files were not equal");
        }
        inputStream1.close();
        inputStream2.close();
    }

    private boolean isEqualsParallel(int myBufferSize
            , final BufferedInputStream inputStream1
            , final BufferedInputStream inputStream2)
            throws InterruptedException, ExecutionException {
        final byte[] buff1Even = new byte[myBufferSize];
        final byte[] buff1Odd = new byte[myBufferSize];
        final byte[] buff2Even = new byte[myBufferSize];
        final byte[] buff2Odd = new byte[myBufferSize];
        final Callable<Integer> read1Even = new Callable<Integer>() {
            public Integer call() throws Exception {
                return inputStream1.read(buff1Even);
            }
        };
        final Callable<Integer> read2Even = new Callable<Integer>() {
            public Integer call() throws Exception {
                return inputStream2.read(buff2Even);
            }
        };
        final Callable<Integer> read1Odd = new Callable<Integer>() {
            public Integer call() throws Exception {
                return inputStream1.read(buff1Odd);
            }
        };
        final Callable<Integer> read2Odd = new Callable<Integer>() {
            public Integer call() throws Exception {
                return inputStream2.read(buff2Odd);
            }
        };
        final Callable<Boolean> oddEqualsArray = new Callable<Boolean>() {
            public Boolean call() throws Exception {
                return Arrays.equals(buff1Odd, buff2Odd);
            }
        };
        final Callable<Boolean> evenEqualsArray = new Callable<Boolean>() {
            public Boolean call() throws Exception {
                return Arrays.equals(buff1Even, buff2Even);
            }
        };

        ExecutorService executor = Executors.newCachedThreadPool();
        boolean isEven = true;
        Future<Integer> read1 = null;
        Future<Integer> read2 = null;
        Future<Boolean> isEqual = null;
        int lastSize = 0;
        while (true) {
            if (isEqual != null) {
                if (!isEqual.get()) {
                    return false;
                } else if (lastSize == -1) {
                    return true;
                }
            }
            if (read1 != null) {
                lastSize = read1.get();
                // read() returns -1 at EOF; adding it made the total one byte short
                if (lastSize > 0) {
                    totalbytes += lastSize;
                }
                final int size2 = read2.get();
                if (lastSize != size2) {
                    return false;
                }
            }
            isEven = !isEven;
            if (isEven) {
                if (read1 != null) {
                    isEqual = executor.submit(oddEqualsArray);
                }
                read1 = executor.submit(read1Even);
                read2 = executor.submit(read2Even);
            } else {
                if (read1 != null) {
                    isEqual = executor.submit(evenEqualsArray);
                }
                read1 = executor.submit(read1Odd);
                read2 = executor.submit(read2Odd);
            }
        }
    }

    private boolean isEqualsNio(int myBufferSize) throws IOException {
        FileChannel first = null, seconde = null;
        try {
            first = new FileInputStream(fileName1).getChannel();
            seconde = new FileInputStream(fileName2).getChannel();
            if (first.size() != seconde.size()) {
                return false;
            }
            ByteBuffer firstBuffer = ByteBuffer.allocateDirect(myBufferSize);
            ByteBuffer secondBuffer = ByteBuffer.allocateDirect(myBufferSize);
            int firstRead, secondRead;
            while (first.position() < first.size()) {
                firstRead = first.read(firstBuffer);
                totalbytes += firstRead;
                secondRead = seconde.read(secondBuffer);
                if (firstRead != secondRead) {
                    return false;
                }
                if (!nioBuffersEqual(firstBuffer, secondBuffer, firstRead)) {
                    return false;
                }
            }
            return true;
        } finally {
            if (first != null) {
                first.close();
            }
            if (seconde != null) {
                seconde.close();
            }
        }
    }

    private static boolean nioBuffersEqual(ByteBuffer first, ByteBuffer second, final int length) {
        if (first.limit() != second.limit() || length > first.limit()) {
            return false;
        }
        first.rewind();
        second.rewind();
        for (int i = 0; i < length; i++) {
            if (first.get() != second.get()) {
                return false;
            }
        }
        return true;
    }

    private void printAfterEquals(int myBufferSize) {
        NumberFormat nf = new DecimalFormat("#.00");
        final long dur = System.currentTimeMillis() - start;
        double seconds = dur / 1000d;
        // divide as double so the megabyte count is not truncated
        double megabytes = totalbytes / 1024d / 1024d;
        double rate = (megabytes) / seconds;
        System.out.println("I was equal, even after " + totalbytes
                + " bytes and reading for " + dur
                + " ms (" + nf.format(rate) + "MB/sec * 2)" +
                " with a buffer size of " + myBufferSize / 1024 + " kB");
    }
}
answered 2009-06-11T10:20:30.737
8

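With files this large, you are going to get much better performance with java.nio.

Also, reading single bytes with Java streams can be very slow. Using a byte array (2-6K elements in my own experience; YMMV, as it seems to be platform/application specific) will dramatically improve your read performance with streams.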
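
As a rough illustration of reading in chunks rather than byte by byte, here is a minimal sketch. It assumes Java 9 or later (for `InputStream.readNBytes` and the range form of `Arrays.equals`); the class name `ChunkedCompare` and the method `sameContent` are made up for this example, and the buffer size is left to the caller because the right value has to be measured.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not taken from the question or any library.
class ChunkedCompare {

    /** Returns true if both files have identical content. */
    static boolean sameContent(Path a, Path b, int bufferSize) throws IOException {
        if (Files.size(a) != Files.size(b)) {
            return false; // files of different length can never match
        }
        byte[] bufA = new byte[bufferSize];
        byte[] bufB = new byte[bufferSize];
        try (InputStream inA = Files.newInputStream(a);
             InputStream inB = Files.newInputStream(b)) {
            while (true) {
                // readNBytes keeps reading until the buffer is full or EOF is hit,
                // so both sides always deliver chunks of the same size
                int nA = inA.readNBytes(bufA, 0, bufferSize);
                int nB = inB.readNBytes(bufB, 0, bufferSize);
                if (nA != nB) {
                    return false;
                }
                if (nA == 0) {
                    return true; // both streams are exhausted
                }
                // compare only the bytes that were actually read
                if (!Arrays.equals(bufA, 0, nA, bufB, 0, nB)) {
                    return false;
                }
            }
        }
    }
}
```

A call such as `ChunkedCompare.sameContent(Path.of("bigFile.1"), Path.of("bigFile.2"), 64 * 1024)` would be the starting point; trying several buffer sizes on the target machine is still needed.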

answered 2009-06-08T11:00:29.757
7

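Reading and writing the files with Java can be just as fast. You can use FileChannels. As for comparing the files, comparing byte by byte will obviously take a lot of time. Here is an example using FileChannels and ByteBuffers (it could be optimized further):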

public static boolean compare(String firstPath, String secondPath, final int BUFFER_SIZE) throws IOException {
    FileChannel firstIn = null, secondIn = null;
    try {
        firstIn = new FileInputStream(firstPath).getChannel();
        secondIn = new FileInputStream(secondPath).getChannel();
        if (firstIn.size() != secondIn.size())
            return false;
        ByteBuffer firstBuffer = ByteBuffer.allocateDirect(BUFFER_SIZE);
        ByteBuffer secondBuffer = ByteBuffer.allocateDirect(BUFFER_SIZE);
        int firstRead, secondRead;
        while (firstIn.position() < firstIn.size()) {
            firstRead = firstIn.read(firstBuffer);
            secondRead = secondIn.read(secondBuffer);
            if (firstRead != secondRead)
                return false;
            if (!buffersEqual(firstBuffer, secondBuffer, firstRead))
                return false;
        }
        return true;
    } finally {
        if (firstIn != null) firstIn.close();
        if (secondIn != null) secondIn.close();
    }
}

private static boolean buffersEqual(ByteBuffer first, ByteBuffer second, final int length) {
    if (first.limit() != second.limit())
        return false;
    if (length > first.limit())
        return false;
    first.rewind(); second.rewind();
    for (int i=0; i<length; i++)
        if (first.get() != second.get())
            return false;
    return true;
}
answered 2009-06-08T11:01:48.040
6

The following is a good article on the relative merits of the different ways to read a file in Java. It may be of some use:

How to read files quickly

answered 2009-06-08T11:08:02.287
6

After modifying your NIO compare function I get the following results.

I was equal, even after 4294967296 bytes and reading for 304594 ms (13.45MB/sec * 2) with a buffer size of 1024 kB
I was equal, even after 4294967296 bytes and reading for 225078 ms (18.20MB/sec * 2) with a buffer size of 4096 kB
I was equal, even after 4294967296 bytes and reading for 221351 ms (18.50MB/sec * 2) with a buffer size of 16384 kB

Note: this means the two files together are being read at a rate of 37 MB/s.

Running the same thing on a faster drive:

I was equal, even after 4294967296 bytes and reading for 178087 ms (23.00MB/sec * 2) with a buffer size of 1024 kB
I was equal, even after 4294967296 bytes and reading for 119084 ms (34.40MB/sec * 2) with a buffer size of 4096 kB
I was equal, even after 4294967296 bytes and reading for 109549 ms (37.39MB/sec * 2) with a buffer size of 16384 kB

Note: this means the two files together are being read at a rate of 74.8 MB/s.

private static boolean nioBuffersEqual(ByteBuffer first, ByteBuffer second, final int length) {
    if (first.limit() != second.limit() || length > first.limit()) {
        return false;
    }
    first.rewind();
    second.rewind();
    int i;
    for (i = 0; i < length-7; i+=8) {
        if (first.getLong() != second.getLong()) {
            return false;
        }
    }
    for (; i < length; i++) {
        if (first.get() != second.get()) {
            return false;
        }
    }
    return true;
}
answered 2009-06-12T06:58:44.847
3

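I found that a lot of the articles linked to in this post are really outdated (there is also some very insightful stuff). Some of the linked articles are from 2001, and that information is questionable at best. Martin Thompson of Mechanical Sympathy wrote quite a bit about this in 2011. Please refer to what he wrote for the background and theory.

I have found that NIO versus non-NIO has very little to do with performance. It is much more about the size of your output buffers (read: the byte array, in this case). NIO is no magic make-it-go-fast web-scale sauce.

I was able to take Martin's examples, use the 1.0-era OutputStream, and make it scream. NIO is fast too, but the biggest indicator is just the size of the output buffer, not whether you use NIO, unless of course you are using memory-mapped NIO, in which case it matters. :)

If you want up-to-date, authoritative information on this, see Martin's blog: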

http://mechanical-sympathy.blogspot.com/2011/12/java-sequential-io-performance.html

If you want to see how NIO does not make that much of a difference (I was able to write examples using regular IO that were faster), see this:

http://www.dzone.com/links/fast_java_io_nio_is_always_faster_than_fileoutput.html

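I have tested my assumption on a new Windows laptop with a fast hard disk, a laptop with an SSD, an EC2 xlarge, and an EC2 4x large with maxed-out IOPS/high-speed I/O (and soon on a large-disk NAS fiber disk array), so it works (there are some issues with smaller EC2 instances, but if you care about performance... are you going to use a small EC2 instance?). If you use real hardware, traditional IO has always won in my tests so far. If you use high-I/O EC2, it is also a clear winner. If you use underpowered EC2 instances, NIO can win.

There is no substitute for benchmarking.

Anyway, I am no expert; I just did some empirical testing using the framework that Sir Martin Thompson wrote up in his blog post.

I took this a step further and used Files.newInputStream (from JDK 7) with TransferQueue to create a recipe for making Java I/O scream (even on small EC2 instances). The recipe can be found at the bottom of this documentation for Boon ( https://github.com/RichardHightower/boon/wiki/Auto-Growable-Byte-Buffer-like-a-ByteBuilder ). This lets me use a traditional OutputStream, but with something that works well on smaller EC2 instances. (I am the main author of Boon. But I am accepting new authors. The pay sucks: $0 per hour. But the good news is, I can double your pay whenever you like.)

My 2 cents.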

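See this for why TransferQueue is important: http://php.sabscape.com/blog/?p=557

Key learnings:

  1. If you care about performance, never, ever, ever use BufferedOutputStream.
  2. NIO does not always equal performance.
  3. Buffer size matters most.
  4. Recycling buffers is critical for high-speed writes.
  5. GC can/will/does wreck your performance for high-speed writes.
  6. You have to have some mechanism for reusing spent buffers.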
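
To make points 4 to 6 a little more concrete, here is a minimal sketch of a buffer pool. It is not the Boon recipe linked above; the class `BufferPool` and its methods `acquire` and `release` are hypothetical names chosen for illustration. It only shows the idea of handing spent buffers back through a queue (here a `LinkedTransferQueue`) instead of allocating a fresh array for every write.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedTransferQueue;

// Hypothetical sketch of buffer recycling; not the implementation used by Boon.
class BufferPool {
    private final BlockingQueue<byte[]> free = new LinkedTransferQueue<>();
    private final int bufferSize;

    BufferPool(int count, int bufferSize) {
        this.bufferSize = bufferSize;
        // pre-allocate the buffers once so the hot path creates no garbage
        for (int i = 0; i < count; i++) {
            free.add(new byte[bufferSize]);
        }
    }

    /** Hands out a recycled buffer, or allocates a new one if the pool is empty. */
    byte[] acquire() {
        byte[] buf = free.poll();
        return buf != null ? buf : new byte[bufferSize];
    }

    /** Returns a spent buffer so that a later write can reuse it. */
    void release(byte[] buf) {
        if (buf.length == bufferSize) {
            free.offer(buf);
        }
    }
}
```

A producer thread would call `acquire()`, fill the buffer and hand it to the writer; the writer would call `release()` once the bytes are on disk. Whether this actually helps is something only a benchmark on the target hardware can show.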
answered 2013-11-09T22:48:39.840
2

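You can have a look at Sun's article on I/O Tuning (although it is already a bit dated); maybe you can find similarities between the examples there and your code. Also have a look at the java.nio package, which contains faster I/O elements than java.io. Dr. Dobb's Journal has a quite nice article on high-performance IO using java.nio.

If so, there are further examples and tuning tips there that should help you speed up your code.

Moreover, the Arrays class has built-in methods for comparing byte arrays; maybe these can also be used to make things faster and clean up your loop a bit.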
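
For instance, on Java 9 or later the hand-written loop could be replaced by `Arrays.mismatch`, which also reports where the buffers differ. A minimal sketch, with the class name `BufferDiff` and the method `firstDifference` invented for this example:

```java
import java.util.Arrays;

// Hypothetical helper showing Arrays.mismatch (available since Java 9).
class BufferDiff {

    /**
     * Returns the index of the first differing byte within the first
     * `length` bytes of both buffers, or -1 if that range is identical.
     */
    static int firstDifference(byte[] first, byte[] second, int length) {
        // only the range [0, length) is compared, so stale bytes left over
        // from an earlier, larger read are ignored
        return Arrays.mismatch(first, 0, length, second, 0, length);
    }
}
```

Passing the number of bytes actually read as `length` matters for the last chunk of a file, which is usually shorter than the buffer.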

answered 2009-06-08T11:06:28.467
1

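For a better comparison, try copying two files at once. A hard drive can read one file much more efficiently than two (as the head has to move back and forth to read). One way to reduce this is to use larger buffers, e.g. 16 MB, with ByteBuffer.

With ByteBuffer you can compare 8 bytes at a time by comparing long values with getLong().

If your Java is efficient, most of the work is in the disk/OS for reading and writing, so it shouldn't be much slower than using any other language (as the disk/OS is the bottleneck).

Don't assume Java is slow until you have determined it isn't a bug in your code.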
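
One way to rule out such a bug is to cross-check a hand-written comparison against the JDK itself. A minimal sketch, assuming Java 12 or later, where `Files.mismatch` was added; the wrapper class `MismatchCheck` and the method `firstMismatch` are hypothetical names:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical wrapper around the JDK's own file comparison (Java 12+).
class MismatchCheck {

    /** Returns the offset of the first differing byte, or -1 if the files are identical. */
    static long firstMismatch(Path a, Path b) throws IOException {
        return Files.mismatch(a, b);
    }
}
```

If the hand-written code and `Files.mismatch` disagree on whether two files are equal, the bug is most likely in the hand-written code. Timing both on the same files also gives a baseline for how fast a plain Java comparison can be on that disk.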

answered 2009-06-09T06:32:19.717
0

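DMA/SATA are hardware/low-level technologies and are not visible to any programming language.

For memory-mapped input/output you should use java.nio, I believe.

Are you sure that you aren't reading those files one byte at a time? That would be wasteful. I'd recommend doing it block by block, and each block should be something like 64 megabytes to minimize seeking.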
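
A minimal sketch of what memory-mapped comparison in 64 MB blocks could look like. The class `MappedCompare`, the method `sameContent` and the `WINDOW` constant are assumptions made for this example; the files are mapped window by window because a single mapping cannot exceed `Integer.MAX_VALUE` bytes.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch; whether mapping beats plain reads has to be measured.
class MappedCompare {
    // 64 MB per mapping, following the block size suggested above
    private static final long WINDOW = 64L * 1024 * 1024;

    static boolean sameContent(Path a, Path b) throws IOException {
        return sameContent(a, b, WINDOW);
    }

    /** Compares both files window by window; `window` must not exceed Integer.MAX_VALUE. */
    static boolean sameContent(Path a, Path b, long window) throws IOException {
        try (FileChannel chA = FileChannel.open(a, StandardOpenOption.READ);
             FileChannel chB = FileChannel.open(b, StandardOpenOption.READ)) {
            long size = chA.size();
            if (size != chB.size()) {
                return false;
            }
            for (long pos = 0; pos < size; pos += window) {
                long len = Math.min(window, size - pos);
                MappedByteBuffer mA = chA.map(FileChannel.MapMode.READ_ONLY, pos, len);
                MappedByteBuffer mB = chB.map(FileChannel.MapMode.READ_ONLY, pos, len);
                // ByteBuffer.equals compares the remaining bytes of both buffers
                if (!mA.equals(mB)) {
                    return false;
                }
            }
            return true;
        }
    }
}
```

Note that on Windows a mapped file may stay locked until the mapping is garbage collected, which can matter if the files are to be deleted or replaced right after the comparison.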

answered 2009-06-08T11:03:07.130
-1

Try setting the buffer on the input stream to up to several megabytes.

answered 2009-06-13T13:47:10.380