java - Parallel file processing in Java

Question

I have the following code:

import java.io.*;
import java.util.concurrent.* ;
public class Example{
public static void main(String args[]) {
    try {
        FileOutputStream fos = new FileOutputStream("1.dat");
        DataOutputStream dos = new DataOutputStream(fos);

        for (int i = 0; i < 200000; i++) {
            dos.writeInt(i);
        }
        dos.close(); // first of the two sample files is written

        FileOutputStream fos1 = new FileOutputStream("2.dat");
        DataOutputStream dos1 = new DataOutputStream(fos1);

        for (int i = 200000; i < 400000; i++) {
            dos1.writeInt(i);
        }
        dos1.close();

        Exampless.createArray(200000); //Create a shared array
        Exampless ex1 = new Exampless("1.dat");
        Exampless ex2 = new Exampless("2.dat");
        ExecutorService executor = Executors.newFixedThreadPool(2); // run in parallel to count the number of matches in the two files
        long startTime = System.nanoTime();
        long endTime;
        Future<Integer> future1 = executor.submit(ex1);
        Future<Integer> future2 = executor.submit(ex2);
        int count1 = future1.get();
        int count2 = future2.get();
        endTime = System.nanoTime();
        long duration = endTime - startTime;
        System.out.println("duration with threads:"+duration);
        executor.shutdown();
        System.out.println("Matches: " + (count1 + count2));

        startTime = System.nanoTime();
        ex1.call();
        ex2.call();
        endTime = System.nanoTime();
        duration = endTime - startTime;
        System.out.println("duration without threads:"+duration);

    } catch (Exception e) {
        System.err.println("Error: " + e.getMessage());
    }
}
}

class Exampless implements Callable<Integer> {

public static int[] arr = new int[20000];
public String _name;

public Exampless(String name) {
    this._name = name;
}

static void createArray(int z) {
    for (int i = z; i < z + 20000; i++) { // fill the shared array with z .. z + 19999
        arr[i - z] = i;
    }
}

public Integer call() {
    // try-with-resources closes the stream even if a read fails
    try (DataInputStream din = new DataInputStream(new FileInputStream(_name))) {
        int cnt = 0;
        for (int i = 0; i < 20000; i++) { // read the file and count the matches
            int c = din.readInt();
            if (c == arr[i]) {
                cnt++;
            }
        }
        return cnt;
    } catch (Exception e) {
        System.err.println("Error: " + e.getMessage());
    }
    return -1;
}

}

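I am trying to count how many values in the array match the contents of two files. Even though I run it on two threads, the code performs poorly, because:

(time to read file 1 + file 2 on a single thread) < (time to read file 1 || file 2 with multiple threads).

Can anyone help me solve this? (I have a 2-core CPU, and the file size is about 1.5 GB.)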

score 7 · Accepted Answer

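In the first case you are reading one file sequentially, byte after byte, block after block. This is as fast as disk I/O gets, provided the file is not heavily fragmented. When you finish the first file, the disk/OS seeks to the beginning of the second file and continues reading the disk linearly, which is very efficient.

In the second case you keep switching between the first and the second file, forcing the disk to seek from one place to another. This extra seek time (roughly 10 ms) is the root of your confusion.

Oh, and you do know that disk access is single-threaded and that your task is I/O-bound, so as long as you read from the same physical disk, splitting this task across multiple threads cannot help? Your approach can only be justified if: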

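each thread, besides reading from the file, also performs some CPU-intensive or blocking operation that is an order of magnitude slower than the I/O
the files are on different physical drives (different partitions are not enough) or on some RAID configurations
you are using an SSD drive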
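To make the first case concrete, here is a minimal sketch of reading the files strictly one after the other on a single thread. The class name `SequentialRead`, the method `readAll`, and the 64 KiB buffer size are assumptions chosen for illustration, not something from the question.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class SequentialRead {

    // Reads each file from start to finish before opening the next one,
    // so a single disk is never asked to alternate between two files.
    // Returns the total number of bytes read.
    static long readAll(List<Path> files) throws IOException {
        byte[] buf = new byte[1 << 16]; // 64 KiB block size: an assumption, tune as needed
        long total = 0;
        for (Path p : files) {
            try (InputStream in = Files.newInputStream(p)) {
                int n;
                while ((n = in.read(buf)) != -1) {
                    total += n; // real code would process buf[0..n) here
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // File names are taken from the command line, e.g. 1.dat 2.dat
        List<Path> files = new ArrayList<>();
        for (String a : args) {
            files.add(Paths.get(a));
        }
        System.out.println("bytes read: " + readAll(files));
    }
}
```

Timing this against a two-thread version on the same drive is a simple way to check whether seeking is the bottleneck on your hardware.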

score 1 · Accepted Answer

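As Tomasz pointed out, you will not gain anything from multithreading when reading the data from disk. You might see some speedup if you multithread the checks instead, i.e. load the data from the files sequentially into arrays and then have the threads perform the checks in parallel. But given that your files are small (~80 KB) and that you are only comparing ints, I doubt the performance gain would be worth the effort.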
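A rough sketch of that idea, assuming the file contents have already been loaded sequentially into an `int[]`: the index range is split into chunks and each chunk is checked by a pool thread. `ParallelCheck`, `countRange` and `countParallel` are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelCheck {

    // Counts the positions i in [from, to) where data[i] == expected[i].
    static int countRange(int[] data, int[] expected, int from, int to) {
        int cnt = 0;
        for (int i = from; i < to; i++) {
            if (data[i] == expected[i]) {
                cnt++;
            }
        }
        return cnt;
    }

    // Splits the index range into roughly equal chunks, checks each chunk
    // on a pool thread, and sums the partial counts. threads must be >= 1.
    static int countParallel(int[] data, int[] expected, int threads) {
        int n = Math.min(data.length, expected.length);
        int chunk = (n + threads - 1) / threads; // ceiling division
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> parts = new ArrayList<>();
            for (int from = 0; from < n; from += chunk) {
                final int lo = from;
                final int hi = Math.min(n, from + chunk);
                parts.add(pool.submit(() -> countRange(data, expected, lo, hi)));
            }
            int total = 0;
            for (Future<Integer> f : parts) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in data; in the question these would come from the file and the shared array.
        int[] data = new int[20000];
        int[] expected = new int[20000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
            expected[i] = i;
        }
        System.out.println("Matches: " + countParallel(data, expected, 2));
    }
}
```

For arrays this small, the cost of creating the pool and submitting tasks is likely to outweigh the comparison work, which is the point made above.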

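One thing that will definitely improve your execution speed is not using readInt(). Since you know you are comparing 20000 ints, you should read all 20000 ints for each file into an array at once (or at least in blocks), instead of calling readInt() 20000 times.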
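One possible way to do the bulk read, as a sketch rather than the only option: fetch all the bytes with a single `readFully` call and decode them through an `IntBuffer`. This relies on `DataOutputStream.writeInt` writing big-endian ints, which is also `ByteBuffer`'s default byte order. The class and method names (`BulkRead`, `countMatchesInBytes`, `countMatchesInFile`) are made up for the example.

```java
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

class BulkRead {

    // Decodes big-endian ints from raw and counts the positions
    // where the decoded value equals expected[i].
    static int countMatchesInBytes(byte[] raw, int[] expected) {
        IntBuffer ints = ByteBuffer.wrap(raw).asIntBuffer(); // big-endian by default
        int n = Math.min(ints.remaining(), expected.length);
        int cnt = 0;
        for (int i = 0; i < n; i++) {
            if (ints.get(i) == expected[i]) {
                cnt++;
            }
        }
        return cnt;
    }

    // Reads the first expected.length ints of the file in one call
    // instead of one readInt() per value.
    static int countMatchesInFile(String fileName, int[] expected) throws IOException {
        byte[] raw = new byte[expected.length * Integer.BYTES];
        try (DataInputStream din = new DataInputStream(new FileInputStream(fileName))) {
            din.readFully(raw); // throws EOFException if the file is shorter than expected
        }
        return countMatchesInBytes(raw, expected);
    }

    public static void main(String[] args) throws IOException {
        int[] expected = new int[20000];
        for (int i = 0; i < expected.length; i++) {
            expected[i] = i;
        }
        // Write a temporary sample file in the same format as the question's 1.dat.
        Path tmp = Files.createTempFile("bulk", ".dat");
        try (DataOutputStream dos = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(tmp)))) {
            for (int v : expected) {
                dos.writeInt(v);
            }
        }
        System.out.println("Matches: " + countMatchesInFile(tmp.toString(), expected));
        Files.delete(tmp);
    }
}
```

If you would rather keep the readInt() loop, wrapping the FileInputStream in a BufferedInputStream should remove most of the per-call overhead as well.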
