java - 编写“压缩”数组以提高 IO 性能？

Question

我有一个长度为 2.2 亿（固定）的 int 和 float 数组。现在，我想将这些数组存储/上传到内存和磁盘。目前，我正在使用 Java NIO 的 FileChannel 和 MappedByteBuffer 来解决这个问题。它工作正常，但需要大约 5 秒（挂钟时间）将阵列存储/上传到内存到磁盘/从内存上传到磁盘。现在，我想让它更快。

在这里，我应该提到这些数组元素中的大多数都是 0（接近 52%）。

喜欢：

int arr1 [] = { 0 , 0 , 6 , 7 , 1, 0 , 0 ...}

任何人都可以帮助我，有什么好方法可以通过不存储或加载那些 0 来提高速度。这可以通过使用 Arrays.fill (array , 0) 来补偿。

score 5 · Accepted Answer

以下方法需要磁盘上的 n / 8 + nz * 4 字节，其中 n 是数组的大小，而 nz 是非零条目的数量。对于 52% 的零条目，您可以将存储大小减少 52% - 3% = 49%。

你可以这样做：

void write(int[] array) {
    BitSet zeroes = new BitSet();
    for (int i = 0; i < array.length; i++)
        zeroes.set(i, array[i] == 0);
    write(zeroes); // one bit per index
    for (int i = 0; i < array.length; i++)
        if (array[i] != 0)
            write(array[y]);
}

int[] read() {
    BitSet zeroes = readBitSet();
    array = new int[zeroes.length];
    for (int i = 0; i < zeroes.length; i++) {
        if (zeroes.get(i)) {
            // nothing to do (array[i] was initialized to 0)
        } else {
            array[i] = readInt();
        }
    }
}

编辑：你说这稍微慢一点意味着磁盘不是瓶颈。您可以通过在构建时写入位集来调整上述方法，因此您不必在将位集写入磁盘之前将其写入内存。此外，通过逐字写入位集并穿插实际数据，我们可以只对数组进行一次传递，从而减少缓存未命中：

void write(int[] array) {
    writeInt(array.length);
    int ni;
    for (int i = 0; i < array.length; i = ni) {
        ni = i + 32;
        int zeroesMap = 0;
        for (j = i + 31; j >= i; j--) {
            zeroesMap <<= 1;
            if (array[j] == 0) {
                zeroesMap |= 1;
            }
        }
        writeInt(zeroesMap);
        for (j = i; j < ni; j++)
            if (array[j] != 0) {
                writeInt(array[j]);
            }
        }
    }
}

int[] read() {
    int[] array = new int[readInt()];
    int ni;
    for (int i = 0; i < array.length; i = ni) {
        ni = i + 32;
        zeroesMap = readInt();
        for (j = i; j < ni; j++) {
            if (zeroesMap & 1 == 1) {
                // nothing to do (array[i] was initialized to 0)
            } else {
                array[j] = readInt();
            }
            zeroesMap >>= 1;
        }
    }
    return array;
}

（前面的代码假设 array.length 是 32 的倍数。如果不是，则以您喜欢的任何方式写入数组的最后一个切片）

如果这也不能减少处理时间，那么压缩不是要走的路（我认为任何通用压缩算法都不会比上述更快）。

score 4 · Accepted Answer

根据分布，考虑运行长度编码：

运行长度编码 (RLE) 是一种非常简单的数据压缩形式，其中数据运行（即，相同数据值出现在许多连续数据元素中的序列）被存储为单个数据值和计数，而不是作为原始运行。这对于包含许多此类运行的数据最有用。

这很简单......这很好，也可能很糟糕;-)

score 2 · Accepted Answer

如果您愿意自己编写序列化-反序列化代码，您可以存储一系列指示这些零所在位置的范围（使用特殊标记）以及实际的非零数据，而不是存储所有零。

因此，您示例中的数组： { 0 , 0 , 6 , 7 , 1, 0 , 0 ...} 可以存储为：

%0-1、6、7、1 %5-6

读取此数据时，如果您点击 %，则表示您有一个范围，您读取开始和结束并填充零。然后你继续看到一个非#，这意味着你达到了一个实际值。

在具有大量连续值序列的稀疏数组中，这将产生很大的压缩。

score 2 · Accepted Answer

java中有一个标准的压缩工具：java.util.zip - 它是通用库，但由于绝对的可用性是一个不错的解决方案。如果需要，应该研究专门的压缩，编码，我很少推荐 zip 作为选择的灵魂。

这是一个如何通过 .zip 处理 zip 的示例Deflater/Inflater。大多数人都知道 ZipInput/Output Stream（尤其是 Gzip）。他们在处理来自 mem->zlib 和 esp 的副本时都有缺点。GZip 完全是一场灾难，因为 CRC32 调用了本机代码（调用本机代码会消除优化能力并引入更多性能损失）。

几个重要的注意事项：不要提高 zip 压缩率，这会破坏任何性能 - 当然可以尝试并在 CPU 和磁盘活动之间调整最佳比例。

该代码还展示了一个真正的缺点java.util.zip- 它不支持直接缓冲区。支持是微不足道的，但没有人费心去做。直接缓冲区将节省少量内存副本并减少内存占用。

最后一点： (j)zlib有 java 版本，它击败了原生 impl。在压缩上相当不错。

package t1;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Random;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ZInt {
    private static final int bucketSize = 1<<17;//in real world should not be const, but we bored horribly
    static final int zipLevel = 2;//feel free to experiement, higher compression (5+)is likely to be total waste
    
    static void write(int[] a, File file, boolean sync) throws IOException{
        byte[] bucket = new byte[Math.min(bucketSize,  Math.max(1<<13, Integer.highestOneBit(a.length >>3)))];//128KB bucket
        byte[] zipOut = new byte[bucket.length];
        
        final FileOutputStream fout = new FileOutputStream(file);
        FileChannel channel = fout.getChannel();
        try{
            
            ByteBuffer buf = ByteBuffer.wrap(bucket);
            //unfortunately java.util.zip doesn't support Direct Buffer - that would be the perfect fit
            ByteBuffer out = ByteBuffer.wrap(zipOut);
            out.putInt(a.length);//write length aka header
            if (a.length==0){
                doWrite(channel, out, 0);
                return;
            }
            
            Deflater deflater = new Deflater(zipLevel, false);
            try{
                for (int i=0;i<a.length;){
                    i = put(a, buf, i);
                    buf.flip();
                    deflater.setInput(bucket, buf.position(), buf.limit());

                    if (i==a.length)
                        deflater.finish();

                    //hacking and using bucket here is tempting since it's copied twice but well
                    for (int n; (n= deflater.deflate(zipOut, out.position(), out.remaining()))>0;){
                        doWrite(channel, out, n);
                    }
                    buf.clear();
                }
                
            }finally{
                deflater.end();
            }
        }finally{
            if (sync) 
                fout.getFD().sync();
            channel.close();
        }
    }

    static int[] read(File file) throws IOException, DataFormatException{
        FileChannel channel = new FileInputStream(file).getChannel();
        try{
            byte[] in = new byte[(int)Math.min(bucketSize, channel.size())];
            ByteBuffer buf = ByteBuffer.wrap(in);

            channel.read(buf);
            buf.flip();
            int[] a = new int[buf.getInt()];
            if (a.length==0)
                return a;
            int i=0;
            byte[] inflated = new byte[Math.min(1<<17, a.length*4)];
            ByteBuffer intBuffer = ByteBuffer.wrap(inflated);
            Inflater inflater = new Inflater(false);
            try{
                do{
                    if (!buf.hasRemaining()){
                        buf.clear();
                        channel.read(buf);
                        buf.flip();
                    }
                    inflater.setInput(in, buf.position(), buf.remaining());
                    buf.position(buf.position()+buf.remaining());//simulate all read

                    for (;;){
                        int n = inflater.inflate(inflated,intBuffer.position(), intBuffer.remaining());
                        if (n==0)
                            break;
                        intBuffer.position(intBuffer.position()+n).flip();
                        for (;intBuffer.remaining()>3 && i<a.length;i++){//need at least 4 bytes to form an int
                            a[i] = intBuffer.getInt();
                        }
                        intBuffer.compact();
                    }

                }while (channel.position()<channel.size() && i<a.length);
            }finally{
                inflater.end();
            }
            //          System.out.printf("read ints: %d - channel.position:%d %n", i, channel.position());
            return a;
        }finally{
            channel.close();
        }
    }

    private static void doWrite(FileChannel channel, ByteBuffer out, int n) throws IOException {
        out.position(out.position()+n).flip();
        while (out.hasRemaining())
            channel.write(out);
        out.clear();
    }
    private static int put(int[] a, ByteBuffer buf, int i) {
        for (;buf.hasRemaining() && i<a.length;){
            buf.putInt(a[i++]);
        }
        return i;
    }
    
    private static int[] generateRandom(int len){
        Random r = new Random(17);
        int[] n = new int[len];
        for (int i=0;i<len;i++){
            n[i]= r.nextBoolean()?0: r.nextInt(1<<23);//limit bounds to have any sensible compression
        }
        return n;
    }
    public static void main(String[] args) throws Throwable{
        File file = new File("xxx.xxx");
        int[] n = generateRandom(3000000); //{0,2,4,1,2,3};
        long start = System.nanoTime();
        write(n, file, false);
        long elapsed = System.nanoTime() - start;//elapsed will be fairer if the sync is true
        
        System.out.printf("File length: %d, for %d ints, ratio %.2f in %.2fms %n", file.length(), n.length, ((double)file.length())/4/n.length, java.math.BigDecimal.valueOf(elapsed, 6) );
        
        int[] m = read(file);
        
        //compare, Arrays.equals doesn't return position, so it sucks/kinda
        for (int i=0; i<n.length; i++){
            if (m[i]!=n[i]){
                System.err.printf("Failed at %d%n",i);
                break;
            }
        }
        System.out.printf("All done!");
    };
    
}

_{请注意，代码不是正确的基准！}
延迟回复的原因是代码很无聊，又是一个 zip 示例，抱歉

java - 编写“压缩”数组以提高 IO 性能？

4 回答 4

Related

Reference