python - Compress numpy arrays efficiently

Question

I tried various methods to compress some numpy arrays when saving them to disk.

These 1D arrays contain data sampled at a certain sampling rate (it can be sound recorded with a microphone, or any other measurement with any sensor): the data is essentially continuous (in a mathematical sense; of course, after sampling it is discrete data).

I tried HDF5 (h5py):

f.create_dataset("myarray1", data=myarray, compression="gzip", compression_opts=9)

but this is quite slow, and the compression ratio is not the best we can expect.

I also tried with

numpy.savez_compressed()

but once again it may not be the best compression algorithm for the kind of data described above.

What would you choose to get a better compression ratio on a numpy array with such data?

(I thought about things like lossless FLAC (initially designed for audio), but is there an easy way to apply such an algorithm to numpy data?)

score 27 · Accepted Answer

What I do now:

import gzip
import numpy

f = gzip.GzipFile("my_array.npy.gz", "w")
numpy.save(file=f, arr=my_array)
f.close()
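
For completeness, reading the file back might look like the sketch below. It assumes a small stand-in array and a temporary directory (both made up for the example), and uses a `with` block so the file is closed even if saving fails.

```python
import gzip
import os
import tempfile

import numpy as np

# Stand-in data and path, invented for this sketch.
my_array = np.arange(1000, dtype=np.int32)
path = os.path.join(tempfile.mkdtemp(), "my_array.npy.gz")

# Write: the context manager closes the gzip stream on exit.
with gzip.GzipFile(path, "w") as f:
    np.save(f, my_array)

# Read: numpy.load accepts the decompressing file object directly.
with gzip.GzipFile(path, "r") as f:
    restored = np.load(f)
```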

score 17 · Accepted Answer

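Noise is incompressible. Thus, any part of your data that is noise will go into the compressed data 1:1 regardless of the compression algorithm, unless you discard it somehow (lossy compression). If you have 24 bits per sample with an effective number of bits (ENOB) equal to 16 bits, the remaining 24-16 = 8 bits of noise will limit your maximum lossless compression ratio to 3:1, even if your (noiseless) data is perfectly compressible. Non-uniform noise is compressible to the extent to which it is non-uniform; you probably want to look at the effective entropy of the noise to determine how compressible it is.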
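
A quick way to convince yourself of both points is the sketch below. It feeds uniformly random bytes to zlib (the ratio should stay at about 1:1) and computes the bound from the 24-bit / 16-bit ENOB example. The helper name `max_lossless_ratio` is made up for the illustration.

```python
import zlib

import numpy as np

# Uniform random bytes: a stand-in for pure noise.
rng = np.random.default_rng(0)
noise = rng.integers(0, 256, size=1_000_000, dtype=np.uint8).tobytes()
packed = zlib.compress(noise, 9)
noise_ratio = len(noise) / len(packed)   # stays close to 1


def max_lossless_ratio(bits_per_sample, enob):
    """Upper bound on the ratio when the non-effective bits are pure noise.

    Hypothetical helper; requires enob < bits_per_sample.
    """
    noise_bits = bits_per_sample - enob
    return bits_per_sample / noise_bits


bound = max_lossless_ratio(24, 16)       # the 3:1 case from the text
```
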
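Compressing data is based on modelling it (partly to remove redundancy, but also partly so that you can separate the signal from the noise and discard the noise). For example, if you know your data is bandwidth-limited to 10MHz and you are sampling at 200MHz, you can do an FFT, zero out the high frequencies, and store only the coefficients for the low frequencies (in this example: 10:1 compression). There is a whole field called "compressive sensing" which is related to this.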
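
The FFT idea can be sketched roughly as follows. The test signal (two tones below 10MHz) and the block length are assumptions chosen so the tones fall on exact FFT bins; real data would need windowing or overlapping blocks, which is not shown.

```python
import numpy as np

fs = 200e6         # sampling rate in Hz, as in the example above
bandwidth = 10e6   # known signal bandwidth in Hz
n = 4000           # block length (assumed), gives 50 kHz bin spacing

t = np.arange(n) / fs
x = np.sin(2 * np.pi * 1e6 * t) + 0.5 * np.sin(2 * np.pi * 7e6 * t)

spectrum = np.fft.rfft(x)                # n // 2 + 1 complex bins, 0 .. fs/2
n_keep = int(bandwidth / (fs / n)) + 1   # bins from DC up to the bandwidth
kept = spectrum[:n_keep].copy()          # only these need to be stored

# Reconstruction: pad the dropped bins with zeros and invert.
padded = np.zeros(n // 2 + 1, dtype=complex)
padded[:n_keep] = kept
x_rec = np.fft.irfft(padded, n=n)

# Real samples in, versus real + imaginary parts stored.
ratio = x.size / (2 * kept.size)
```

Here `ratio` comes out just under 10, matching the 10:1 figure, and `x_rec` agrees with `x` up to floating-point error because the test signal really is band-limited.
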
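A practical suggestion, suitable for many kinds of reasonably continuous data: denoise -> bandwidth limit -> delta compress -> gzip (or xz, etc.). Denoising can be the same step as bandwidth limiting, or it can be a nonlinear filter such as a running median. Bandwidth limiting can be implemented with FIR/IIR filters. Delta compression is just y[n] = x[n] - x[n-1].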
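
The last two stages of that pipeline (delta, then a general-purpose compressor) might look like this. It is only a sketch: the denoise and bandwidth-limit steps are skipped, zlib stands in for gzip/xz, and the test signal (a slow sine with a little noise) is made up. The first sample is stored as-is so the transform can be inverted exactly.

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)
n = 200_000
smooth = 2**15 * np.sin(2 * np.pi * np.arange(n) / 10_000)
x = np.round(smooth + rng.normal(0, 2, n)).astype(np.int32)


def delta_encode(x):
    """y[0] = x[0], y[n] = x[n] - x[n-1] for n >= 1."""
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - x[:-1]
    return y


def delta_decode(y):
    """Invert delta_encode with a running sum in the same integer type."""
    return np.cumsum(y, dtype=y.dtype)


y = delta_encode(x)
raw_size = len(zlib.compress(x.tobytes(), 9))
delta_size = len(zlib.compress(y.tobytes(), 9))
```

On slowly varying data like this the deltas are small numbers, so `delta_size` should come out smaller than `raw_size`, and `delta_decode(y)` returns the original samples exactly.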

EDIT, an illustration:

import os.path
import subprocess

import numpy

# create 1M data points of a 24-bit sine wave with 8 bits of gaussian noise (ENOB=16)
N = 1000000
data = (numpy.sin(2 * numpy.pi * numpy.linspace(0, N, N) / 100) * (1 << 23) +
        numpy.random.randn(N) * (1 << 7)).astype(numpy.int32)

numpy.save('data.npy', data)
print(os.path.getsize('data.npy'))
# 4000080 uncompressed size (the exact header size depends on the NumPy version)

# xz replaces data.npy with data.npy.xz
subprocess.call(['xz', '-9', 'data.npy'])
print(os.path.getsize('data.npy.xz'))
# 1484192 compressed size
# 11.87 bits per sample, ~8 bits of that is noise

# drop the 8 noisy low bits; floor division keeps the array integer-typed
data_quantized = data // (1 << 8)
numpy.save('data_quantized.npy', data_quantized)
subprocess.call(['xz', '-9', 'data_quantized.npy'])
print(os.path.getsize('data_quantized.npy.xz'))
# 318380
# still have 16 bits of signal, but only takes 2.55 bits per sample to store it
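
If the `xz` command-line tool is not available, the same experiment can be approximated in-process with the standard-library `lzma` module, which writes the same .xz container format. This is a sketch under different settings from the run above (a smaller N and the default preset rather than `-9`), so the numbers will not match exactly; `bits_per_sample` is a helper name made up here.

```python
import lzma

import numpy as np

rng = np.random.default_rng(0)
N = 100_000  # smaller than the 1M above so the sketch runs quickly
signal = np.sin(2 * np.pi * np.arange(N) / 100) * (1 << 23)
data = (signal + rng.standard_normal(N) * (1 << 7)).astype(np.int32)


def bits_per_sample(arr):
    """Compressed size in bits divided by the number of samples."""
    packed = lzma.compress(arr.tobytes())  # default preset, .xz container
    return 8 * len(packed) / arr.size


bits_full = bits_per_sample(data)                   # signal plus 8 noisy bits
bits_quantized = bits_per_sample(data // (1 << 8))  # noisy low bits dropped
```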

score 4 · Accepted Answer

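Saving an HDF5 file with compression can be very quick and efficient: it all depends on the compression algorithm, and on whether you want it to be quick when saving, when reading back, or both. And, naturally, on the data itself, as explained above. GZIP tends to be somewhere in between, but with a low compression ratio. BZIP2 is slow on both sides, although with a better ratio. BLOSC is one of the algorithms I have found that gets decent compression and is quick on both ends. The downside of BLOSC is that it is not implemented in all implementations of HDF5, so your program may not be portable. You always need to run at least some tests to select the best configuration for your needs.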
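
A minimal way to run such tests is sketched below. It does not use HDF5 or BLOSC at all: it times the standard-library codecs (zlib, bz2, lzma) on a made-up test signal, just to show the shape of the comparison. Swap in your own data and the codecs you actually have available.

```python
import bz2
import lzma
import time
import zlib

import numpy as np

# Made-up test data: a slow sine with a little noise, stored as int32.
rng = np.random.default_rng(0)
n = 200_000
x = (10_000 * np.sin(2 * np.pi * np.arange(n) / 5_000)
     + rng.normal(0, 3, n)).astype(np.int32)
raw = x.tobytes()

codecs = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

results = {}
for name, (compress, decompress) in codecs.items():
    t0 = time.perf_counter()
    packed = compress(raw)
    t1 = time.perf_counter()
    unpacked = decompress(packed)
    t2 = time.perf_counter()
    results[name] = {
        "ratio": len(raw) / len(packed),
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
        "lossless": unpacked == raw,
    }

for name, r in results.items():
    print(f"{name:5s} ratio={r['ratio']:.2f} "
          f"compress={r['compress_s']:.3f}s decompress={r['decompress_s']:.3f}s")
```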

score 2 · Accepted Answer

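What constitutes the best compression (if any) depends heavily on the nature of the data. Many kinds of measurement data are virtually incompressible if lossless compression is really required.

The pytables docs contain a lot of useful guidelines on data compression. They also detail speed tradeoffs and so on; higher compression levels are usually a waste of time, as it turns out.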

http://pytables.github.io/usersguide/optimization.html

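Note that this is probably about as good as it will get. For integer measurements, a combination of a shuffle filter with a simple zip-type compression usually works reasonably well. This filter very efficiently exploits the common situation where the most significant byte is usually 0 and is only included to guard against overflow.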
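
To see what a byte shuffle does, here is a pure-NumPy imitation followed by zlib. It is only a sketch of the idea, not the HDF5/pytables filter itself: the helper names and the test signal (small positive integers stored as int32) are made up, and the input is assumed to be a contiguous 1D array.

```python
import zlib

import numpy as np


def byte_shuffle(arr):
    """Emit byte 0 of every item, then byte 1 of every item, and so on."""
    as_bytes = arr.view(np.uint8).reshape(-1, arr.itemsize)
    return as_bytes.T.tobytes()


def byte_unshuffle(buf, dtype, count):
    """Invert byte_shuffle for `count` items of the given dtype."""
    itemsize = np.dtype(dtype).itemsize
    planes = np.frombuffer(buf, dtype=np.uint8).reshape(itemsize, count)
    return np.ascontiguousarray(planes.T).view(dtype).reshape(count)


# Made-up measurement: values around 1000, so the top two bytes are always 0.
rng = np.random.default_rng(0)
n = 200_000
x = (1000 + 500 * np.sin(2 * np.pi * np.arange(n) / 2000)
     + rng.normal(0, 40, n)).astype(np.int32)

plain_size = len(zlib.compress(x.tobytes(), 6))
shuffled_size = len(zlib.compress(byte_shuffle(x), 6))
```

After shuffling, the always-zero high bytes sit in long runs instead of being interleaved with the noisy low bytes, which is what lets a simple zip-type compressor do better.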

score 1 · Accepted Answer

You may want to try blz. It can compress binary data very efficiently.

import blz
# this stores the array in memory
blz.barray(myarray) 
# this stores the array on disk
blz.barray(myarray, rootdir='arrays')

It stores arrays either in a file or compressed in memory. Compression is based on blosc. See the scipy video for a bit of context.

score 1 · Accepted Answer

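First, for general data sets, the shuffle=True argument to create_dataset improves compression dramatically for roughly continuous data. It very cleverly rearranges the bytes to be compressed so that (for continuous data) they change slowly, which means they can be compressed better. In my experience it slows compression down very slightly but can substantially improve the compression ratio. It is not lossy, so you really do get the same data out as you put in.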

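If you don't care so much about accuracy, you can also use the scaleoffset argument to limit the number of bits stored. Be careful, though, because this is not what it might sound like. In particular, it is an absolute precision rather than a relative precision. For example, if you pass scaleoffset=8 but your data points are smaller than 1e-8, you will just get zeros. Of course, if you have scaled the data to max out around 1 and don't think you can hear differences smaller than one part in a million, you can pass scaleoffset=6 and get great compression without much work.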
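
The absolute-versus-relative point can be illustrated without h5py by rounding to a fixed number of decimal places. This is only an emulation of the precision loss, under the assumption that scaleoffset on floating-point data behaves like keeping that many digits after the decimal point; it says nothing about how HDF5 actually packs the bits. The helper name `keep_decimals` is made up.

```python
import numpy as np


def keep_decimals(x, digits):
    """Round to a fixed number of decimal places (absolute precision)."""
    return np.round(x, decimals=digits)


# Values well below 1e-8 (all under half of it): 8 decimals wipes them out.
tiny = np.array([3e-9, -2e-9, 4e-10])
tiny_stored = keep_decimals(tiny, 8)

# Data scaled to max out around 1: 6 decimals loses under a part in a million.
rng = np.random.default_rng(0)
audio = rng.uniform(-1.0, 1.0, 10_000)
audio_stored = keep_decimals(audio, 6)
max_error = np.max(np.abs(audio - audio_stored))
```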

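But for audio specifically, I expect you are right in wanting to use FLAC, because its developers have put in a huge amount of thought, balancing compression against preservation of distinguishable details. You can convert to WAV with scipy, and from there to FLAC.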
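
For the WAV step, here is a minimal sketch. It uses the standard-library `wave` module rather than scipy, purely to avoid an extra dependency, and it writes to an in-memory buffer that stands in for a real file. The 440 Hz test tone and the 16-bit mono format are assumptions; the FLAC step would then be done by an external encoder and is not shown.

```python
import io
import wave

import numpy as np

rate = 44_100                        # samples per second (assumed)
t = np.arange(rate) / rate           # one second of audio
samples = np.round(0.5 * 32767 * np.sin(2 * np.pi * 440 * t)).astype("<i2")

buf = io.BytesIO()                   # in-memory stand-in for a .wav file
with wave.open(buf, "wb") as w:
    w.setnchannels(1)                # mono
    w.setsampwidth(2)                # 2 bytes = 16-bit PCM
    w.setframerate(rate)
    w.writeframes(samples.tobytes())

# Read it back to check that the PCM data survived unchanged.
buf.seek(0)
with wave.open(buf, "rb") as r:
    restored = np.frombuffer(r.readframes(r.getnframes()), dtype="<i2")
```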
