matlab - Matlab: Fastest way to read parts/sequences of a large binary file

Question

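I want to read parts of a large (approx. 11 GB) binary file. The currently working solution is to load the entire file (raw_data) with fread() and then crop out the parts of interest (data).

Question: Is there a faster way to read small parts of a file (1-2% of the total file, partially sequential reads), given something like a binary mask (i.e. a logical index of the specific bytes of interest) in Matlab? Specifics below.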

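Notes for my specific case:

the data of interest (26e+6 bytes, or approx. 24 MB) is roughly 2% of raw_data (1.2e+10 bytes, or approx. 11 GB)
every 600,000 bytes contain approx. 6,500 bytes to read, which can be broken down into roughly 1,200 read-skip cycles (e.g. "read 10 bytes, skip 5,000 bytes")
the read instructions for the entire file can be broken down into roughly 20,000 similar (but not exactly equal) sets of read-skip cycles (i.e. approx. 20,000 x 1,200 read-skip cycles)
the file is read from a GPFS (parallel file system)
plenty of RAM, the newest Matlab version, and all toolboxes are available for the task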

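My initial idea of an fread-fseek loop turned out to be much slower than reading the whole file (see pseudo-code below). Profiling showed that fread() is the slowest part (it is called more than a million times, which is probably obvious to the experts here).

Alternatives I considered: memmapfile() [ref], which as far as I can tell offers no workable way to read multiple small parts. The MappedTensor library might be the next thing I look into. Related but not helpful, just to link the articles: 1, 2.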

%open file
fi = fopen('data.bin');

%example read-skip data
f_reads = [20  10   6  20  40];  %read this number of bytes
f_skips = [900 6000 40 300 600]; %skip these bytes after each read instruction
nbr_read_skip_cycles = numel(f_reads); %number of read-skip cycles

data = []; %save the result here
fseek(fi,90000,'bof'); %skip initial bytes until first read

%read the file
for ind = 1:nbr_read_skip_cycles
  tmp_data = fread(fi,f_reads(ind));
  data = [data; tmp_data]; %add newly read bytes to data variable
  fseek(fi,f_skips(ind),'cof'); %skip to next read position
end

fclose(fi);

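FYI: for overview and transparency, I plotted the first approx. 6,500 read positions of my actual data (figure below), which, after collapsing into fread-fseek pairs, can be summarized in 1,200 fread-fseek pairs.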

score 2 · Accepted Answer

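I would do two things to speed up your code:

preallocate the data array.
write a C MEX-file to call fread and fseek.

This is a quick test I did to compare using MATLAB or C to call fread and fseek: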

%% Create large binary file
data = 1:10000000; % 80 MB
fi = fopen('data.bin', 'wb');
fwrite(fi, data, 'double');
fclose(fi);

n_read = 1;
n_skip = 99;

%% Read using MATLAB
tic
fi = fopen('data.bin', 'rb');
fseek(fi, 0, 'eof');
sz = ftell(fi);
sz = floor(sz / (n_read + n_skip));
data = zeros(1, sz);
fseek(fi, 0, 'bof');
for ind = 1:sz
  data(ind) = fread(fi, n_read, 'int8');
  fseek(fi, n_skip, 'cof');
end
toc
fclose(fi);

%% Read using C MEX-file
mex fread_test_mex.c

tic
data = fread_test_mex('data.bin', n_read, n_skip);
toc

This is fread_test_mex.c:

#include <stdio.h>
#include <mex.h>

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
   // No testing of inputs...
   // inputs = 'data.bin', 1, 99
   char* fname = mxArrayToString(prhs[0]);
   int n_read = (int)mxGetScalar(prhs[1]);
   int n_skip = (int)mxGetScalar(prhs[2]);
   FILE* fi = fopen(fname, "rb");
   // File size; note that a long can overflow for files over 2 GB on some platforms
   fseek(fi, 0L, SEEK_END);
   long sz = ftell(fi);
   sz /= n_read + n_skip; // number of read-skip cycles
   plhs[0] = mxCreateNumericMatrix(1, sz, mxDOUBLE_CLASS, mxREAL);
   double* data = mxGetPr(plhs[0]);
   fseek(fi, 0L, SEEK_SET);
   signed char buffer[1]; // holds one int8 value, so n_read must be 1
   for(long ind = 0; ind < sz; ++ind) { // start at 0 so data[0] is filled too
      fread(buffer, 1, n_read, fi);
      data[ind] = buffer[0];
      fseek(fi, n_skip, SEEK_CUR);
   }
   fclose(fi);
   mxFree(fname);
}

I see this:

Elapsed time is 6.785304 seconds.
Building with 'Xcode with Clang'.
MEX completed successfully.
Elapsed time is 1.376540 seconds.

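That is, reading the data is 5x as fast using a C MEX-file. And that time includes loading the MEX-file into memory. A second run is a bit faster (1.14 s) because the MEX-file is already loaded.

In the MATLAB code, if I initialize data = []; and then extend the matrix every time I read, like the OP does: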
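The test above uses one fixed read/skip pair, whereas the question has a different pair for each cycle. A minimal sketch in plain C of how the same fread/fseek loop might be generalized is shown here. The function name read_skip and its parameters are hypothetical (they are not part of the answer), and the sketch uses the standard fseek with a long offset, which is an assumption that only holds where long is 64 bits; for an 11 GB file on other platforms a 64-bit variant such as fseeko (POSIX) or _fseeki64 (MSVC) would be needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not from the answer): reads n_cycles runs of bytes
 * from fp into the preallocated buffer out. Cycle i reads reads[i] bytes
 * and then skips skips[i] bytes. Returns the number of bytes stored, or -1
 * on a seek error, a short read, or if out is too small. */
long read_skip(FILE *fp, long start,
               const long *reads, const long *skips, size_t n_cycles,
               unsigned char *out, size_t out_cap)
{
    assert(fp != NULL && out != NULL);
    assert(n_cycles == 0 || (reads != NULL && skips != NULL));

    size_t pos = 0;                       /* next free slot in out */
    if (fseek(fp, start, SEEK_SET) != 0)  /* skip to the first read */
        return -1;

    for (size_t i = 0; i < n_cycles; ++i) {
        size_t want = (size_t)reads[i];
        if (want > out_cap - pos)         /* would overflow out */
            return -1;
        if (fread(out + pos, 1, want, fp) != want)
            return -1;                    /* short read or I/O error */
        pos += want;
        if (fseek(fp, skips[i], SEEK_CUR) != 0)
            return -1;
    }
    return (long)pos;
}
```

Inside a MEX-file, out could point at the data of an mxUINT8_CLASS array created once with the total number of bytes to read, so that nothing grows inside the loop.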

tmp = fread(fi, n_read, 'int8');
data = [data, tmp];

then the execution time for that loop was 159 s, with 92.0% of the time spent in the data = [data, tmp] line. Preallocating really is important!
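When the read lengths vary per cycle, preallocation needs the total size before the first read. A small sketch in C of one way to plan that, assuming the read lengths are known up front as in the question's f_reads; the name plan_output is hypothetical and only illustrates the idea.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given the byte count of each read, returns the
 * total output size and stores in offsets[i] the position where read i
 * starts in the output. Knowing the total up front lets the caller
 * allocate the output once instead of growing it on every read. */
size_t plan_output(const size_t *reads, size_t n_cycles, size_t *offsets)
{
    assert(n_cycles == 0 || (reads != NULL && offsets != NULL));

    size_t total = 0;
    for (size_t i = 0; i < n_cycles; ++i) {
        offsets[i] = total;   /* read i lands here */
        total += reads[i];
    }
    return total;
}
```

For the question's example read lengths [20 10 6 20 40], this gives a total of 96 bytes, so a single 96-byte buffer is enough and each read is written at its precomputed offset.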
