I want to use Dask to do the following:

  1. Load a matrix from an HDF5 file
  2. Compute each entry in parallel

Here is my code:

import numpy as np
import h5py
import dask.array as da

def blocked_func(x):
    return np.random.random()

with h5py.File(file_path) as f:
    d = f['/data']
    arr = da.from_array(d, chunks=(chunks_row, chunks_col))

    arr2 = arr.map_blocks(blocked_func, dtype='float32').compute()

But the code throws the following error:

File ".../remote_fr_thinkpad/test_big_data.py", line 43, in <module>
    arr2 = arr.map_blocks(blocked_func, dtype='float32').compute()
  File ".../anaconda3/lib/python3.7/site-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File ".../anaconda3/lib/python3.7/site-packages/dask/base.py", line 399, in compute
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
  File ".../anaconda3/lib/python3.7/site-packages/dask/base.py", line 399, in <listcomp>
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
  File ".../anaconda3/lib/python3.7/site-packages/dask/array/core.py", line 779, in finalize
    return concatenate3(results)
  File ".../anaconda3/lib/python3.7/site-packages/dask/array/core.py", line 3497, in concatenate3
    chunks = chunks_from_arrays(arrays)
  File ".../anaconda3/lib/python3.7/site-packages/dask/array/core.py", line 3327, in chunks_from_arrays
    result.append(tuple([shape(deepfirst(a))[dim] for a in arrays]))
  File ".../anaconda3/lib/python3.7/site-packages/dask/array/core.py", line 3327, in <listcomp>
    result.append(tuple([shape(deepfirst(a))[dim] for a in arrays]))
IndexError: tuple index out of range

I searched around and also tried Dask's gufunc, but that raised the same error.

Thanks for your help.

1 Answer

By default, map_blocks expects blocked_func to return an array with the same shape as its input block, (chunks_row, chunks_col), but here it returns only a single float.
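The mismatch can be seen without computing anything. The sketch below is only an illustration: it assumes a small in-memory array as a hypothetical stand-in for the HDF5 dataset, and `scalar_per_block` is a made-up name that mirrors the question's function. Because no chunks= argument is passed, the lazy result still advertises the input's block structure, even though each block would actually produce a bare float.

```python
import numpy as np
import dask.array as da

# Hypothetical stand-in for the HDF5 dataset: a 4x6 array split into 2x3 blocks
x = da.ones((4, 6), chunks=(2, 3), dtype="float32")

def scalar_per_block(block):
    # Like the question's function: returns a bare float, not a 2-D block
    return np.random.random()

# Build the lazy result only; compute() is deliberately not called here
lazy = x.map_blocks(scalar_per_block, dtype="float32")

# With no chunks= given, the declared layout simply mirrors the input
declared_chunks = lazy.chunks
declared_shape = lazy.shape
print(declared_chunks, declared_shape)
```

Dask trusts this declared layout when it stitches the block results together in compute(), which is where the scalar results stop fitting.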

Try either

1) a shape-preserving function, for example:

def blocked_func(x):
    return x*2

or

2) telling map_blocks that the output will have a different shape:

arr2 = arr.map_blocks(blocked_func, chunks=(1,1), dtype='float32').compute()

while keeping the number of dimensions of the input array in blocked_func, for example:

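def blocked_func(x):
    # np.random.random() returns a plain float, which cannot be indexed,
    # so request a (1, 1) array directly
    return np.random.random((1, 1))
    # or like this
    # return np.array([[1]])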
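Both options can be tried end to end on a small in-memory array. This is a minimal sketch, not the asker's actual setup: `arr` is a hypothetical stand-in for `f['/data']`, and the helper names `random_like_block` and `one_value_per_block` are invented for illustration.

```python
import numpy as np
import dask.array as da

# Hypothetical stand-in for f['/data']: a 4x6 array split into 2x3 blocks
arr = da.ones((4, 6), chunks=(2, 3), dtype="float32")

def random_like_block(block):
    # Option 1: one random value per entry, same shape as the input block
    return np.random.random(block.shape).astype("float32")

def one_value_per_block(block):
    # Option 2: one value per block, kept 2-D so the blocks can be concatenated
    return np.full((1, 1), block.sum(), dtype="float32")

# Shape-preserving: the result has the same shape as the input
per_entry = arr.map_blocks(random_like_block, dtype="float32").compute()

# Shape-changing: chunks=(1, 1) declares that each block shrinks to 1x1
per_block = arr.map_blocks(
    one_value_per_block, chunks=(1, 1), dtype="float32"
).compute()

print(per_entry.shape, per_block.shape)
```

If the goal is one random number per matrix entry, as the question describes, the first variant is probably the closer fit; the second reduces each block to a single value.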
Answered 2019-07-10T10:19:07.743