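0 votes

1 answer

2314 views

python - Using Python slice objects when reading an HDF5 file with h5py?

I am trying to use Python slice objects to access data in an HDF5 file through the h5py module. I put this example together to show that it works for numpy arrays but not for h5py.

This gives the following output:

Does anyone know whether this is simply not possible with h5py? If so, is there another way to slice h5py data using objects or variables, rather than typing the slice explicitly as in f['data'][0:3,2:5] from my example?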
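A minimal sketch of slicing with objects, assuming the selection is passed as a tuple of slice objects (which is what NumPy itself builds for `[0:3, 2:5]`). The file name `slice_demo.h5` and the array contents are placeholders, and behaviour may differ between h5py versions:

```python
import h5py
import numpy as np

# Build a small in-memory HDF5 file so the sketch leaves nothing on disk.
# "slice_demo.h5" and the dataset contents are placeholders.
arr = np.arange(50).reshape(5, 10)
with h5py.File("slice_demo.h5", "w", driver="core", backing_store=False) as f:
    f["data"] = arr

    # A tuple of slice objects is equivalent to typing [0:3, 2:5].
    sel = (slice(0, 3), slice(2, 5))
    from_h5py = f["data"][sel]

    # np.s_ builds the same tuple using the usual colon syntax.
    sel2 = np.s_[0:3, 2:5]
    from_h5py_s = f["data"][sel2]

# The same selection object works on the plain NumPy array.
from_numpy = arr[sel]
```

Keeping the selection in a variable such as `sel` lets the same object be reused for both the NumPy array and the HDF5 dataset.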

2013-12-11T15:33:19.573

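0 votes

1 answer

1039 views

python-3.x - h5py, sporadic write errors

I have some floating-point numbers to store in a large (500K x 500K) matrix. I store them in blocks, using arrays of variable size (determined by some specific conditions).

I have a parallel code (Python 3.3 and h5py) that generates the arrays and puts them in a shared queue, plus a dedicated process that pops them from the queue and writes them one by one to the HDF5 matrix. It works as expected about 90% of the time.

Occasionally I get a write error for a particular array. If I run it multiple times, the failing array changes from run to run.

Here is the code:

Here is the error:

If I insert a two-second pause (time.sleep(2)) in the write task, the problem seems to go away (although I cannot afford to waste 2 seconds per write, since I need to write more than 250,000 times). If I catch the write exception and put the failing array back in the queue, the script never stops (presumably).

I am using CentOS (2.6.32-279.11.1.el6.x86_64). Any insights?

Many thanks.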

python-3.x filesystems hdf5 h5py

2013-12-12T18:57:25.300

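0 votes

2 answers

124022 views

python - Input and output of numpy arrays to h5py

I have Python code whose output is a matrix (its size was given as an image) whose entries are all of type float. If I save it with the .dat extension, the file size is about 500 MB. I read that using h5py reduces the file size considerably. So, say I have an array named A. How do I save it to an h5py file? Also, how do I read the same file back and put it in a numpy array in a different program, since I need to manipulate the array?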
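A minimal sketch of the round trip, assuming a small random matrix as a stand-in for A. The file name `matrix.h5` and the dataset name `"A"` are chosen for illustration only, and the gzip filter is optional (how much it shrinks the file depends on the data):

```python
import os
import tempfile

import h5py
import numpy as np

A = np.random.rand(100, 200)  # stand-in for the real matrix

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "matrix.h5")

# Writer side: store A as a dataset named "A".
with h5py.File(path, "w") as f:
    f.create_dataset("A", data=A, compression="gzip")

# Reader side (this could live in a different script):
# indexing with [...] reads the whole dataset into a NumPy array.
with h5py.File(path, "r") as f:
    B = f["A"][...]
```

After the `with` block closes the file, `B` is an ordinary in-memory NumPy array and can be manipulated freely.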

python arrays numpy h5py

2014-01-04T23:41:12.490

0 votes

5 answers

1780 views

numpy - Python particles simulator: out-of-core processing

Problem description

I'm writing a Monte Carlo particle simulator (Brownian motion and photon emission) in Python/numpy. I need to save the simulation output (>>10GB) to a file and process the data in a second step. Compatibility with both Windows and Linux is important.

The number of particles (n_particles) is 10-100. The number of time-steps (time_size) is ~10^9.

The simulation has 3 steps (the code below is for an all-in-RAM version):

1. Simulate (and store) an emission rate array (contains many almost-0 elements):
   - shape (n_particles x time_size), float32, size 80GB
2. Compute counts array (random values from a Poisson process with previously computed rates):
   - shape (n_particles x time_size), uint8, size 20GB
3. Find timestamps (or index) of counts. Counts are almost always 0, so the timestamp arrays will fit in RAM.

I do step 1 once, then repeat steps 2-3 many (~100) times. In the future I may need to pre-process emission (apply cumsum or other functions) before computing counts.

Question

I have a working in-memory implementation and I'm trying to understand what is the best approach to implement an out-of-core version that can scale to (much) longer simulations.

What I would like to exist

I need to save arrays to a file, and I would like to use a single file for a simulation. I also need a "simple" way to store and recall a dictionary of simulation parameters (scalars).

Ideally I would like a file-backed numpy array that I can preallocate and fill in chunks. Then, I would like the numpy array methods (max, cumsum, ...) to work transparently, requiring only a chunksize keyword to specify how much of the array to load at each iteration.
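One way to approximate this wish is sketched below with `np.memmap`, which is only one possible file-backed array and not necessarily what the author adopted. The sizes are tiny stand-ins, the file name `emission.dat` is hypothetical, and `chunked_max` is an illustrative helper rather than a NumPy function:

```python
import os
import tempfile

import numpy as np

n_particles, time_size = 4, 10_000   # tiny stand-ins for the real sizes
chunksize = 1_000                    # time steps processed per iteration

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "emission.dat")

# Preallocate the file-backed array, then fill it chunk by chunk.
emission = np.memmap(path, dtype=np.float32, mode="w+",
                     shape=(n_particles, time_size))
rng = np.random.default_rng(0)
for start in range(0, time_size, chunksize):
    stop = min(start + chunksize, time_size)
    emission[:, start:stop] = rng.random((n_particles, stop - start),
                                         dtype=np.float32)
emission.flush()


def chunked_max(arr, chunksize):
    """Maximum over the whole array, touching `chunksize` columns at a time."""
    result = -np.inf
    for start in range(0, arr.shape[1], chunksize):
        result = max(result, float(arr[:, start:start + chunksize].max()))
    return result


overall_max = chunked_max(emission, chunksize)
```

The same loop structure would apply to other reductions; a running `cumsum` would additionally need to carry the last value of each chunk into the next one.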

Even better, I would like a Numexpr that operates not between cache and RAM but between RAM and hard drive.

What are the practical options

As a first option I started experimenting with pyTables, but I'm not happy with its complexity and abstractions (so different from numpy). Moreover my current solution (read below) is UGLY and not very efficient.

So the options for which I seek an answer are:

implement a numpy array with the required functionality (how?)
use pyTables in a smarter way (different data structures/methods)
use another library: h5py, blaze, pandas... (I haven't tried any of them so far).

Tentative solution (pyTables)

I save the simulation parameters in '/parameters' group: each parameter is converted to a numpy array scalar. Verbose solution but it works.

I save emission as an Extensible array (EArray), because I generate the data in chunks and I need to append each new chunk (I know the final size though). Saving counts is more problematic. If I save it as a pyTables array, it's difficult to perform queries like "counts >= 2". Therefore I saved counts as multiple tables (one per particle) [UGLY] and I query with .get_where_list('counts >= 2'). I'm not sure this is space-efficient, and generating all these tables instead of using a single array significantly clutters the HDF5 file. Moreover, strangely enough, creating those tables requires creating a custom dtype (even for standard numpy dtypes):

Each particle-counts "table" has a different name (name = "particle_%d" % ip), and I need to put them in a Python list for easy iteration.

EDIT: The result of this question is a Brownian Motion simulator called PyBroMo.

numpy pandas pytables h5py blaze

2014-01-05T23:55:16.343

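0 votes

1 answer

2625 views

python - Installing h5py on Windows fails - possibly a 32-bit vs 64-bit issue?

I just switched my Python installation over to the lovely Anaconda and am trying to get the packages I use regularly installed and working again. Unfortunately, I'm having some trouble getting h5py to work: the Python 2.7 installer provided by h5py.org does not seem to work. After the first screen, an error message pops up saying that Python version 2.7 is required but was not found in the registry. Python 2.7 is the only version I have on this computer (specifically 2.7.5, Anaconda 1.8.0 (64-bit)). Has anyone else run into this? Some people in long-forgotten threads worried that this might be a 32-bit vs 64-bit Python issue, but nothing came of it. Any ideas? Do I also need to install 32-bit Python to get this working?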

python python-2.7 h5py

2014-01-13T00:52:01.520

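0 votes

2 answers

1563 views

python - Fastest way to slice an ndarray

I have some event data from an HDF5 file:

I get the array data like this:

The structure looks like this:

I need to get the maximum index, for a specific event, of the rows whose first field is less than some value. The brute-force approach is:

The first index of the tuples is sorted, so I can bisect to speed things up:

This shows an improvement, but [row[0] for row in event] is slower than I expected. Any ideas on how to solve this?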
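A possible way to avoid the Python-level list comprehension, sketched under the assumption that the event data is a NumPy structured array sorted on its first field. The field names `"t"` and `"value"` and the sample rows are hypothetical, since the original structure is not shown:

```python
import numpy as np

# Hypothetical structured array standing in for one event's rows;
# the field names "t" and "value" are assumptions, not taken from the question.
event = np.array(
    [(1.0, 10), (2.5, 11), (2.5, 12), (4.0, 13), (7.5, 14)],
    dtype=[("t", "f8"), ("value", "i4")],
)
threshold = 4.0

# Brute force: index of the last row whose first field is below the threshold.
brute = max(i for i, row in enumerate(event) if row[0] < threshold)

# Vectorised: event["t"] selects the first field without building a Python list.
# side="left" returns the number of entries strictly below the threshold.
fast = np.searchsorted(event["t"], threshold, side="left") - 1
```

If no row is below the threshold, `fast` comes out as -1, so that case needs an explicit check before the value is used as an index.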

python performance numpy hdf5 h5py

2014-01-17T17:44:03.780

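0 votes

0 answers

149 views

python - h5py module does not write characters to the file

I'm having a problem with Python's h5py module.

I am trying to open a file in "r+" mode and want to change some characters in that file.

Whenever I overwrite the data, the characters are deleted.

This is what I get:

Reopening the file does not help either...

I tried doing the same with a file I created myself, and it worked fine:

So I guess there is something special about the particular file, "someFile", that I am trying to edit.

Many thanks for any advice or help!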

python h5py

2014-01-22T23:27:12.840

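0 votes

1 answer

109 views

numpy - What does broadcasting refer to in h5py?

The h5py documentation ( http://www.h5py.org/docs/high/dataset.html ) states the following:

Importantly, h5py does not use NumPy to do broadcasting before the write...

What does broadcasting refer to in this context?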
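For illustration, here is what broadcasting means in NumPy terms: a smaller array is treated as if it were repeated to fill a larger, compatible shape. The sketch uses NumPy only; the assumption is that the h5py documentation uses the term in the same sense for writes into a dataset selection, while (per the quoted sentence) not building the stretched array with NumPy first:

```python
import numpy as np

dest = np.zeros((3, 4))   # stand-in for a larger selection being written to
row = np.arange(4)        # smaller source array, shape (4,)

# Assigning a (4,) array to a (3, 4) target repeats it along the first axis.
dest[:, :] = row

# np.broadcast_to makes the stretched shape explicit (as a read-only view).
stretched = np.broadcast_to(row, dest.shape)
```

Every row of `dest` ends up holding the same four values as `row`.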

numpy h5py

2014-01-25T16:50:09.413

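0 votes

2 answers

20099 views

python - How to read a MATLAB array in Python using h5py

I have a MATLAB array > 2GB ... and I want to read it using h5py. The data is a simple 3D array of doubles. But I simply cannot find any pointers on the internet.

Can someone help me? I just need an example of how it is done. The h5py documentation did not help me.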
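A minimal sketch, assuming the array was saved in MATLAB's HDF5-based format (`-v7.3`), which h5py can open directly. The file name `big.mat`, the variable name `myarray`, and the helper `load_v73_array` are hypothetical; a plain HDF5 file written by h5py stands in for the real MAT-file here:

```python
import os
import tempfile

import h5py
import numpy as np


def load_v73_array(path, name):
    """Read one numeric variable from an HDF5-based (-v7.3) MAT-file."""
    with h5py.File(path, "r") as f:
        data = f[name][...]   # read the whole dataset into memory
    # MATLAB stores arrays column-major, so the axes arrive reversed;
    # transposing restores the axis order seen in MATLAB.
    return data.transpose()


# Stand-in for a real MAT-file: an HDF5 file with one 3-D double dataset.
# "big.mat" and "myarray" are hypothetical names.
path = os.path.join(tempfile.mkdtemp(), "big.mat")
matlab_view = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
with h5py.File(path, "w") as f:
    f["myarray"] = matlab_view.transpose()   # mimic the reversed axis order

loaded = load_v73_array(path, "myarray")
```

For an array too large for memory, indexing the dataset with a slice instead of `[...]` would read only part of it.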

python h5py

2014-01-31T12:45:30.147

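0 votes

0 answers

458 views

python - Efficiently loading 2D/4D masked arrays with h5py

I have data stored as large 4-D arrays in HDF5 files. The dimensions of each file are (Time,x,y,z). I want to load a subset of the Time columns at certain x,y,z values to get a 2D (T,location) matrix. I have a 3D logical mask for all the x, y, z locations I want to load. My question is: would I be better off reformatting the stored data as 2D?

I think the answer will be a simple "yes", since the hdf5 mask syntax lets you use a logical array as a mask for a single dimension but not for multiple dimensions. Code example:

MyTestScript.py

From my current understanding of h5py, the second option listed (where "data" is a 2D array) should load faster, since it never loads the whole array. In a first-pass test with code similar to the above, it was indeed much faster (about 10x faster; I timed it in the real script).

However, after the first run of the code, the time difference almost completely vanished. Now both versions load much faster, as if the hdf5 file (or the code's interpreter?) had somehow cached the fully loaded data. I don't know how it does that, because I run each in a separate python session (python MyTestScript.py).

What is going on? What is the source of the speedup? When can I count on it happening?

Thanks for any insights.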

python arrays performance h5py

2014-01-31T21:47:33.893

Questions tagged [h5py]
