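0 votes

2 answers

6581 views

r - How can I read .hdf files in R?

I have a large number of files in .hdf format. Unfortunately, these are not HDF5 files, which I know can be read in R. Is there a way to load and read .hdf files in R? Alternatively, is there a way to convert .hdf to HDF5? I downloaded the C-based h4toh5 tool, but it did not work. Is there another way to convert them? Thank you very much.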

r hdf

2014-07-31T11:00:09.407

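0 votes

1 answer

800 views

python - Pandas - read_hdf or store.select returning incorrect results for a query

I have a large dataset (4 million rows, 50 columns) stored via pandas store.append. When I use store.select or read_hdf to query for rows where 2 columns are each greater than some value (i.e. "(a > 10) & (b > 1)"), I get about 15,000 rows back.

When I read in the whole table, say into df, and do df[(df.a > 10) & (df.b > 1)], I get 30,000 rows. I narrowed the problem down: when I read in the whole table and do df.query("(a > 10) & (b > 1)"), I get the same 15,000 rows, but when I set the engine to python, i.e. df.query("(a > 10) & (b > 1)", engine = 'python'), I get 30,000 rows.

I suspect this has something to do with the eval/numexpr method that both the HDF query and the query method use.

The dtypes of columns a and b are float64, and the problem persists even if I query with floats (i.e. 1. instead of 1).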
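To make the comparison above reproducible, here is a minimal sketch that counts matches for the same condition in the three ways described: a boolean mask, df.query with the default engine, and df.query with engine='python'. The helper name count_matches and the random data are illustrative assumptions, not the original dataset; on a healthy installation the three counts would be expected to agree.

```python
import numpy as np
import pandas as pd


def count_matches(df):
    """Count rows matching (a > 10) & (b > 1) in three different ways."""
    expr = "(a > 10) & (b > 1)"
    return {
        # Plain boolean indexing, evaluated by NumPy.
        "mask": len(df[(df.a > 10) & (df.b > 1)]),
        # Default engine: numexpr if it is installed, otherwise python.
        "query_default": len(df.query(expr)),
        # Explicit pure-Python evaluation.
        "query_python": len(df.query(expr, engine="python")),
    }


# Synthetic float64 columns standing in for the real table (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(10, 5, 1000), "b": rng.normal(1, 2, 1000)})
counts = count_matches(df)
print(counts)
```

If the counts differ on the real data, the versions reported by pd.show_versions() for pandas and numexpr are probably the first thing to compare.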

I would appreciate any feedback, or hearing from anyone else who has the same problem, so that we can get this resolved.

Regards, Neil

=========================

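Here is the info:

pd.show_versions()

df.info() ---> on the selected ~15,000 rows

ptdump -av file

This is how I read in the table:

This is how I write/populate the table:

EDIT: In response to the request to provide some sample data....

Data

For clarity, let's call the 15,000-row result "Incorrect" and the 30,000-row result "Correct", and let's call the items that are in Correct but not in Incorrect "Correct_only".

I have confirmed that all rows/items in Incorrect are found in Correct.

Here are a few rows of data (I took only rows 10000 and 10001 from each):

Correct_only:

Incorrect:

Correct: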

2014-07-31T21:57:12.600

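0 votes

0 answers

435 views

matlab - Is it possible to uncompress/compress HDF.Z files in Matlab

I would like to know whether it is possible to uncompress and recompress *.HDF.Z files in Matlab. If it is possible, could you please tell me how? Thank you very much!

Here is my current code snippet.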

matlab zip unzip hdf

2014-08-01T19:03:08.893

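0 votes

2 answers

1513 views

python - Creating a reference to an HDF dataset in H5py using astype

From the h5py docs, I see that I can cast an HDF dataset to another type using the dataset's astype method. This returns a context manager that performs the conversion on the fly.

However, I would like to read in a dataset stored as uint16 and then cast it to float32. Thereafter, I would like to extract various slices from this dataset, in a different function, as the cast type float32. The docs explain the usage as

This would cause the entire dataset to be read in and converted to float32, which is not what I want. I would like to have a reference to the dataset, but cast to float32, equivalent to numpy.astype. How do I create a reference to the .astype('float32') object so that I can pass it to another function for use?

An example:

Furthermore, it seems that the astype context only applies when the data elements are accessed. This means that

So is there no numpy-esque way of using astype?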
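One way to get the kind of reference described above is a thin wrapper that converts each slice as it is read. The sketch below is only an illustration: CastView and mean_of_rows are hypothetical names, not part of h5py, and a NumPy array stands in for the HDF dataset so that the example runs without a file. Anything that supports slicing and has a shape, such as an h5py dataset, should behave the same way.

```python
import numpy as np


class CastView:
    """Hypothetical helper: casts slices of an array-like to `dtype` on access."""

    def __init__(self, source, dtype):
        self.source = source          # e.g. an h5py dataset (assumption)
        self.dtype = np.dtype(dtype)

    @property
    def shape(self):
        return self.source.shape

    def __getitem__(self, key):
        # Only the requested slice is read from the source, then converted.
        return np.asarray(self.source[key]).astype(self.dtype)


def mean_of_rows(view, start, stop):
    """Example consumer that receives the view and slices it itself."""
    return view[start:stop].mean()


# A uint16 array standing in for the stored dataset.
raw = np.arange(12, dtype=np.uint16).reshape(4, 3)
view = CastView(raw, "float32")
print(view[1:3].dtype, mean_of_rows(view, 0, 2))
```

As far as I know, newer h5py releases also let the object returned by dset.astype('float32') be sliced directly, which would make the wrapper unnecessary there; that is worth checking against the documentation for the installed version.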

python numpy h5py hdf

2014-08-11T10:26:33.400

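0 votes

0 answers

208 views

emr - How do I enable HDFS caching on Amazon EMR?

What is the simplest way to enable HDFS caching on EMR?

More specifically, how do I set dfs.datanode.max.locked.memory and increase the "max size that may be locked into memory" (ulimit -l) on all nodes?

The following code seems to work for dfs.datanode.max.locked.memory, and I could probably write a custom bootstrap action to update /usr/lib/hadoop/hadoop-daemon.sh and call ulimit. Is there a better or faster way?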

emr hdf

2014-09-19T16:40:03.070

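0 votes

0 answers

511 views

matlab - How to read HDF data from HDFS for Hadoop

I am working on image processing on Hadoop. I am using HDF satellite data for the processing. I can access and use jpg and other image types in Hadoop streaming, but an error occurs when I use HDF data: Hadoop cannot read the HDF data from HDFS. It also takes more than twenty minutes for the error to show up. Each of my HDF files is more than 150 MB in size.

How can I solve this problem? How can I make Hadoop read this HDF data from HDFS?

Some of my code

The error log is:

Could anyone please help me solve this problem?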

matlab hadoop hive distributed-computing hdf

2014-09-26T10:35:50.743

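0 votes

2 answers

266 views

matlab - How to store an image with a colormap in MATLAB

I am using HDF satellite data to retrieve bands, from which I derive different vegetation indices. Each band in the HDF data is a grayscale image. After processing the HDF data, I can convert it to color using a colormap (I use jet as the colormap). My question is how to convert the grayscale image to a colormapped one when using imwrite, i.e. how to use a colormap in imwrite. I have tried many times, but the output is entirely blue, which ruins the output image. Please help me do this.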

matlab image-processing octave hdf

2014-09-29T06:29:40.570

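0 votes

1 answer

1081 views

python - Pandas HDFStore - getting the last record from multiple tables

I have a large number of dataframes exported via Pandas into a series of HDFStore files. I need to be able to quickly extract the latest record of each dataframe on demand.

Setup:

I have roughly 100 dataframes stored in each HDF file, and about 5000 files to run through. Each dataframe in the HDFStore is indexed with a DateTimeIndex.

For a single file, I am currently looping through HDFStore.keys() and then querying each dataframe using tail(1), like so:

Is there a better way to do this, perhaps with HDFStore.select_as_multiple? Even selecting the last record without pulling the entire dataframe in order to tail it would probably speed things up considerably. How can this be done?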
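For illustration, here is a minimal sketch of selecting only the final stored row of each frame instead of loading the frame and calling tail(1). It assumes the frames were written in table format (for example with format="table" or store.append), that PyTables is installed, and that rows were appended in chronological order so the last stored row is the latest record. The function name last_rows and the demo file are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def last_rows(path):
    """Return {key: one-row DataFrame} holding the last stored row per table."""
    out = {}
    with pd.HDFStore(path, mode="r") as store:
        for key in store.keys():
            # Row count comes from the table metadata, not from reading data.
            nrows = store.get_storer(key).nrows
            if nrows:
                out[key] = store.select(key, start=nrows - 1, stop=nrows)
    return out


# Tiny demo file with two table-format frames (illustrative data only).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
idx = pd.date_range("2014-10-01", periods=5, freq="D")
for name in ("frame_a", "frame_b"):
    frame = pd.DataFrame({"x": np.arange(5.0)}, index=idx)
    frame.to_hdf(demo_path, key=name, format="table")

latest = last_rows(demo_path)
print({key: row.index[0] for key, row in latest.items()})
```

Frames stored in fixed format do not support start/stop selection in the same way, so this sketch would not apply to them as written.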

python pandas hdfstore hdf

2014-10-15T00:21:31.807

0 votes

0 answers

1604 views

python - Pandas - Optimal persistence strategy for highest compression ratio?

Question

Given a large series of DataFrames with a small variety of dtypes, what is the optimal design for Pandas DataFrame persistence/serialization if I care about compression ratio first, decompression speed second, and initial compression speed third?

Background:

I have roughly 200k dataframes of shape [2900,8] that I need to store in logical blocks of ~50 data frames per file. The data frame contains variables of type np.int8, np.float64. Most data frames are good candidates for sparse types, but sparse is not supported in HDF 'table' format stores (not that it would even help - see the size below for a sparse gzipped pickle). Data is generated daily and currently adds up to over 20GB. While I'm not bound to HDF, I have yet to find a better solution that allows for reads on individual dataframes within the persistent store, combined with top quality compression. Again, I'm willing to sacrifice a little speed for better compression ratios, especially since I will need to be sending this all over the wire.

There are a couple of other SO threads and other links that might be relevant for those that are in a similar position. However most of what I've found doesn't focus on minimizing storage size as a priority:

“Large data” work flows using pandas

HDF5 and SQLite. Concurrency, compression & I/O performance [closed]

Environment:

Example:

Results

Given the results above, the best 'compression-first' solution appears to be to store the data in HDF fixed format, with bzip2. Is there a better way of organising the data, perhaps without HDF, that would allow me to save even more space?
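As a rough way to compare general-purpose compressors outside HDF, the sketch below measures compressed sizes for one serialized block using only the Python standard library. The block is a synthetic, mostly-zero float64 buffer with the [2900, 8] shape mentioned above, so it is an assumption and not the real data; compressed_sizes is a hypothetical helper, and the numbers it prints say nothing about how the actual frames would compress.

```python
import bz2
import lzma
import zlib
from array import array


def compressed_sizes(payload):
    """Return the byte size of `payload` raw and under three stdlib compressors."""
    return {
        "raw": len(payload),
        "zlib-9": len(zlib.compress(payload, 9)),
        "bz2-9": len(bz2.compress(payload, 9)),
        "lzma-9": len(lzma.compress(payload, preset=9)),
    }


# Synthetic stand-in for one 2900 x 8 float64 frame: mostly zeros, a few values.
values = array("d", [0.0] * (2900 * 8))
for i in range(0, len(values), 97):
    values[i] = float(i)
payload = values.tobytes()

sizes = compressed_sizes(payload)
for name, size in sizes.items():
    print(f"{name:>7}: {size:>7} bytes")
```

Running the same function over a few real serialized frames would show whether bzip2 still comes out ahead once the HDF container overhead is removed.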

Update 1

Per the comment below from Jeff, I have used ptrepack on the table store HDF file without initial compression -- and then recompressed. Results are below:

Oddly, recompressing with ptrepack seems to increase total file size (at least in this case using table format with similar compressors).

python serialization pandas persistence hdf

2014-10-17T14:25:29.077

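0 votes

1 answer

403 views

java - NetCDF 4.5 Java problems with NetCDF version 4 files + old code for HDF not working

I have files in NetCDF version 3. I rechunked my files using the latest ncks for Windows (released 1 October 2014) with ncks -4 --cnk_dmn lat,4 --cnk_dmn lon,4 --cnk_dmn time,512 2014.nc 2014_chunked.nc, which produced the NetCDF version 4 file 2014_chunked.nc.

WCT, for example, can read the new file 2014_chunked.nc. However, the Java code produces

and throws the exception

The code is

I am using the latest NetCDF 4.5 for JRE 7 http://www.unidata.ucar.edu/downloads/netcdf/netcdf-java-4/index.jsp

I looked into the netcdf jar file and found that Nc4.class is only a few bytes long, so the jar has no NetCDF4 iosp and uses H5iosp for NetCDF 4 files.

I guess the new NetCDF 4.5 Java must be slightly different from version 4.2, which I had been using, because the same Java code for opening HDF5 and HDF4 files works fine with netcdf 4.2 but not with 4.5:

What is wrong?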

java hdf5 netcdf hdf nco

2014-10-21T08:14:20.563

Questions tagged [hdf]
