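0 votes

1 answer

377 views

python - What are "synthetic dimensions" in Blaze?

The Blaze README (at https://github.com/ContinuumIO/blaze) describes a number of improvements over NumPy, including "synthetic dimensions". I have looked around but could not find out what they are.

Can someone enlighten me?

Thanks.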

2013-01-02T13:22:14.383

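0 votes

1 answer

2485 views

python - How do I install the blaze module (Continuum Analytics) in Python?

How do I install blaze natively (i.e., not in a virtual environment) in Python? The only instructions I have found are in the package's documentation (see link), and here, in a virtual environment.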

python numpy numeric blaze

2013-01-25T17:41:01.110

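0 votes

0 answers

181 views

python - What is the most robust and interaction-friendly way to structure generic 2D/3D/ND datasets in Python?

I am a scientist who recently switched from MATLAB to Python. I am looking for ways to structure (mostly 2D and 3D) datasets. I have searched the web a lot, and it seems to me that robust, general-purpose data structures in Python are still somewhat up in the air. I think this question and any answers will be highly relevant to other Python scientists looking for ways to structure their data that let them focus on the problem at hand rather than on the underlying implementation.

An example of my data structure is time x height x parameter, where the parameters are e.g. density, temperature, etc. For the time dimension I would like to use datetime objects, since this seems very robust and convenient for conversion, formatting, etc.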
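To make the time x height x parameter layout concrete, here is a minimal sketch using only NumPy. The axis values, the parameter names and the `select` helper are all made up for illustration; this is one possible arrangement, not a recommendation from any of the packages discussed.

```python
import numpy as np

# Hypothetical axes: 4 hourly time steps, 3 heights, 2 parameters.
times = np.arange("2013-11-21T00", "2013-11-21T04", dtype="datetime64[h]")
heights_km = np.array([100.0, 200.0, 300.0])
params = ["density", "temperature"]

# data[i, j, k] holds params[k] at times[i] and heights_km[j].
data = np.zeros((times.size, heights_km.size, len(params)))
data[:, :, params.index("temperature")] = 250.0  # placeholder values


def select(data, times, params, start, stop, param):
    """Return the (time, height) block of one parameter for start <= t < stop."""
    i0 = np.searchsorted(times, np.datetime64(start))
    i1 = np.searchsorted(times, np.datetime64(stop))
    return data[i0:i1, :, params.index(param)]


temperature = select(data, times, params,
                     "2013-11-21T01", "2013-11-21T03", "temperature")
print(temperature.shape)  # (2, 3)
```

Keeping the axes as separate arrays is simple but leaves all the bookkeeping to the user, which is exactly the burden a labelled-array package would remove.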

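So far I have looked at Pandas and MetaArray (from the SciPy Cookbook).

The main drawback of Pandas as a data type is that it is so much more than that. For example, each dimension in a Panel (items, major axis, minor axis) seems to have certain preferred uses, though I do not know which. In particular, indexing differs from dimension to dimension, and some dimensions may not be expandable after the data structure has been created. So, although some of Pandas' features (such as grouping with .groupby) ...

I have also briefly looked at MetaArray from the SciPy Cookbook. It looks more like a clean data type, and its indexing seems very intuitive and flexible, which makes it better suited to interactive scientific work. However, it is not (AFAIK) part of any package and has to be downloaded and installed manually, which makes portability harder if I need to collaborate with other scientists. Moreover, I have found almost no examples of its use, so it seems more like an ad hoc solution to the problem of structured N-dimensional datasets.

I have also heard of Blaze, touted as the "next generation of NumPy", but as far as I can tell it is still in early development. (Experience with Blaze is welcome!)

I would therefore like some examples (modules, packages, etc.) of how to structure N-dimensional datasets (3D in particular) in Python, above all for convenient interactive use.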

python data-structures numpy dataset blaze

2013-11-21T12:13:19.150

0 votes

5 answers

1780 views

numpy - Python particles simulator: out-of-core processing

Problem description

I'm writing a Monte Carlo particle simulator (Brownian motion and photon emission) in Python/NumPy. I need to save the simulation output (>>10GB) to a file and process the data in a second step. Compatibility with both Windows and Linux is important.

The number of particles (n_particles) is 10-100. The number of time-steps (time_size) is ~10^9.

The simulation has 3 steps (the code below is for an all-in-RAM version):

1. Simulate (and store) an emission rate array (contains many almost-0 elements):
   - shape (n_particles x time_size), float32, size 80GB
2. Compute the counts array (random values from a Poisson process with the previously computed rates):
   - shape (n_particles x time_size), uint8, size 20GB
3. Find the timestamps (or indices) of counts. Counts are almost always 0, so the timestamp arrays will fit in RAM.

I do step 1 once, then repeat steps 2-3 many (~100) times. In the future I may need to pre-process emission (apply cumsum or other functions) before computing counts.
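As a rough illustration of the three steps, here is a toy all-in-RAM sketch. It is not the code from the question: the array sizes are tiny stand-ins, and the exponential emission model is a placeholder chosen only to produce mostly near-zero rates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_particles, time_size = 3, 10_000  # tiny stand-ins for 10-100 and ~10^9

# Step 1: emission rates, mostly close to zero (placeholder model, not the real physics).
emission = rng.exponential(scale=0.01, size=(n_particles, time_size)).astype(np.float32)

# Step 2: Poisson counts drawn with the rates from step 1.
counts = rng.poisson(lam=emission).astype(np.uint8)

# Step 3: per-particle timestamps (indices) where at least one count occurred.
timestamps = [np.nonzero(counts[ip])[0] for ip in range(n_particles)]
```

Because counts are almost always 0, each entry of `timestamps` is far smaller than the corresponding row of `counts`, which is why only steps 1 and 2 need out-of-core storage.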

Question

I have a working in-memory implementation and I'm trying to understand what is the best approach to implement an out-of-core version that can scale to (much) longer simulations.

What I would like to exist

I need to save arrays to a file, and I would like to use a single file for a simulation. I also need a "simple" way to store and recall a dictionary of simulation parameters (scalars).

Ideally I would like a file-backed numpy array that I can preallocate and fill in chunks. Then, I would like the numpy array methods (max, cumsum, ...) to work transparently, requiring only a chunksize keyword to specify how much of the array to load at each iteration.

Even better, I would like a Numexpr that operates not between cache and RAM but between RAM and hard drive.
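One way to approximate the preallocated, chunk-filled, file-backed array is `np.memmap`. The sketch below is only an illustration under that assumption: the file name, the sizes and the `chunked_max` helper are hypothetical, and a raw memmap file stores neither shape, dtype nor the parameter dictionary, so it does not cover the single-file requirement on its own.

```python
import os
import tempfile

import numpy as np


def chunked_max(arr, chunksize):
    """Maximum over the last axis, reading `chunksize` columns at a time."""
    result = None
    for start in range(0, arr.shape[-1], chunksize):
        chunk_max = np.asarray(arr[..., start:start + chunksize]).max(axis=-1)
        result = chunk_max if result is None else np.maximum(result, chunk_max)
    return result


n_particles, time_size, chunksize = 2, 1000, 256  # tiny stand-ins
path = os.path.join(tempfile.mkdtemp(), "emission.dat")  # hypothetical file name

# Preallocate the file-backed array, then fill it chunk by chunk.
emission = np.memmap(path, dtype=np.float32, mode="w+",
                     shape=(n_particles, time_size))
rng = np.random.default_rng(0)
for start in range(0, time_size, chunksize):
    stop = min(start + chunksize, time_size)
    emission[:, start:stop] = rng.random((n_particles, stop - start), dtype=np.float32)
emission.flush()

# Reopen read-only and reduce without loading the whole array at once.
emission_ro = np.memmap(path, dtype=np.float32, mode="r",
                        shape=(n_particles, time_size))
per_particle_max = chunked_max(emission_ro, chunksize)
```

A reduction such as max combines cleanly across chunks; cumsum would additionally need the running total carried from one chunk to the next.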

What are the practical options

As a first option I started experimenting with pyTables, but I'm not happy with its complexity and abstractions (so different from numpy). Moreover my current solution (read below) is UGLY and not very efficient.

So the options I'm seeking answers on are:

- implement a numpy array with the required functionality (how?)
- use pyTables in a smarter way (different data structures/methods)
- use another library: h5py, blaze, pandas... (I haven't tried any of them so far).

Tentative solution (pyTables)

I save the simulation parameters in '/parameters' group: each parameter is converted to a numpy array scalar. Verbose solution but it works.

I save emission as an Extensible array (EArray), because I generate the data in chunks and I need to append each new chunk (I know the final size though). Saving counts is more problematic. If I save it as a pyTables array, it's difficult to perform queries like "counts >= 2". Therefore I saved counts as multiple tables (one per particle) [UGLY] and I query with .get_where_list('counts >= 2'). I'm not sure this is space-efficient, and generating all these tables instead of using a single array significantly clutters the HDF5 file. Moreover, strangely enough, creating those tables requires creating a custom dtype (even for standard numpy dtypes):

Each particle-counts "table" has a different name (name = "particle_%d" % ip), and I need to put them in a Python list for easy iteration.

EDIT: The result of this question is a Brownian Motion simulator called PyBroMo.

numpy pandas pytables h5py blaze

2014-01-05T23:55:16.343

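0 votes

2 answers

845 views

python - Minimal Blaze example in Anaconda Python

I am trying to get a simple Blaze example working in an Anaconda installation (Python 3.3) on Ubuntu.

But running it gives me this error:

However, I can import datashape, using

When I try conda install <pkgname>, I am told the dependencies are already satisfied. I think this is related to this issue, but I find the suggestion there hard to understand.

Any help is appreciated.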

python anaconda blaze datashape

2014-03-14T14:38:47.827

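0 votes

3 answers

282 views

python - What Clang++ is needed to build Blaze?

Curious about what Blaze (the next-generation NumPy) will look like, I tried to install it

The tarball blaze-0.1.tar.gz was downloaded, but I got an error:

Questions:

What is Clang++? I don't think it is a Python package/module. It seems to be related to C++.

What should I install to satisfy this requirement? I am using Ubuntu 14.04.

Thanks in advance!

python c++ numpy blaze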

2014-06-25T10:31:55.367

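0 votes

1 answer

415 views

python - Correlation between columns in python blaze

I have a simple question about how to do analysis with the python blaze module. I am trying to run this code:

Here I get this error:

After reading some of the blaze documentation, I found that the fix is to convert the blaze columns into a structure like this:

But this conversion makes the iterative computation of pearsonr over a list of columns slow. So how can I simply convert blaze columns to np.array in order to use computations such as pearsonr or statsmodels.api.Logit(blz_frame.y, blz_frame[[train_cols]])? In case it matters, I am using Anaconda for Python 3.4, and my blaze version is: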

python statsmodels blaze

2014-10-31T11:13:25.467

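0 votes

2 answers

785 views

python - Filtering on dates in a Blaze Table

I am using Blaze (0.6.3) with Anaconda 2.1.0 (on Python 2.7.8). I am trying to apply a filter based on the date of a table row.

A mock TSV file is as follows:

The Python code is

The first two filters are fine, but the third throws a SyntaxError.

It all seems to boil down to the following:

This is syntactically invalid. Somehow, somewhere, datetime(1970,1,1) gets translated into datetime(1970-01-01 00:00:00), and then the datetime is forgotten. Blaze itself recognizes the date column as having type ?datetime, which is exactly what I want, but it fails in comparisons.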
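The following sketch shows one way such an invalid string can arise in plain Python: interpolating a datetime with str() instead of repr(). It says nothing about where this happens inside Blaze, which the question does not show; the `is_valid_expression` helper is made up for the illustration.

```python
from datetime import datetime

value = datetime(1970, 1, 1)

# str() gives a human-readable form that is not a Python expression ...
as_text = "datetime(%s)" % value   # 'datetime(1970-01-01 00:00:00)'
# ... whereas repr() gives source text that Python can parse again.
as_repr = repr(value)              # 'datetime.datetime(1970, 1, 1, 0, 0)'


def is_valid_expression(source):
    """Return True if `source` parses as a Python expression (it is not evaluated)."""
    try:
        compile(source, "<generated>", "eval")
        return True
    except SyntaxError:
        return False


print(is_valid_expression(as_text))  # False
print(is_valid_expression(as_repr))  # True
```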

Am I using it the wrong way?

python datetime anaconda blaze

2014-11-10T14:14:51.347

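0 votes

1 answer

130 views

python - Bug/error in Blaze query

I am trying to use the python module blaze. It works when I use it on a small dataset. When I move to a larger, more complex dataset, I run into errors. I give an example below. Judging from the error, blaze seems unable to convert the first column to a date. How can I specify the dtype of a particular column as string so that Blaze does not try to parse it? Thanks.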

python blaze

2014-12-11T17:25:53.743

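0 votes

1 answer

955 views

python - pydata blaze: does it allow parallel processing?

I am looking to parallelize numpy or pandas operations. To that end I have been looking into pydata's blaze. My understanding is that seamless parallelization is its main selling point.

Unfortunately, I have not been able to find an operation that runs on more than one core. Is parallel processing available in blaze, or is it currently only a stated goal? Am I doing something wrong? I am using blaze v0.6.5.

An example of a function I would like to parallelize: (deduplication of a pytables column too large to fit in memory)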
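For orientation, a chunked deduplication of this kind might look like the sketch below. It is not the code from the question and uses neither blaze nor pytables: `column` is assumed to be any array-like that supports slicing, the `unique_chunked` name is hypothetical, and the set of unique values is assumed to fit in memory. It runs on a single core, so it only shows the work that would need to be parallelized.

```python
import numpy as np


def unique_chunked(column, chunksize):
    """Sorted unique values of `column`, reading `chunksize` rows at a time."""
    seen = np.empty(0, dtype=column.dtype)
    for start in range(0, len(column), chunksize):
        chunk = np.unique(column[start:start + chunksize])
        seen = np.union1d(seen, chunk)  # merge this chunk's uniques into the running set
    return seen


column = np.array([3, 1, 3, 2, 1, 2, 5])  # small in-memory stand-in
print(unique_chunked(column, chunksize=3))  # [1 2 3 5]
```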

Edit 1

I am having trouble following Phillip's example:

My environment:

But note that blaze seems to report the wrong version:

With other data sources, blaze seems to work:

python numpy pandas multiprocessing blaze

2014-12-16T13:27:36.143

Questions tagged [blaze]
