
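[An earlier version of this post had the inaccurate title "How to add a column to a numpy record array?" The question asked in that earlier title has already been partially answered, but that answer is not what the body of the earlier version of this post was asking for. I have reworded the title and edited the post substantially to make the distinction clearer. I have also explained why the answer I mentioned earlier falls short of what I'm looking for.]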


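Say I have two numpy arrays x and y, each of them a "record" (aka "structured") array of r records. Let the shape of x be (r, c_x) and the shape of y be (r, c_y), that is, r records with c_x and c_y fields, respectively. Let's also assume that there is no overlap between x.dtype.names and y.dtype.names.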

For example, for r = 2, c_x = 2, and c_y = 1:

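import numpy as np
# list(...) is needed on Python 3, where zip returns an iterator
x = np.array(list(zip((1, 2), (3., 4.))), dtype=[('i', 'i4'), ('f', 'f4')])
y = np.array(list(zip(('a', 'b'))), dtype=[('s', 'S10')])  # 'S10' is the same dtype as 'a10'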

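I want to concatenate x and y "horizontally" to produce a new array of records z, having shape (r, c_x + c_y). This operation should not modify x or y at all.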
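As a rough sketch of how the quantities in this setup map onto NumPy attributes: NumPy itself reports the shape of a structured array with r records as (r,), and the field counts c_x and c_y correspond to the lengths of dtype.names. The variable names r, c_x, c_y, same_length and no_overlap below are only illustrative and are not part of the original post.

```python
import numpy as np

# Same example arrays as in the question (list(...) keeps this Python 3 friendly).
x = np.array(list(zip((1, 2), (3., 4.))), dtype=[('i', 'i4'), ('f', 'f4')])
y = np.array(list(zip(('a', 'b'))), dtype=[('s', 'S10')])

# NumPy reports one-dimensional shapes for structured arrays: one entry per record.
print(x.shape, y.shape)  # (2,) (2,)

# The "column" counts live in the dtype, not in the shape.
r = len(x)
c_x = len(x.dtype.names)
c_y = len(y.dtype.names)

# The two assumptions stated in the question.
same_length = len(x) == len(y)
no_overlap = set(x.dtype.names).isdisjoint(y.dtype.names)
print(r, c_x, c_y, same_length, no_overlap)  # 2 2 1 True True
```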

In general, z = np.hstack((x, y)) won't do, because the dtypes of x and y won't necessarily match. For example, continuing the example above:

z = np.hstack((x, y))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-def477e6c8bf> in <module>()
----> 1 z = np.hstack((x, y))
TypeError: invalid type promotion


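Now, there is a function, numpy.lib.recfunctions.append_fields, that looks like it may do something close to what I'm looking for, but I have not been able to get anything out of it: everything I've tried with it either fails with an error or produces something other than what I'm trying to get.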

Can someone please show me explicitly the code (using n.l.r.append_fields or otherwise [1]) that would generate, from the x and y defined in the example above, a new array of records, z, equivalent to the horizontal concatenation of x and y, and do so without modifying either x or y?

I assume that this will require only one or two lines of code. Of course, I am looking for code that does not require building z, record by record, by iterating over x and y. Also, the code may assume that x and y have the same number of records, and that there is no overlap between x.dtype.names and y.dtype.names. Other than this, the code I'm looking for should know nothing about x and y. Ideally, it should be agnostic also about the number of arrays to join. IOW, leaving out error checking, the code I'm looking for could be the body of a function hstack_rec so that the new array z would be the result hstack_rec((x, y)).


[1] ...although I have to admit that, after my so-far perfect record of failure with numpy.lib.recfunctions.append_fields, I've become a bit curious about how this function could be used at all, irrespective of its relevance to this post's question.


2 Answers


I never use recarrays, and so someone else is going to come up with something slicker, but maybe merge_arrays would work?

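>>> import numpy as np
>>> import numpy.lib.recfunctions as nlr
>>> # list(...) is needed on Python 3, where zip returns an iterator
>>> x = np.array(list(zip((1, 2), (3., 4.))), dtype=[('i', 'i4'), ('f', 'f4')])
>>> y = np.array(list(zip(('a', 'b'))), dtype=[('s', 'S10')])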
>>> x
array([(1, 3.0), (2, 4.0)], 
      dtype=[('i', '<i4'), ('f', '<f4')])
>>> y
array([('a',), ('b',)], 
      dtype=[('s', '|S10')])
>>> z = nlr.merge_arrays([x, y], flatten=True)
>>> z
array([(1, 3.0, 'a'), (2, 4.0, 'b')], 
      dtype=[('i', '<i4'), ('f', '<f4'), ('s', '|S10')])
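For what it's worth, here is a minimal sketch of how merge_arrays might be wrapped into the hstack_rec function the question describes. It assumes, as the question does, equal record counts and no overlapping field names, and it does no error checking. The second part shows one way append_fields can be called for the same data; usemask=False is used there on the assumption that a plain ndarray rather than a masked array is wanted. The names z2 and the exact calling style are illustrative only.

```python
import numpy as np
import numpy.lib.recfunctions as nlr


def hstack_rec(arrays):
    """Join structured arrays field-wise into a new structured array.

    Sketch only: assumes equal lengths and distinct field names.
    The inputs are not modified.
    """
    return nlr.merge_arrays(list(arrays), flatten=True)


x = np.array(list(zip((1, 2), (3., 4.))), dtype=[('i', 'i4'), ('f', 'f4')])
y = np.array(list(zip(('a', 'b'))), dtype=[('s', 'S10')])

z = hstack_rec((x, y))
print(z.dtype.names)  # ('i', 'f', 's')
print(z.shape)        # (2,)

# One possible append_fields call: pass the new field names and one
# data array per field; usemask=False asks for a plain ndarray back.
z2 = nlr.append_fields(
    x,
    names=list(y.dtype.names),
    data=[y[name] for name in y.dtype.names],
    usemask=False,
)
print(z2.dtype.names)  # ('i', 'f', 's')
```

Because hstack_rec simply forwards a list to merge_arrays, it should accept more than two arrays as well, e.g. hstack_rec((x, y, w)) for a hypothetical third array w with its own distinct field names.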
answered 2013-02-18T20:14:43.980

This is a very late answer, but maybe it will be helpful to someone else. I used this solution after asking the same question with most of the same criteria.

It doesn't generate a new numpy array, but iterating over the rows with zip and itertools.chain is much faster. In my case, I needed to access every value of every row in sequential order. Here is a benchmark that simulates this use case:

import numpy
from numpy.lib.recfunctions import merge_arrays
from itertools import chain

a = numpy.empty(3, [("col1", int), ("col2", float)])
b = numpy.empty(3, [("col3", int), ("col4", "U1")])

Results:

%timeit [i for i in (row for row in merge_arrays([a,b], flatten=True))]
52.9 µs ± 2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit [i for i in (row for row in (chain(i,k) for i,k in zip(a,b)))]
3.47 µs ± 52 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
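To make the zip/chain idea concrete, here is a small sketch that actually consumes the chained rows and checks them against merge_arrays. The helper name iter_joined_rows and the sample values are made up for illustration; the arrays are filled with explicit data rather than numpy.empty so the comparison is meaningful.

```python
from itertools import chain

import numpy as np
from numpy.lib.recfunctions import merge_arrays

a = np.array([(1, 1.5), (2, 2.5), (3, 3.5)],
             dtype=[("col1", int), ("col2", float)])
b = np.array([(10, "x"), (20, "y"), (30, "z")],
             dtype=[("col3", int), ("col4", "U1")])


def iter_joined_rows(*arrays):
    """Lazily yield one flat tuple of field values per row (hypothetical helper)."""
    for records in zip(*arrays):
        # Each record is iterable over its field values; chain flattens them.
        yield tuple(chain.from_iterable(records))


lazy_rows = list(iter_joined_rows(a, b))
merged_rows = [tuple(row) for row in merge_arrays([a, b], flatten=True)]

print(lazy_rows == merged_rows)  # True
```

One caveat when reading the timings: in the benchmark expression, the chain objects are created but never iterated, so some of the measured difference may simply be work that has been deferred. Timing a version that consumes every value, as the sketch does, would probably give a fairer comparison.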
answered 2018-02-24T03:59:49.450