python - numpy.concatenate on record arrays fails when the arrays have strings of different lengths

Question

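When trying to concatenate record arrays that have a string-dtype field whose length differs between the arrays, the concatenation fails.

As you can see in the following example, the concatenation works if 'f1' has the same length in both arrays, but fails if it does not.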

In [1]: import numpy as np

In [2]: a = np.core.records.fromarrays( ([1,2], ["one","two"]) )

In [3]: b = np.core.records.fromarrays( ([3,4,5], ["three","four","three"]) )

In [4]: c = np.core.records.fromarrays( ([6], ["six"]) )

In [5]: np.concatenate( (a,c) )
Out[5]: 
array([(1, 'one'), (2, 'two'), (6, 'six')], 
      dtype=[('f0', '<i8'), ('f1', '|S3')])

In [6]: np.concatenate( (a,b) )
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

/u/jegannas/<ipython console> in <module>()

TypeError: expected a readable buffer object

However, if we concatenate just the arrays (rather than the records), it succeeds, even though the strings have different sizes.

In [8]: np.concatenate( (a['f1'], b['f1']) )
Out[8]: 
array(['one', 'two', 'three', 'four', 'three'], 
      dtype='|S5')

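Is this a bug in concatenate when concatenating records, or is it the expected behavior? I came up with the following way to work around it.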

In [10]: np.concatenate( (a.astype(b.dtype), b) )
Out[10]: 
array([(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four'), (5, 'three')], 
      dtype=[('f0', '<i8'), ('f1', '|S5')])

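The problem here, though, is that I have to go through all the recarrays I am concatenating, find the largest string length, and use that. If a record array has more than one string column, I need to keep track of a few more things as well.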
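The bookkeeping described above can be automated. The following is a minimal sketch and not part of the original post: `concat_widening` is a hypothetical helper name, it assumes every input has the same field names in the same order, and it relies on `np.result_type` to pick, field by field, a dtype wide enough for all inputs. It is written for Python 3, where the inferred string type is a unicode dtype such as '<U3' rather than '|S3'.

```python
import numpy as np


def concat_widening(arrays):
    """Concatenate structured arrays whose fields differ only in width.

    Hypothetical helper (not a NumPy function). Assumes all inputs share
    the same field names in the same order.
    """
    arrays = list(arrays)
    names = arrays[0].dtype.names
    for arr in arrays[1:]:
        if arr.dtype.names != names:
            raise ValueError("all arrays must have the same field names")
    # For each field, let NumPy pick a dtype that can hold every input,
    # e.g. two string fields of width 3 and 5 promote to width 5.
    common = np.dtype(
        [(name, np.result_type(*[arr.dtype[name] for arr in arrays]))
         for name in names]
    )
    # Cast every input to the common dtype, then concatenate as usual.
    return np.concatenate([arr.astype(common) for arr in arrays])


a = np.rec.fromarrays(([1, 2], ["one", "two"]))
b = np.rec.fromarrays(([3, 4, 5], ["three", "four", "three"]))
result = concat_widening([a, b])
print(result.dtype)
print(result["f1"].tolist())
```

Because the widths are computed per field, this handles several string columns without any extra tracking.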

What do you think is the best way to overcome this, at least for now?

score 6 · Accepted Answer

Posting a complete answer. As Pierre GM suggested, the module:

import numpy.lib.recfunctions

provides the solution. However, the function that does what you want is:

numpy.lib.recfunctions.stack_arrays((a,b), autoconvert=True, usemask=False)

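(usemask=False is only there to avoid creating a masked array that you probably will not use. The important part is autoconvert=True, which forces the dtype of a's 'f1' field to be converted from "|S3" to "|S5".)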
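For completeness, here is the call above as a self-contained sketch. It assumes Python 3 and a recent NumPy, so the arrays are built with `np.rec.fromarrays` and the string field is inferred as a unicode dtype rather than '|S3'; the variable name `stacked` is only illustrative.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

a = np.rec.fromarrays(([1, 2], ["one", "two"]))
b = np.rec.fromarrays(([3, 4, 5], ["three", "four", "three"]))

# autoconvert=True widens 'f1' to the larger of the two string dtypes;
# usemask=False returns a plain (unmasked) structured array.
stacked = rfn.stack_arrays((a, b), autoconvert=True, usemask=False)

print(stacked.dtype)
print(stacked["f1"].tolist())
```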

score 2 · Accepted Answer

When you do not specify a dtype, np.rec.fromarrays (a.k.a. np.core.records.fromarrays) tries to guess the dtype for you. Hence,

In [4]: a = np.core.records.fromarrays( ([1,2], ["one","two"]) )

In [5]: a
Out[5]: 
rec.array([(1, 'one'), (2, 'two')], 
      dtype=[('f0', '<i4'), ('f1', '|S3')])

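Note that the dtype of column f1 is a 3-byte string.

You cannot concatenate with np.concatenate( (a,b) ) because numpy sees that the dtypes of a and b differ, and it does not change the dtype of the smaller string to match the larger one.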
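The mismatch can be inspected directly. This is a small illustrative sketch, assuming Python 3, where the inferred string fields are unicode dtypes rather than '|S3' and '|S5':

```python
import numpy as np

a = np.rec.fromarrays(([1, 2], ["one", "two"]))
b = np.rec.fromarrays(([3, 4, 5], ["three", "four", "three"]))

# Only the width of 'f1' differs, but that is enough to make the
# two record dtypes unequal.
print(a.dtype["f1"], b.dtype["f1"])
print(a.dtype == b.dtype)  # False
```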

If you know a maximum string size that works for all of the arrays, you can specify the dtype from the start:

In [9]: a = np.rec.fromarrays( ([1,2], ["one","two"]), dtype = [('f0', '<i4'), ('f1', '|S8')])

In [10]: b = np.core.records.fromarrays( ([3,4,5], ["three","four","three"]), dtype = [('f0', '<i4'), ('f1', '|S8')])

Concatenation then works as desired:

In [11]: np.concatenate( (a,b))
Out[11]: 
array([(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four'), (5, 'three')], 
      dtype=[('f0', '<i4'), ('f1', '|S8')])
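One thing to keep in mind with a fixed width is that NumPy silently truncates values longer than the declared size. A minimal sketch, assuming Python 3 and using 'U8' in place of '|S8' so the values are ordinary strings (the array name `rec` is illustrative):

```python
import numpy as np

# 'U8' is used here instead of '|S8' so the values are ordinary
# Python 3 strings; the truncation behaviour is the same.
rec = np.zeros(2, dtype=[("f0", "<i4"), ("f1", "U8")])
rec["f0"] = [1, 2]
rec["f1"] = ["three", "seventeen"]  # 'seventeen' has 9 characters

print(rec["f1"].tolist())  # ['three', 'seventee']
```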

If you do not know the maximum string length in advance, you can specify the dtype as 'object':

In [35]: a = np.core.records.fromarrays( ([1,2], ["one","two"]), dtype = [('f0', '<i4'), ('f1', 'object')])

In [36]: b = np.core.records.fromarrays( ([3,4,5], ["three","four","three"]), dtype = [('f0', '<i4'), ('f1', 'object')])

In [37]: np.concatenate( (a,b))
Out[37]: 
array([(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four'), (5, 'three')], 
      dtype=[('f0', '<i4'), ('f1', '|O4')])

This is not as space-efficient as a dtype of '|Sn' (for some integer n), but at least it lets you perform the concatenate operation.

score 2 · Accepted Answer

Would numpy.lib.recfunctions.merge_arrays work for you? recfunctions is a little-known subpackage that has not been advertised much. It is a bit clunky, but it can be useful at times.
