python - How to convert a numpy record array of booleans to integers to compute covariance?

Question

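I have a record array with about 500k entries and about 40 dimensions. The dimensions are a mix of data types. I want to sub-select 5 boolean dimensions, take chunks of about 1k entries, and compute a covariance matrix to see how the dimensions correlate. I cannot figure out how to use .view() or .astype() for this conversion. The initial sub-selection: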

p_new[['no_gender', 'no_age', 'no_income', 'no_politics', 'no_edu']]
array([(False, False, True, False, False), (True, True, False, True, True),
       (True, True, False, True, True), ...,
       (True, True, True, True, True), (True, True, True, True, True),
       (True, True, True, True, True)], 
      dtype=[('no_gender', '|b1'), ('no_age', '|b1'), ('no_income', '|b1'), ('no_politics', '|b1'), ('no_edu', '|b1')])

All my conversion attempts collapse my 5 dimensions into 1 (not wanted!), so instead of going from (1000,5) dtype=np.bool to (1000,5) dtype=np.int32, I end up with (1000,1) dtype=np.int32.

score 1 · Accepted Answer

Notice that in a recarray each record is treated as a single element, i.e. for the following array the shape is (3,), not (3, 5).

import numpy as np

A = np.array([('joe', 44, True, True, False),
              ('jill', 22, False, False, False),
              ('Jack', 21, True, False, True)],
             # each field spec must be a (name, type) tuple, not a list
             dtype=[('name', 'S4'), ('age', int), ('x', bool),
                    ('y', bool), ('z', bool)])
print(A.shape)
# (3,)

The easiest way to do what you're asking for is probably something like:

tmp = [A[field] for field in ['x', 'y', 'z']]
tmp = np.array(tmp, dtype=int)
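
As a minimal sketch of how the stacked result could feed into the covariance step from the question: the list comprehension gives one row per field, which happens to be the layout np.cov expects by default (each row is a variable). The name field uses 'U4' here only so the example runs unchanged on Python 3, and the variable names fields, stacked and cov are illustrative.

```python
import numpy as np

A = np.array([('joe', 44, True, True, False),
              ('jill', 22, False, False, False),
              ('Jack', 21, True, False, True)],
             dtype=[('name', 'U4'), ('age', int), ('x', bool),
                    ('y', bool), ('z', bool)])

fields = ['x', 'y', 'z']

# One row per field, one column per record: shape (n_fields, n_records).
stacked = np.array([A[f] for f in fields], dtype=int)

# np.cov treats each row as a variable by default, so no transpose is needed.
cov = np.cov(stacked)
print(stacked.shape, cov.shape)
```

If a (n_records, n_fields) layout is preferred, stacked.T together with np.cov(stacked.T, rowvar=False) should give the same matrix.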

You might also be able to use views, but using views for arrays with mixed data types can get kind of tricky.

score 1 · Accepted Answer

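You actually do not have to convert the booleans to integers at all. In Python, True and False are instances of bool, which is a subclass of int, so you can simply do all the usual math operations on them. True is 1 and False is 0.

Proof: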

>>> isinstance(True, int)
True
>>> isinstance(False, int)
True
>>> (True + True * 3) // (True + False)
4

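I admit, though, that I am not 100% sure about the numpy data types and how they interact with what you are trying to do.

Update

After looking into the numpy data types a little, they do seem to show similar but not identical behavior. When this was written, numpy.bool was literally the same object as bool, the standard Python boolean, so it showed all the same behavior and could be used as an integer. That alias was removed in NumPy 1.24, and the elements actually stored in a boolean array are numpy.bool_, which is not a subclass of int: adding two of them gives a logical OR rather than 2. Likewise, numpy.int32 is not a subclass of int, so isinstance(numpy.bool(1), numpy.int32) naturally evaluates to False. Perhaps you would have less trouble going straight to int / numpy.int (the latter was also just an alias for int and was removed along with numpy.bool)?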
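
A small sketch of where Python booleans and NumPy booleans diverge, which is the reason an explicit conversion is usually the safer route before doing arithmetic. The array flags and the other variable names are made up for illustration.

```python
import numpy as np

flags = np.array([True, True, False])

# Python bools are ints, so addition counts them.
py_sum = True + True

# NumPy bool scalars are not ints; "+" on two of them is a logical OR
# and the result stays boolean.
np_sum = flags[0] + flags[1]

# An explicit cast gives ordinary 1/0 integers.
as_int = flags.astype(int)

# sum() on a boolean array counts the True values.
total = flags.sum()

print(py_sum, np_sum, as_int, total)
print(isinstance(flags[0], int))
```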

score 1 · Accepted Answer

You can create a new dtype and then use a.astype(new_dtype):

In [44]: a
Out[44]: 
array([(False, False, True, False, False), (True, True, False, True, True),
       (True, True, False, True, True), (True, True, True, True, True),
       (True, True, True, True, True), (True, True, True, True, True)], 
      dtype=[('no_gender', '|b1'), ('no_age', '|b1'), 
             ('no_income', '|b1'), ('no_politics', '|b1'), ('no_edu', '|b1')])

In [45]: new_dtype = np.dtype([(name, int) for name in a.dtype.names])

In [46]: a.astype(new_dtype)
Out[46]: 
array([(0, 0, 1, 0, 0), (1, 1, 0, 1, 1), (1, 1, 0, 1, 1), (1, 1, 1, 1, 1),
       (1, 1, 1, 1, 1), (1, 1, 1, 1, 1)], 
      dtype=[('no_gender', '<i8'), ('no_age', '<i8'), ('no_income', '<i8'),
             ('no_politics', '<i8'), ('no_edu', '<i8')])
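
Note that the result of astype here is still a structured array with one record per row, so its shape stays (n,) rather than (n, 5). One possible way to get a plain 2-D integer table from it is sketched below, assuming a shortened three-field version of the array; np.int64 is used so the field type does not depend on the platform, and the names converted and table are illustrative.

```python
import numpy as np

a = np.array([(False, False, True), (True, True, False), (True, True, True)],
             dtype=[('no_gender', bool), ('no_age', bool), ('no_income', bool)])

new_dtype = np.dtype([(name, np.int64) for name in a.dtype.names])
converted = a.astype(new_dtype)   # still 1-D: one integer record per row

# Pull each field out as a column to build an ordinary (n, n_fields) array.
table = np.column_stack([converted[name] for name in converted.dtype.names])
print(converted.shape, table.shape)
```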

score 1 · Accepted Answer

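I guess your problem is that you are operating on whole rows when you change the type. If you view the data as a bool array you get all the individual values, and then you can apply astype. But you have to reshape afterwards.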

pnew.view("bool").astype(int).reshape(len(pnew),-1)

It is easier to use .tolist(), but that will probably use more memory and may be slower.

np.asarray(pnew.tolist()).astype(int)
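
On NumPy 1.16 and later, selecting several fields returns a view that keeps the original record layout, so a plain .view("bool") on the sub-selection may not behave as expected. One alternative is structured_to_unstructured from numpy.lib.recfunctions. The sketch below assumes made-up data: the fields id and score, the random values, and the chunk size are placeholders, not the asker's real array.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

rng = np.random.default_rng(0)
n = 3000

# Hypothetical mixed-type record array standing in for p_new.
p_new = np.zeros(n, dtype=[('id', np.int64), ('no_gender', bool),
                           ('no_age', bool), ('score', np.float64),
                           ('no_income', bool)])
cols = ['no_gender', 'no_age', 'no_income']
for name in cols:
    p_new[name] = rng.random(n) < 0.5

# Convert the selected boolean fields to a plain (n, 3) integer array.
flags = rfn.structured_to_unstructured(p_new[cols], dtype=np.int64)

# One covariance matrix per chunk of 1000 records; columns are the variables.
chunk = 1000
covs = [np.cov(flags[i:i + chunk], rowvar=False) for i in range(0, n, chunk)]
print(flags.shape, len(covs), covs[0].shape)
```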
