pandas - change categorical variable levels to ones I supply / merge the levels of two categorical variables

Question

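The following situation comes up often in my data analysis. Suppose I have two data vectors, x and y, from some observations. x has more data points and therefore contains some values that were not observed in y. Now I want to turn both into categorical variables.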

import pandas

x = ['a','b','c','d','e']  # data points
y = ['a','c','e']          # data of the same nature as x but with fewer data points

fx = pandas.Categorical.from_array(x)
fy = pandas.Categorical.from_array(y)

print fx
print fy

Categorical: 
array([a, b, c, d, e], dtype=object)
Levels (5): Index([a, b, c, d, e], dtype=object)

Categorical: 
array([a, c, e], dtype=object)
Levels (3): Index([a, c, e], dtype=object)

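I see that they now have different levels, so the labels mean different things (1 means b in fx but c in fy).

This obviously makes it hard to write code that uses both fx and fy, because such code expects fx.labels and fy.labels to share the same encoding/meaning.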

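But I can't see how to "normalize" fx and fy so that they have the same levels and fx.labels and fy.labels use the same encoding. Simply assigning fy.levels = fx.levels obviously doesn't work. As shown below, it changes the meaning of the labels from [a, c, e] to [a, b, c].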

fy.levels = fx.levels
print fy

Categorical: 
array([a, b, c], dtype=object)
Levels (5): Index([a, b, c, d, e], dtype=object)

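Does anyone have any ideas?

Another related scenario: I have an existing, known index and want to factorize data against it. For example, I know that every data point must take one of the five values [a, b, c, d, e], and I already have the index Index([a, b, c, d, e], dtype=object). I want to factorize the vector y=['a','c','e'] into a categorical variable whose levels are Index([a, b, c, d, e], dtype=object). I'm not sure how to do this either and would appreciate any pointers.

P.S. Doing this kind of thing in R is possible but cumbersome.

Thanks, Tom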

score 4 · Accepted Answer

In [6]: fxd = {fx.levels[i]: i for i in range(len(fx.levels))}

In [7]: fy.labels = [fxd[v] for v in fy]

In [8]: fy.levels = fx.levels

In [9]: fy
Out[9]: 
Categorical: 
array([a, c, e], dtype=object)
Levels (5): Index([a, b, c, d, e], dtype=object)
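
For readers on newer pandas: as far as I know, `levels` and `labels` were renamed to `categories` and `codes` in pandas 0.15, so the assignments above will not run as written on recent versions. A minimal sketch of the same idea, assuming a recent pandas and reusing the variable names from the question, is to pass the shared categories when building the categorical instead of remapping the codes by hand:

```python
import pandas as pd

x = ['a', 'b', 'c', 'd', 'e']
y = ['a', 'c', 'e']

fx = pd.Categorical(x)

# Encode y against fx's categories rather than letting pandas infer
# a separate (shorter) category list from y alone.
fy = pd.Categorical(y, categories=fx.categories)

print(fx.codes.tolist())
print(fy.codes.tolist())
print(list(fy.categories))
```

This also covers the second scenario in the question: any known list or Index of allowed values can be passed as `categories`. Values of y that are not in that list would become NaN rather than raise an error.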

score 4 · Accepted Answer

The get_indexer() method can be used to create the index arrays:

import pandas as pd

x = ['a','b','c','d','e']  # data points
y = ['a','c','e']          # data of the same nature as x but with fewer data points
idx = pd.Index(pd.unique(x+y))
cx = pd.Categorical(idx.get_indexer(x), idx)
cy = pd.Categorical(idx.get_indexer(y), idx)
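
On recent pandas the `Categorical` constructor treats its first argument as values rather than integer codes, so the snippet above may not give the intended result there. A hedged sketch of the same get_indexer() approach, assuming a recent pandas and using `Categorical.from_codes` to state explicitly that the integers are codes:

```python
import pandas as pd

x = ['a', 'b', 'c', 'd', 'e']
y = ['a', 'c', 'e']

# Shared categories: every distinct value seen in either vector,
# in order of first appearance.
idx = pd.Index(x + y).unique()

# get_indexer returns, for each value, its position in idx
# (or -1 if the value is not present in idx).
codes_x = idx.get_indexer(x)
codes_y = idx.get_indexer(y)

cx = pd.Categorical.from_codes(codes_x, categories=idx)
cy = pd.Categorical.from_codes(codes_y, categories=idx)

print(codes_y.tolist())
print(list(cy))
```

A code of -1 is interpreted by `from_codes` as a missing value, so a value outside idx would show up as NaN in the result.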

score 0 · Accepted Answer

Regarding Garrett's answer: in my version of pandas (0.20.3), fx.levels raises AttributeError: 'Categorical' object has no attribute 'levels'. What does work is:

missing_levels = set(fx) - set(fy)
fy = fy.add_categories(missing_levels)

Or with inplace=True (slightly faster):

missing_levels = set(fx) - set(fy)
fy.add_categories(missing_levels, inplace=True)
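
Two caveats, offered tentatively. First, add_categories appends the new categories at the end, so fy ends up with the same set of categories as fx but not necessarily in the same order or with the same codes. Second, I believe the `inplace` argument was deprecated and later removed in newer pandas releases. A minimal sketch that sidesteps both, assuming a recent pandas, is to give both categoricals one common, ordered category list with set_categories:

```python
import pandas as pd

fx = pd.Categorical(['a', 'b', 'c', 'd', 'e'])
fy = pd.Categorical(['a', 'c', 'e'])

# Union of both category sets; Index.union sorts the result when it can,
# which gives a deterministic order (unlike a Python set difference).
all_cats = fx.categories.union(fy.categories)

# set_categories returns a new Categorical re-encoded against all_cats.
fx = fx.set_categories(all_cats)
fy = fy.set_categories(all_cats)

print(list(fy.categories))
print(fy.codes.tolist())
```

After this, equal codes in fx and fy refer to the same category, which is what the question asks for. It also works when neither vector's values are a superset of the other's.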
