python - Understanding indexing issues in Pandas 0.8.1 (and 0.11)

Question

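Here is an example from an IPython session in which some simple indexing and assignment on a Pandas DataFrame works, while other assignments that look just as simple do not: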

In [652]: dfrm = pandas.DataFrame(np.random.rand(10,3), columns=['A', 'B', 'C'])

In [653]: dfrm
Out[653]:
          A         B         C
0  0.777147  0.558404  0.424222
1  0.906354  0.111197  0.492625
2  0.011354  0.468661  0.056303
3  0.118818  0.117526  0.649210
4  0.746045  0.583369  0.962173
5  0.374871  0.285712  0.868599
6  0.223596  0.963223  0.012154
7  0.969879  0.043160  0.891143
8  0.527701  0.992965  0.073797
9  0.553854  0.969303  0.523098

In [654]: dfrm['A'][dfrm.A > 0.5] = [1,2,3,4,5,6]

In [655]: dfrm
Out[655]:
          A         B         C
0  1.000000  0.558404  0.424222
1  2.000000  0.111197  0.492625
2  0.011354  0.468661  0.056303
3  0.118818  0.117526  0.649210
4  3.000000  0.583369  0.962173
5  0.374871  0.285712  0.868599
6  0.223596  0.963223  0.012154
7  4.000000  0.043160  0.891143
8  5.000000  0.992965  0.073797
9  6.000000  0.969303  0.523098

In [656]: dfrm[['B','C']][dfrm.A > 0.5] = 100*np.random.rand(6,2)

In [657]: dfrm
Out[657]:
          A         B         C
0  1.000000  0.558404  0.424222
1  2.000000  0.111197  0.492625
2  0.011354  0.468661  0.056303
3  0.118818  0.117526  0.649210
4  3.000000  0.583369  0.962173
5  0.374871  0.285712  0.868599
6  0.223596  0.963223  0.012154
7  4.000000  0.043160  0.891143
8  5.000000  0.992965  0.073797
9  6.000000  0.969303  0.523098

In [658]: dfrm[dfrm.A > 0.5] = 100*np.random.rand(6,3)

In [659]: dfrm
Out[659]:
           A          B          C
0  27.738118  18.812116  46.369840
1  35.335223  58.365611   7.773464
2   0.011354   0.468661   0.056303
3   0.118818   0.117526   0.649210
4  97.439481  98.621074  69.816171
5   0.374871   0.285712   0.868599
6   0.223596   0.963223   0.012154
7  53.609637  30.952762  81.379502
8  68.473117  16.261694  91.092718
9  82.253724  94.979991  72.571951

In [660]: dfrm[dfrm.A > 0.5] = 0.5*dfrm[dfrm.A > 0.5]
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-660-35fb8e212806> in <module>()
----> 1 dfrm[dfrm.A > 0.5] = 0.5*dfrm[dfrm.A > 0.5]

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/frame.pyc in __setitem__(self, key, value)
   1707             self._boolean_set(key, value)
   1708         elif isinstance(key, (np.ndarray, list)):
-> 1709             return self._set_item_multiple(key, value)
   1710         else:
   1711             # set column

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/frame.pyc in _set_item_multiple(self, keys, value)
   1728     def _set_item_multiple(self, keys, value):
   1729         if isinstance(value, DataFrame):
-> 1730             assert(len(value.columns) == len(keys))
   1731             for k1, k2 in zip(keys, value.columns):
   1732                 self[k1] = value[k2]

AssertionError:

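Can anyone explain why some of these (but not all) work, and why the last one actually raises an error?

Update:

We have Pandas 0.11 installed, but it is not the default version for development, so for now it is only a sort of sandbox for me. Even when I repeat this example in 0.11, I see the same assignment problem, except that the last example now works without an error. The confusion about the conventions for how the original DataFrame's __setitem__ gets called remains, though: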

Python 2.7.3 |EPD 7.3-2 (64-bit)| (default, Apr 11 2012, 17:52:16)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-44)] on linux2
Type "credits", "demo" or "enthought" for more information.
Hello
>>> import pandas
>>> pandas.__version__
'0.11.0'
>>> dfrm = pandas.DataFrame(np.random.rand(10,3), columns=['A', 'B', 'C'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'np' is not defined
>>> import numpy as np
>>> dfrm = pandas.DataFrame(np.random.rand(10,3), columns=['A', 'B', 'C'])
>>> dfrm
          A         B         C
0  0.745516  0.062613  0.147684
1  0.369141  0.447022  0.114963
2  0.820178  0.946806  0.687971
3  0.771971  0.934799  0.633633
4  0.828249  0.065587  0.848788
5  0.433796  0.740885  0.160140
6  0.663891  0.753134  0.849269
7  0.647054  0.962267  0.453865
8  0.345706  0.030634  0.058697
9  0.994135  0.990536  0.436903
>>> dfrm[dfrm.A > 0.5]
          A         B         C
0  0.745516  0.062613  0.147684
2  0.820178  0.946806  0.687971
3  0.771971  0.934799  0.633633
4  0.828249  0.065587  0.848788
6  0.663891  0.753134  0.849269
7  0.647054  0.962267  0.453865
9  0.994135  0.990536  0.436903
>>> len(dfrm[dfrm.A > 0.5])
7
>>> dfrm['A'][dfrm.A > 0.5] = [1,2,3,4,5,6,7]
>>> dfrm
          A         B         C
0  1.000000  0.062613  0.147684
1  0.369141  0.447022  0.114963
2  2.000000  0.946806  0.687971
3  3.000000  0.934799  0.633633
4  4.000000  0.065587  0.848788
5  0.433796  0.740885  0.160140
6  5.000000  0.753134  0.849269
7  6.000000  0.962267  0.453865
8  0.345706  0.030634  0.058697
9  7.000000  0.990536  0.436903
>>> dfrm[['B','C']][dfrm.A > 0.5] = 100*np.random.rand(7,2)
>>> dfrm
          A         B         C
0  1.000000  0.062613  0.147684
1  0.369141  0.447022  0.114963
2  2.000000  0.946806  0.687971
3  3.000000  0.934799  0.633633
4  4.000000  0.065587  0.848788
5  0.433796  0.740885  0.160140
6  5.000000  0.753134  0.849269
7  6.000000  0.962267  0.453865
8  0.345706  0.030634  0.058697
9  7.000000  0.990536  0.436903
>>> dfrm[dfrm.A > 0.5] = 0.5*dfrm[dfrm.A > 0.5]
>>> dfrm
          A         B         C
0  0.500000  0.031306  0.073842
1  0.369141  0.447022  0.114963
2  1.000000  0.473403  0.343985
3  1.500000  0.467400  0.316816
4  2.000000  0.032794  0.424394
5  0.433796  0.740885  0.160140
6  2.500000  0.376567  0.424635
7  3.000000  0.481133  0.226933
8  0.345706  0.030634  0.058697
9  3.500000  0.495268  0.218452
>>>

Second update:

Here is another very unexpected behavior:

In [681]: id(dfrm.A)
Out[681]: 298480536

In [682]: id(dfrm.A)
Out[682]: 298480536

In [683]: id(dfrm.A)
Out[683]: 298480536

In [684]: id(dfrm['A'])
Out[684]: 298480536

In [685]: id(dfrm['A'])
Out[685]: 298480536

In [686]: id(dfrm['A'])
Out[686]: 298480536

In [687]: id(dfrm[['A']])
Out[687]: 281536912

In [688]: id(dfrm[['A']])
Out[688]: 281535824

In [689]: id(dfrm[['A']])
Out[689]: 281536336

score 1 · Accepted Answer

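Assigning through two or more chained getitems/slices may or may not work, depending on the situation...
so you should avoid doing it! Rewrite the code so that each assignment happens in a single step.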
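To make the difference concrete, here is a minimal sketch (not taken from the session above) that should run on a recent pandas. The deterministic frame and the names `mask`, `before`, `chained_changed`, and `single_step_changed` are illustrative assumptions. As far as I know, `dfrm[['B', 'C']]` hands back a new object, so the chained assignment writes into a temporary and the original frame stays untouched, while the single `.loc` call writes into `dfrm` itself.

```python
import warnings

import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame in the question (assumption):
# column A holds 0.0, 0.1, ..., 0.9, so exactly four rows satisfy A > 0.5.
dfrm = pd.DataFrame(
    np.arange(30, dtype=float).reshape(10, 3) / 30,
    columns=["A", "B", "C"],
)
mask = dfrm["A"] > 0.5
before = dfrm.copy()

# Chained form: the column selection creates a temporary DataFrame and the
# boolean assignment lands on that temporary. pandas usually warns about
# this; the warning is silenced here only to keep the output quiet.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    dfrm[["B", "C"]][mask] = 100.0
chained_changed = not dfrm.equals(before)

# Single-step form: one indexing call that selects rows and columns together.
dfrm.loc[mask, ["B", "C"]] = 100.0
single_step_changed = not dfrm.equals(before)

print("chained assignment changed dfrm:", chained_changed)
print("single .loc assignment changed dfrm:", single_step_changed)
```

The chained single-column form (`dfrm['A'][mask] = ...`) happened to work in the question because that selection returned a view in those versions; that is exactly the kind of version-dependent detail the single-step form avoids.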

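A fair amount of work went into 0.11 (and possibly earlier releases) to clean up this behavior... pandas now overloads these assignments so that it does not matter whether the selection is a view or a copy, provided you do the assignment in one step, which is what you should generally be doing anyway.
For example: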

dfrm.loc[dfrm.A > 0.5, 'A'] = [1, 2, 3, 4, 5, 6]

dfrm.loc[dfrm.A > 0.5, ['B','C']] = 100 * np.random.rand(6, 2)

Also, it is generally good practice to state explicitly that you are indexing by label (using loc):

dfrm.loc[dfrm.A > 0.5] = 100 * np.random.rand(6, 3)
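One detail worth noting about these lines: the hard-coded `6` only fits if exactly six rows match the mask (the 0.11 session in the question had seven). A hedged sketch of sizing the replacement array from the mask instead, using a deterministic frame and illustrative names (`mask`, `n_rows`, `new_values`) that are my own assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Deterministic stand-in for the question's frame (assumption):
# column A holds 0.0, 0.1, ..., 0.9.
dfrm = pd.DataFrame(
    np.arange(30, dtype=float).reshape(10, 3) / 30,
    columns=["A", "B", "C"],
)
original_b = dfrm["B"].copy()

mask = dfrm["A"] > 0.5
n_rows = int(mask.sum())  # number of rows the mask selects

# Build a replacement block with one row per selected row and one column
# per target column, then assign it in a single .loc call.
new_values = 100 * rng.random((n_rows, 2))
dfrm.loc[mask, ["B", "C"]] = new_values

print(dfrm)
```

Deriving the shape from `mask.sum()` keeps the right-hand side in step with the data, so the same line works whether six or seven rows happen to match.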

You might also consider rewriting:

dfrm.loc[dfrm.A > 0.5] = 0.5 * dfrm.loc[dfrm.A > 0.5]

as

dfrm.loc[dfrm.A > 0.5] *= 0.5

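The error was a surprising bug in 0.8.1 (it seems to have been fixed in later versions). One possible workaround, if the above does not work, is to build the boolean index first (df_A_gt_half = dfrm.A > 0.5) and then use it for the assignment; on 0.8.1 you would be forced to use ix rather than loc.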
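As a rough sketch of that workaround on a current pandas: `.ix` was removed in pandas 1.0, so `.loc` is used here instead. The deterministic frame and the names `df_A_gt_half` and `expected` are assumptions for illustration, not part of the original session.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the question's frame (assumption):
# column A holds 0.0, 0.1, ..., 0.9.
dfrm = pd.DataFrame(
    np.arange(30, dtype=float).reshape(10, 3) / 30,
    columns=["A", "B", "C"],
)

# Compute the boolean index once, before any values change.
df_A_gt_half = dfrm["A"] > 0.5
expected = 0.5 * dfrm.loc[df_A_gt_half]

# Reuse the same mask on both sides of the assignment.
dfrm.loc[df_A_gt_half] = 0.5 * dfrm.loc[df_A_gt_half]

print(dfrm.loc[df_A_gt_half])
```

Keeping the mask in a variable also avoids a subtle trap: once column A has been halved, re-evaluating `dfrm.A > 0.5` would select a different set of rows.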
