python - Python Pandas and NumPy.where behavior with mismatched number of rows

Question

I am using Pandas 0.8.1 in all of the examples below, but I can confirm that the same examples behave the same way for me with Pandas 0.11.

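Solutions that rely on upgrading Pandas to a newer version do not apply to my current problem (though please feel free to add comments, rather than answers, about whether this is fixed in newer Pandas versions).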

I have a sample Pandas DataFrame object:

In [20]: dfrm
Out[20]:
          A         B         C     D
0  1.202034 -0.285256  0.392160     0
1  1.799628 -0.169389 -0.305984     3
2  1.262144 -1.165034 -1.780316     6
3 -0.355975  1.610605  1.298506  None
4 -0.139220  0.024292  0.132928    12
5  0.921821 -0.109189 -0.539100    15
6  0.987901 -1.253987 -1.139684    18
7  2.170929  0.520814 -0.139740   NaN
8 -2.329704 -0.475419  1.473144    24
9  1.161275  0.918900 -1.077892    27

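First, I am a little puzzled by a type error I am seeing. If I try to use numpy.where to create string labels for different subsets of a particular column, it looks like the string nature of the labels causes an error.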

In [21]: np.where(dfrm['D'] > 12, 'L', 'S')
Out[21]: ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-21-a40c5cd8713c> in <module>()
----> 1 np.where(dfrm['D'] > 12, 'L', 'S')

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/IPython/core/displayhook.pyc in __call__(self, result)
    236             self.start_displayhook()
    237             self.write_output_prompt()
--> 238             format_dict = self.compute_format_data(result)
    239             self.write_format_data(format_dict)
    240             self.update_user_ns(result)

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/IPython/core/displayhook.pyc in compute_format_data(self, result)
    148             MIME type representation of the object.
    149         """
--> 150         return self.shell.display_formatter.format(result)
    151
    152     def write_format_data(self, format_dict):

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/IPython/core/formatters.pyc in format(self, obj, include, exclude)
    124                     continue
    125             try:
--> 126                 data = formatter(obj)
    127             except:
    128                 # FIXME: log the exception

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/IPython/core/formatters.pyc in __call__(self, obj)
    445                 type_pprinters=self.type_printers,
    446                 deferred_pprinters=self.deferred_printers)
--> 447             printer.pretty(obj)
    448             printer.flush()
    449             return stream.getvalue()

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/IPython/lib/pretty.pyc in pretty(self, obj)
    358                             if callable(meth):
    359                                 return meth(obj, self, cycle)
--> 360             return _default_pprint(obj, self, cycle)
    361         finally:
    362             self.end_group()

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/IPython/lib/pretty.pyc in _default_pprint(obj, p, cycle)
    478     if getattr(klass, '__repr__', None) not in _baseclass_reprs:
    479         # A user-provided repr.
--> 480         p.text(repr(obj))
    481         return
    482     p.begin_group(1, '<')

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/series.pyc in __repr__(self)
    772             result = self._get_repr(print_header=True,
    773                                     length=len(self) > 50,
--> 774                                     name=True)
    775         else:
    776             result = '%s' % ndarray.__repr__(self)

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/series.pyc in _get_repr(self, name, print_header, length, na_rep, float_format)
    833                                         length=length, na_rep=na_rep,
    834                                         float_format=float_format)
--> 835         return formatter.to_string()
    836
    837     def __str__(self):

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/format.pyc in to_string(self)
    109
    110         fmt_index, have_header = self._get_formatted_index()
--> 111         fmt_values = self._get_formatted_values()
    112
    113         maxlen = max(len(x) for x in fmt_index)

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/format.pyc in _get_formatted_values(self)
    100         return format_array(self.series.values, None,
    101                             float_format=self.float_format,
--> 102                             na_rep=self.na_rep)
    103
    104     def to_string(self):

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/format.pyc in format_array(values, formatter, float_format, na_rep, digits, space, justify)
    460                         justify=justify)
    461
--> 462     return fmt_obj.get_result()
    463
    464

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/format.pyc in get_result(self)
    479             fmt_values = self._format_strings(use_unicode=True)
    480         else:
--> 481             fmt_values = self._format_strings(use_unicode=False)
    482
    483         return _make_fixed_width(fmt_values, self.justify)

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/format.pyc in _format_strings(self, use_unicode)
    512         vals = self.values
    513
--> 514         is_float = lib.map_infer(vals, com.is_float) & notnull(vals)
    515         leading_space = is_float.any()
    516

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/common.pyc in notnull(obj)
    100     boolean ndarray or boolean
    101     '''
--> 102     res = isnull(obj)
    103     if np.isscalar(res):
    104         return not res

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/common.pyc in isnull(obj)
     58     from pandas.core.generic import PandasObject
     59     if isinstance(obj, np.ndarray):
---> 60         return _isnull_ndarraylike(obj)
     61     elif isinstance(obj, PandasObject):
     62         # TODO: optimize for DataFrame, etc.

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/common.pyc in _isnull_ndarraylike(obj)
     75         shape = values.shape
     76         result = np.empty(shape, dtype=bool)
---> 77         vec = lib.isnullobj(values.ravel())
     78         result[:] = vec.reshape(shape)
     79

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.isnullobj (pandas/src/tseries.c:5269)()

ValueError: Does not understand character buffer dtype format string ('s')

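If I replace the strings 'L' and 'S' with integers such as -1 and 1, it works fine, so that is one workaround. But the odd part is what happens if I mix the output of np.where with a DataFrame that has fewer rows.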

In [22]: dfrm1 = dfrm.ix[0:7]

In [23]: dfrm1
Out[23]:
          A         B         C     D
0  1.202034 -0.285256  0.392160     0
1  1.799628 -0.169389 -0.305984     3
2  1.262144 -1.165034 -1.780316     6
3 -0.355975  1.610605  1.298506  None
4 -0.139220  0.024292  0.132928    12
5  0.921821 -0.109189 -0.539100    15
6  0.987901 -1.253987 -1.139684    18
7  2.170929  0.520814 -0.139740   NaN

In [24]: dfrm
Out[24]:
          A         B         C     D
0  1.202034 -0.285256  0.392160     0
1  1.799628 -0.169389 -0.305984     3
2  1.262144 -1.165034 -1.780316     6
3 -0.355975  1.610605  1.298506  None
4 -0.139220  0.024292  0.132928    12
5  0.921821 -0.109189 -0.539100    15
6  0.987901 -1.253987 -1.139684    18
7  2.170929  0.520814 -0.139740   NaN
8 -2.329704 -0.475419  1.473144    24
9  1.161275  0.918900 -1.077892    27

**Why does the following line work without error?**

In [25]: dfrm1['E'] = np.where(dfrm['D'] > 12, -1, 1)

In [26]: dfrm1
Out[26]:
          A         B         C     D  E
0  1.202034 -0.285256  0.392160     0  1
1  1.799628 -0.169389 -0.305984     3  1
2  1.262144 -1.165034 -1.780316     6  1
3 -0.355975  1.610605  1.298506  None  1
4 -0.139220  0.024292  0.132928    12  1
5  0.921821 -0.109189 -0.539100    15 -1
6  0.987901 -1.253987 -1.139684    18 -1
7  2.170929  0.520814 -0.139740   NaN  1

Even if I save the output first (the np.where result does not have the correct number of rows for the smaller DataFrame dfrm1), using the saved object works.

In [28]: tmp = np.where(dfrm['D'] > 12, -1, 1)

In [29]: tmp
Out[29]:
0    1
1    1
2    1
3    1
4    1
5   -1
6   -1
7    1
8   -1
9   -1
Name: D

In [30]: dfrm1['F'] = tmp

In [31]: dfrm1
Out[31]:
          A         B         C     D  E  F
0  1.202034 -0.285256  0.392160     0  1  1
1  1.799628 -0.169389 -0.305984     3  1  1
2  1.262144 -1.165034 -1.780316     6  1  1
3 -0.355975  1.610605  1.298506  None  1  1
4 -0.139220  0.024292  0.132928    12  1  1
5  0.921821 -0.109189 -0.539100    15 -1 -1
6  0.987901 -1.253987 -1.139684    18 -1 -1
7  2.170929  0.520814 -0.139740   NaN  1  1

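I thought this might be because Pandas somehow shares metadata about the index object, and possibly truncates the data on insertion if it comes from an object with the same index.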

In [33]: tmp1 = tmp.reset_index(drop=True)

In [34]: dfrm1['G'] = tmp1

In [35]: dfrm1
Out[35]:
          A         B         C     D  E  F  G
0  1.202034 -0.285256  0.392160     0  1  1  1
1  1.799628 -0.169389 -0.305984     3  1  1  1
2  1.262144 -1.165034 -1.780316     6  1  1  1
3 -0.355975  1.610605  1.298506  None  1  1  1
4 -0.139220  0.024292  0.132928    12  1  1  1
5  0.921821 -0.109189 -0.539100    15 -1 -1 -1
6  0.987901 -1.253987 -1.139684    18 -1 -1 -1
7  2.170929  0.520814 -0.139740   NaN  1  1  1

But even after inspecting the object ids of the Index objects, there is no clear pattern.

In [36]: id(tmp.index)
Out[36]: 96118016

In [37]: id(tmp1.index)
Out[37]: 104735160

In [38]: id(dfrm.index)
Out[38]: 96118016

In [39]: id(dfrm1.index)
Out[39]: 104322304

Note that if I simply try to assign a sequence of data with the wrong length, it fails:

In [40]: dfrm1['H'] = np.arange(10)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-40-987f4eb97131> in <module>()
----> 1 dfrm1['H'] = np.arange(10)

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/frame.pyc in __setitem__(self, key, value)
   1710         else:
   1711             # set column
-> 1712             self._set_item(key, value)
   1713
   1714     def _boolean_set(self, key, value):

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/frame.pyc in _set_item(self, key, value)
   1749         ensure homogeneity.
   1750         """
-> 1751         value = self._sanitize_column(key, value)
   1752         NDFrame._set_item(self, key, value)
   1753

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/frame.pyc in _sanitize_column(self, key, value)
   1778                     value = value.reindex(self.index).values
   1779             else:
-> 1780                 assert(len(value) == len(self.index))
   1781
   1782                 if not isinstance(value, np.ndarray):

AssertionError:

In [41]: dfrm1['H'] = np.arange(8)

In [42]: dfrm1
Out[42]:
          A         B         C     D  E  F  G  H
0  1.202034 -0.285256  0.392160     0  1  1  1  0
1  1.799628 -0.169389 -0.305984     3  1  1  1  1
2  1.262144 -1.165034 -1.780316     6  1  1  1  2
3 -0.355975  1.610605  1.298506  None  1  1  1  3
4 -0.139220  0.024292  0.132928    12  1  1  1  4
5  0.921821 -0.109189 -0.539100    15 -1 -1 -1  5
6  0.987901 -1.253987 -1.139684    18 -1 -1 -1  6
7  2.170929  0.520814 -0.139740   NaN  1  1  1  7

Why is the output of np.where handled differently?

score 4 · Accepted Answer

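This is expected; you are assigning a Series to a DataFrame column. It is aligned, so being shorter (or longer) is irrelevant. The index labels are matched and those values are taken. The reason a plain numpy array only works when it has the same length is simple: alignment is not possible, so it has to be the same length.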
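The alignment rule above can be illustrated with a minimal sketch. It assumes a recent pandas release, where `np.where` returns a plain ndarray rather than a Series (unlike the 0.8.1 session in the question), so the result is wrapped in a `pd.Series` explicitly. The names `full`, `small`, and `labels` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical 10-row frame and an 8-row copy of its first rows.
full = pd.DataFrame({"D": [0, 3, 6, 9, 12, 15, 18, 21, 24, 27]})
small = full.loc[0:7].copy()  # .loc slicing is label-inclusive: rows 0..7

# Wrap the np.where result in a Series that carries the 10-row index.
labels = pd.Series(np.where(full["D"] > 12, -1, 1), index=full.index)

# Series assignment aligns on index labels, so rows 8 and 9 are ignored.
small["E"] = labels

# A bare ndarray has no labels, so its length must match exactly.
try:
    small["H"] = np.arange(10)
    length_error = None
except ValueError as exc:
    length_error = exc
```

In recent pandas the length mismatch surfaces as a `ValueError`; the `AssertionError` in the question's traceback is specific to the old version used there.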

The id of the index is not relevant; what matters is whether the indexes compare equal, e.g. i1.equals(i2).
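As a small sketch of that point, two separately constructed Index objects are different Python objects yet still compare equal, and assignment aligns on the labels either way. The variable names here are illustrative only.

```python
import pandas as pd

i1 = pd.Index([0, 1, 2, 3])
i2 = pd.Index([0, 1, 2, 3])  # separate object holding the same labels

same_object = i1 is i2       # identity, which is what id() compares
same_labels = i1.equals(i2)  # label-wise equality

df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]}, index=i1)
s = pd.Series([10, 20, 30, 40], index=i2)

# Alignment uses the labels, not the identity of the Index object.
df["B"] = s
```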

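Try this entire exercise with labels that are not numbers, or with an offset (rather than starting from 0), and you will see the alignment at work.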
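A sketch of that exercise with string labels follows, again assuming a recent pandas release and wrapping the `np.where` result in a Series by hand. The frame contents and the labels `"w"` through `"z"` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"D": [0, 15, 6, 18]}, index=["w", "x", "y", "z"])
sub = df.loc[["x", "y", "z"]].copy()

flags = pd.Series(np.where(df["D"] > 12, -1, 1), index=df.index)

# Aligned by label: "x" -> -1, "y" -> 1, "z" -> -1.
sub["aligned"] = flags

# After reset_index the labels become 0..3, none of which exist in sub,
# so alignment finds no matches and the whole column is NaN.
sub["reset"] = flags.reset_index(drop=True)
```

This also suggests why the `reset_index` experiment in the question appeared to change nothing: the original index already ran from 0, so the reset labels coincided with the existing ones.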

score 0 · Accepted Answer

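The ValueError happens because where constructs an ndarray of strings, but isnullobj expects an array of object dtype.
In version 0.8.1, the where assignment works because the right-hand side of the expression (the where computation) returns a Series, which can be reindexed to match the smaller index of the DataFrame on the left.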
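The dtype distinction can be inspected directly. The sketch below assumes a recent NumPy and pandas; whether casting to object dtype would have avoided the repr error in 0.8.1 specifically is an assumption that has not been verified here. The `reindex` call is shown only as an explicit form of what column assignment does with a Series.

```python
import numpy as np
import pandas as pd

d = pd.Series([0, 3, 6, 15, 18, 24], name="D")

# np.where with string choices yields a fixed-width string array.
raw = np.where(d > 12, "L", "S")
raw_kind = raw.dtype.kind  # "U" on Python 3, "S" on Python 2

# Casting to object dtype stores ordinary Python string objects instead.
as_object = raw.astype(object)
labels = pd.Series(as_object, index=d.index)

# Explicit form of the reindexing that column assignment performs.
shorter = labels.reindex(d.index[:4])
```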
