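I am using Pandas 0.8.1 in all of the examples below, but I can confirm that the same examples behave the same way for me with Pandas 0.11.
Solutions that rely on upgrading Pandas to a newer version will not work for my current problem (though please feel free to add a comment, rather than an answer, about whether this is fixed in newer Pandas versions).
I have an example Pandas DataFrame object: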
In [20]: dfrm
Out[20]:
A B C D
0 1.202034 -0.285256 0.392160 0
1 1.799628 -0.169389 -0.305984 3
2 1.262144 -1.165034 -1.780316 6
3 -0.355975 1.610605 1.298506 None
4 -0.139220 0.024292 0.132928 12
5 0.921821 -0.109189 -0.539100 15
6 0.987901 -1.253987 -1.139684 18
7 2.170929 0.520814 -0.139740 NaN
8 -2.329704 -0.475419 1.473144 24
9 1.161275 0.918900 -1.077892 27
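First, I am a bit confused by the type-related error I am seeing. If I try to use numpy.where to create string labels for different subsets of a particular column, it looks like the string nature of the labels produces an error.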
In [21]: np.where(dfrm['D'] > 12, 'L', 'S')
Out[21]: ---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-21-a40c5cd8713c> in <module>()
----> 1 np.where(dfrm['D'] > 12, 'L', 'S')
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/IPython/core/displayhook.pyc in __call__(self, result)
236 self.start_displayhook()
237 self.write_output_prompt()
--> 238 format_dict = self.compute_format_data(result)
239 self.write_format_data(format_dict)
240 self.update_user_ns(result)
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/IPython/core/displayhook.pyc in compute_format_data(self, result)
148 MIME type representation of the object.
149 """
--> 150 return self.shell.display_formatter.format(result)
151
152 def write_format_data(self, format_dict):
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/IPython/core/formatters.pyc in format(self, obj, include, exclude)
124 continue
125 try:
--> 126 data = formatter(obj)
127 except:
128 # FIXME: log the exception
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/IPython/core/formatters.pyc in __call__(self, obj)
445 type_pprinters=self.type_printers,
446 deferred_pprinters=self.deferred_printers)
--> 447 printer.pretty(obj)
448 printer.flush()
449 return stream.getvalue()
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/IPython/lib/pretty.pyc in pretty(self, obj)
358 if callable(meth):
359 return meth(obj, self, cycle)
--> 360 return _default_pprint(obj, self, cycle)
361 finally:
362 self.end_group()
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/IPython/lib/pretty.pyc in _default_pprint(obj, p, cycle)
478 if getattr(klass, '__repr__', None) not in _baseclass_reprs:
479 # A user-provided repr.
--> 480 p.text(repr(obj))
481 return
482 p.begin_group(1, '<')
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/series.pyc in __repr__(self)
772 result = self._get_repr(print_header=True,
773 length=len(self) > 50,
--> 774 name=True)
775 else:
776 result = '%s' % ndarray.__repr__(self)
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/series.pyc in _get_repr(self, name, print_header, length, na_rep, float_format)
833 length=length, na_rep=na_rep,
834 float_format=float_format)
--> 835 return formatter.to_string()
836
837 def __str__(self):
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/format.pyc in to_string(self)
109
110 fmt_index, have_header = self._get_formatted_index()
--> 111 fmt_values = self._get_formatted_values()
112
113 maxlen = max(len(x) for x in fmt_index)
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/format.pyc in _get_formatted_values(self)
100 return format_array(self.series.values, None,
101 float_format=self.float_format,
--> 102 na_rep=self.na_rep)
103
104 def to_string(self):
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/format.pyc in format_array(values, formatter, float_format, na_rep, digits, space, justify)
460 justify=justify)
461
--> 462 return fmt_obj.get_result()
463
464
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/format.pyc in get_result(self)
479 fmt_values = self._format_strings(use_unicode=True)
480 else:
--> 481 fmt_values = self._format_strings(use_unicode=False)
482
483 return _make_fixed_width(fmt_values, self.justify)
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/format.pyc in _format_strings(self, use_unicode)
512 vals = self.values
513
--> 514 is_float = lib.map_infer(vals, com.is_float) & notnull(vals)
515 leading_space = is_float.any()
516
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/common.pyc in notnull(obj)
100 boolean ndarray or boolean
101 '''
--> 102 res = isnull(obj)
103 if np.isscalar(res):
104 return not res
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/common.pyc in isnull(obj)
58 from pandas.core.generic import PandasObject
59 if isinstance(obj, np.ndarray):
---> 60 return _isnull_ndarraylike(obj)
61 elif isinstance(obj, PandasObject):
62 # TODO: optimize for DataFrame, etc.
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/common.pyc in _isnull_ndarraylike(obj)
75 shape = values.shape
76 result = np.empty(shape, dtype=bool)
---> 77 vec = lib.isnullobj(values.ravel())
78 result[:] = vec.reshape(shape)
79
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.isnullobj (pandas/src/tseries.c:5269)()
ValueError: Does not understand character buffer dtype format string ('s')
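The traceback above is raised from `Series.__repr__` while IPython displays the result, not from the `np.where` call itself. As a minimal sketch only, and assuming a recent NumPy and pandas (where `np.where` returns a plain ndarray rather than a Series; this has not been checked against Pandas 0.8.1), the string labels can be built and then wrapped in an object-dtype Series. The frame here is a reduced stand-in for `dfrm` with only column `D`, the `None` entry is replaced by `NaN`, and the names `labels` and `label_series` are hypothetical.

```python
import numpy as np
import pandas as pd

# Reduced stand-in for dfrm: only column D, with None replaced by NaN (assumption).
dfrm = pd.DataFrame({"D": [0, 3, 6, np.nan, 12, 15, 18, np.nan, 24, 27]})

# NaN > 12 evaluates to False, so missing rows receive the 'S' label.
labels = np.where(dfrm["D"] > 12, "L", "S")

# Wrap the labels in an object-dtype Series that shares dfrm's index.
label_series = pd.Series(labels, index=dfrm.index, dtype=object)
print(label_series)
```

Storing the labels as `object` keeps them as ordinary Python strings instead of a fixed-width NumPy string dtype, which is the dtype the error message complains about.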
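If I replace the strings 'L' and 'S' with integers such as -1 and 1, it works fine, so that is one workaround. The strange part is what happens when I mix the output of np.where with a DataFrame that has fewer rows.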
In [22]: dfrm1 = dfrm.ix[0:7]
In [23]: dfrm1
Out[23]:
A B C D
0 1.202034 -0.285256 0.392160 0
1 1.799628 -0.169389 -0.305984 3
2 1.262144 -1.165034 -1.780316 6
3 -0.355975 1.610605 1.298506 None
4 -0.139220 0.024292 0.132928 12
5 0.921821 -0.109189 -0.539100 15
6 0.987901 -1.253987 -1.139684 18
7 2.170929 0.520814 -0.139740 NaN
In [24]: dfrm
Out[24]:
A B C D
0 1.202034 -0.285256 0.392160 0
1 1.799628 -0.169389 -0.305984 3
2 1.262144 -1.165034 -1.780316 6
3 -0.355975 1.610605 1.298506 None
4 -0.139220 0.024292 0.132928 12
5 0.921821 -0.109189 -0.539100 15
6 0.987901 -1.253987 -1.139684 18
7 2.170929 0.520814 -0.139740 NaN
8 -2.329704 -0.475419 1.473144 24
9 1.161275 0.918900 -1.077892 27
**Why does the following line work without any error?**
In [25]: dfrm1['E'] = np.where(dfrm['D'] > 12, -1, 1)
In [26]: dfrm1
Out[26]:
A B C D E
0 1.202034 -0.285256 0.392160 0 1
1 1.799628 -0.169389 -0.305984 3 1
2 1.262144 -1.165034 -1.780316 6 1
3 -0.355975 1.610605 1.298506 None 1
4 -0.139220 0.024292 0.132928 12 1
5 0.921821 -0.109189 -0.539100 15 -1
6 0.987901 -1.253987 -1.139684 18 -1
7 2.170929 0.520814 -0.139740 NaN 1
Even if I save the output of np.where first (it will not have the right number of rows for the smaller DataFrame dfrm1), using the saved object also works.
In [28]: tmp = np.where(dfrm['D'] > 12, -1, 1)
In [29]: tmp
Out[29]:
0 1
1 1
2 1
3 1
4 1
5 -1
6 -1
7 1
8 -1
9 -1
Name: D
In [30]: dfrm1['F'] = tmp
In [31]: dfrm1
Out[31]:
A B C D E F
0 1.202034 -0.285256 0.392160 0 1 1
1 1.799628 -0.169389 -0.305984 3 1 1
2 1.262144 -1.165034 -1.780316 6 1 1
3 -0.355975 1.610605 1.298506 None 1 1
4 -0.139220 0.024292 0.132928 12 1 1
5 0.921821 -0.109189 -0.539100 15 -1 -1
6 0.987901 -1.253987 -1.139684 18 -1 -1
7 2.170929 0.520814 -0.139740 NaN 1 1
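I thought this might be because Pandas somehow shares metadata about the Index objects, and perhaps truncates data on insertion when the data comes from an object with the same index. To check, I reset the index before assigning: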
In [33]: tmp1 = tmp.reset_index(drop=True)
In [34]: dfrm1['G'] = tmp1
In [35]: dfrm1
Out[35]:
A B C D E F G
0 1.202034 -0.285256 0.392160 0 1 1 1
1 1.799628 -0.169389 -0.305984 3 1 1 1
2 1.262144 -1.165034 -1.780316 6 1 1 1
3 -0.355975 1.610605 1.298506 None 1 1 1
4 -0.139220 0.024292 0.132928 12 1 1 1
5 0.921821 -0.109189 -0.539100 15 -1 -1 -1
6 0.987901 -1.253987 -1.139684 18 -1 -1 -1
7 2.170929 0.520814 -0.139740 NaN 1 1 1
But even after looking at the specific object IDs of the Index objects, there is no clear pattern.
In [36]: id(tmp.index)
Out[36]: 96118016
In [37]: id(tmp1.index)
Out[37]: 104735160
In [38]: id(dfrm.index)
Out[38]: 96118016
In [39]: id(dfrm1.index)
Out[39]: 104322304
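One behavior that would be consistent with these observations is alignment on index labels rather than on Index object identity. The sketch below assumes a recent pandas and uses small hypothetical frames and Series (`small`, `longer`, `reversed_labels`, `partial`), not the data above. It suggests why a reset index with the same labels 0 through 9 would still be accepted.

```python
import pandas as pd

small = pd.DataFrame({"A": [10.0, 11.0, 12.0]}, index=[0, 1, 2])

# Longer Series: labels 3 and 4 have no matching row in `small` and are dropped.
longer = pd.Series([1, 1, -1, -1, -1], index=[0, 1, 2, 3, 4])
small["E"] = longer

# Same labels in reverse order: values follow their labels, not their position.
reversed_labels = pd.Series([100, 200, 300], index=[2, 1, 0])
small["F"] = reversed_labels

# Partially overlapping labels: rows without a matching label become NaN.
partial = pd.Series([7, 8], index=[1, 5])
small["G"] = partial

print(small)
```

In this sketch the assignment never compares lengths. Only the labels decide which values end up in which row.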
Note that if I simply try to assign an array of data with the wrong length, it fails:
In [40]: dfrm1['H'] = np.arange(10)
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-40-987f4eb97131> in <module>()
----> 1 dfrm1['H'] = np.arange(10)
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/frame.pyc in __setitem__(self, key, value)
1710 else:
1711 # set column
-> 1712 self._set_item(key, value)
1713
1714 def _boolean_set(self, key, value):
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/frame.pyc in _set_item(self, key, value)
1749 ensure homogeneity.
1750 """
-> 1751 value = self._sanitize_column(key, value)
1752 NDFrame._set_item(self, key, value)
1753
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/frame.pyc in _sanitize_column(self, key, value)
1778 value = value.reindex(self.index).values
1779 else:
-> 1780 assert(len(value) == len(self.index))
1781
1782 if not isinstance(value, np.ndarray):
AssertionError:
In [41]: dfrm1['H'] = np.arange(8)
In [42]: dfrm1
Out[42]:
A B C D E F G H
0 1.202034 -0.285256 0.392160 0 1 1 1 0
1 1.799628 -0.169389 -0.305984 3 1 1 1 1
2 1.262144 -1.165034 -1.780316 6 1 1 1 2
3 -0.355975 1.610605 1.298506 None 1 1 1 3
4 -0.139220 0.024292 0.132928 12 1 1 1 4
5 0.921821 -0.109189 -0.539100 15 -1 -1 -1 5
6 0.987901 -1.253987 -1.139684 18 -1 -1 -1 6
7 2.170929 0.520814 -0.139740 NaN 1 1 1 7
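The frames visible in the traceback above show two branches in `_sanitize_column`: one that calls `value.reindex(self.index).values` and one that asserts `len(value) == len(self.index)`. The condition that selects between them is not shown, so the following is only a sketch of the presumed distinction between a Series and a bare ndarray, assuming a recent pandas. On recent versions `np.where` returns a plain ndarray, so the Series case is built explicitly here. The names `small` and `values` are hypothetical, and the exception type differs by version (AssertionError in the traceback above, ValueError in recent releases).

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({"A": np.arange(8.0)})  # 8 rows, labels 0..7
values = np.arange(10)                       # 10 values, too long for `small`

# A Series carries labels, so assignment can reindex it to the frame's index.
small["from_series"] = pd.Series(values)

# A bare ndarray carries no labels, so only its length can be checked.
try:
    small["from_array"] = values
except (ValueError, AssertionError) as exc:
    print("length mismatch rejected:", type(exc).__name__)

print(small.columns.tolist())
```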
Why is the output of np.where handled differently?