In graphlab, I have the following SFrame called train:

import graphlab
train = graphlab.read_csv('clean_train.csv')
train.head()

[out]:

+-------+------------+---------+-----------+
| Store |    Date    |  Sales  | Customers |
+-------+------------+---------+-----------+
|   1   | 2015-07-31 |  5263.0 |   555.0   |
|   2   | 2015-07-31 |  6064.0 |   625.0   |
|   3   | 2015-07-31 |  8314.0 |   821.0   |
|   4   | 2015-07-31 | 13995.0 |   1498.0  |
|   3   | 2015-07-20 |  4822.0 |   559.0   |
|   2   | 2015-07-10 |  5651.0 |   589.0   |
|   4   | 2015-07-11 | 15344.0 |   1414.0  |
|   5   | 2015-07-23 |  8492.0 |   833.0   |
|   2   | 2015-07-19 |  8565.0 |   687.0   |
|   10  | 2015-07-09 |  7185.0 |   681.0   |
+-------+------------+---------+-----------+
[986159 rows x 4 columns]

To get the median sales for each store, I can do the following in graphlab, which appends a new column holding each store's median sales:

import graphlab.aggregate as agg

mediansales_perstore = train.groupby('Store', operations={'mediansales': agg.QUANTILE('Sales', 0.5)})
train = train.join(mediansales_perstore, on='Store')
# QUANTILE returns an array per row; keep its single element
train['mediansales'] = [i[0] for i in train['mediansales']]

This code works in graphlab: a new column, mediansales, is added. But when I try the equivalent with a pandas DataFrame:

mediansales_perstore = train.groupby(['Store'])['Sales'].median()

This extracts the median sales for each store, just as the graphlab code does, but when I try to join it back onto the train matrix:

train.join(pd.DataFrame(train.groupby(['Store'])['Sales'].median()), on='Store')

it fails and throws this error:

ValueError                                Traceback (most recent call last)
<ipython-input-15-7b64cb46e386> in <module>()
----> 1 train.join(pd.DataFrame(train.groupby(['Store'])['Sales'].median()), on='Store')

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in join(self, other, on, how, lsuffix, rsuffix, sort)
   4017         # For SparseDataFrame's benefit
   4018         return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,
-> 4019                                  rsuffix=rsuffix, sort=sort)
   4020 
   4021     def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
   4031             return merge(self, other, left_on=on, how=how,
   4032                          left_index=on is None, right_index=True,
-> 4033                          suffixes=(lsuffix, rsuffix), sort=sort)
   4034         else:
   4035             if on is not None:

/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
     36                          right_index=right_index, sort=sort, suffixes=suffixes,
     37                          copy=copy)
---> 38     return op.get_result()
     39 if __debug__:
     40     merge.__doc__ = _merge_doc % '\nleft : DataFrame'

/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.pyc in get_result(self)
    190 
    191         llabels, rlabels = items_overlap_with_suffix(ldata.items, lsuf,
--> 192                                                      rdata.items, rsuf)
    193 
    194         lindexers = {1: left_indexer} if left_indexer is not None else {}

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in items_overlap_with_suffix(left, lsuffix, right, rsuffix)
   3969         if not lsuffix and not rsuffix:
   3970             raise ValueError('columns overlap but no suffix specified: %s' %
-> 3971                              to_rename)
   3972 
   3973         def lrenamer(x):

ValueError: columns overlap but no suffix specified: Index([u'Sales'], dtype='object')

How can I merge the median of the 'Sales' column back in pandas, using 'Store' as the key? The graphlab code works, though.

1 Answer

You can do this in a single step using transform:

>>> train['Median-Sales'] = train.groupby('Store')['Sales'].transform('median')
>>> train
   Store        Date  Sales  Customers  Median-Sales
0      1  2015-07-31   5263        555        5263.0
1      2  2015-07-31   6064        625        6064.0
2      3  2015-07-31   8314        821        6568.0
3      4  2015-07-31  13995       1498       14669.5
4      3  2015-07-20   4822        559        6568.0
5      2  2015-07-10   5651        589        6064.0
6      4  2015-07-11  15344       1414       14669.5
7      5  2015-07-23   8492        833        8492.0
8      2  2015-07-19   8565        687        6064.0
9     10  2015-07-09   7185        681        7185.0
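
This works because transform returns a Series aligned with the original row index, rather than one value per group, so it can be assigned straight back as a column. Below is a minimal, self-contained sketch of that difference. It rebuilds only the ten sample rows shown in the question (Date and Customers are left out for brevity), and the name per_store is purely illustrative:

```python
import pandas as pd

# The ten sample rows from the question (Store and Sales only).
train = pd.DataFrame({
    'Store': [1, 2, 3, 4, 3, 2, 4, 5, 2, 10],
    'Sales': [5263.0, 6064.0, 8314.0, 13995.0, 4822.0,
              5651.0, 15344.0, 8492.0, 8565.0, 7185.0],
})

# median() collapses each group to one value, indexed by Store.
per_store = train.groupby('Store')['Sales'].median()

# transform('median') broadcasts the group value back onto every row,
# keeping the original row index, so no join is needed.
train['Median-Sales'] = train.groupby('Store')['Sales'].transform('median')
```

Here per_store has one entry per store, while the new column has one entry per row of train.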

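The merge error is just telling you that the left and right frames have a duplicate column name, so you need to either provide a suffix to distinguish the columns or rename the column: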

>>> right = train.groupby('Store')['Sales'].median()
>>> right.name = 'Median-Sales'
>>> train.join(right, on='Store')
   Store        Date  Sales  Customers  Median-Sales
0      1  2015-07-31   5263        555        5263.0
1      2  2015-07-31   6064        625        6064.0
2      3  2015-07-31   8314        821        6568.0
3      4  2015-07-31  13995       1498       14669.5
4      3  2015-07-20   4822        559        6568.0
5      2  2015-07-10   5651        589        6064.0
6      4  2015-07-11  15344       1414       14669.5
7      5  2015-07-23   8492        833        8492.0
8      2  2015-07-19   8565        687        6064.0
9     10  2015-07-09   7185        681        7185.0
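
The other option mentioned above, providing a suffix, can be sketched as follows. This is a small illustration on a made-up four-row frame, not your actual data; the suffix '_median' and the names medians, raised and joined are arbitrary choices for the example:

```python
import pandas as pd

# A tiny stand-in for the train frame (hypothetical subset of rows).
train = pd.DataFrame({
    'Store': [1, 2, 3, 3],
    'Sales': [5263.0, 6064.0, 8314.0, 4822.0],
})

# Per-store medians as a one-column DataFrame; the column is still named 'Sales'.
medians = train.groupby('Store')['Sales'].median().to_frame()

# Both frames now have a 'Sales' column, so a plain join raises ValueError.
try:
    train.join(medians, on='Store')
    raised = False
except ValueError:
    raised = True

# rsuffix is appended only to the overlapping column from the right frame.
joined = train.join(medians, on='Store', rsuffix='_median')
```

With rsuffix set, the left frame's Sales column keeps its name and the joined medians arrive as Sales_median.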
answered 2015-11-22T21:43:24.097