Minimal example:
Consider this DataFrame, temp:
>>> import numpy as np
>>> import pandas as pd
>>> temp = pd.DataFrame({"A":[1,2,3,4,5,6,7,8,9,10],"B":[2,3,4,5,6,7,8,9,10,11],"C":[3,4,5,6,7,8,9,10,11,12]})
>>> temp
A B C
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
7 8 9 10
8 9 10 11
9 10 11 12
Now, try shuffling each column, one at a time, in a for loop.
>>> for i in temp.columns:
...     np.random.shuffle(temp.loc[:,i])
...     print(temp)
...
A B C
0 8 2 3
1 3 3 4
2 9 4 5
3 6 5 6
4 4 6 7
5 10 7 8
6 7 8 9
7 1 9 10
8 2 10 11
9 5 11 12
A B C
0 8 7 3
1 3 9 4
2 9 8 5
3 6 10 6
4 4 4 7
5 10 11 8
6 7 5 9
7 1 3 10
8 2 2 11
9 5 6 12
A B C
0 8 7 6
1 3 9 8
2 9 8 4
3 6 10 10
4 4 4 7
5 10 11 11
6 7 5 5
7 1 3 3
8 2 2 12
9 5 6 9
This works perfectly.
Concrete example:
Now, if I want to take a portion of this DataFrame for training and testing, I use train_test_split from sklearn.model_selection.
>>> from sklearn.model_selection import train_test_split
>>> temp = pd.DataFrame({"A":[1,2,3,4,5,6,7,8,9,10],"B":[2,3,4,5,6,7,8,9,10,11],"C":[3,4,5,6,7,8,9,10,11,12]})
>>> y = [i for i in range(16,26)]
>>> len(y)
10
>>> X_train,X_test,y_train,y_test = train_test_split(temp,y,test_size=0.2)
>>> X_train
A B C
2 3 4 5
6 7 8 9
8 9 10 11
0 1 2 3
7 8 9 10
3 4 5 6
1 2 3 4
9 10 11 12
Now we have our X_train DataFrame. To shuffle each of its columns:
>>> for i in X_train.columns:
...     np.random.shuffle(X_train.loc[:,i])
...     print(X_train)
...
Unfortunately, this results in an error.
Error:
sys:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "mtrand.pyx", line 4852, in mtrand.RandomState.shuffle
File "mtrand.pyx", line 4855, in mtrand.RandomState.shuffle
File "C:\Users\H.P\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\series.py", line 623, in __getitem__
result = self.index.get_value(self, key)
File "C:\Users\H.P\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexes\base.py", line 2560, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas\_libs\index.pyx", line 83, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 91, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 811, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 817, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 4
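One plausible reading of the traceback: np.random.shuffle indexes its argument with plain integers, and on a Series with an integer index those integers are looked up as row labels rather than positions. With temp the default labels 0 to 9 coincide with the positions, but X_train keeps only the labels that survived the split, so label 4 is missing. The sketch below illustrates that lookup behaviour on a small hypothetical Series (the name col and its values are made up for illustration; this is not the actual X_train):

```python
import pandas as pd

# Hypothetical stand-in for one column of X_train: the index holds the
# original row labels that survived the split, so label 4 is absent.
col = pd.Series([3, 7, 9, 1], index=[2, 6, 8, 0])

try:
    col[4]  # an integer key on an integer index is treated as a label
    missing_label = False
except KeyError:
    missing_label = True

by_position = col.iloc[3]  # positional access ignores the labels
by_label = col.loc[0]      # label 0 happens to sit at position 3

print(missing_label, by_position, by_label)
```

If this reading is right, the first case works only because the labels and positions of temp happen to match.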
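Tracing the problem and its solution:
Searching for SettingWithCopyWarning, I found this question, whose first answer contains this line:
"However, it could create a copy which updates a copy of data['amount'] which you would not see. Then you would be wondering why it is not updating."
But if that is the case, why did the code work in the first case?
The answer also says:
"Pandas returns a copy of an object in almost all method calls. The inplace operations are a convenience operation which work, but in general are not clear that data is being modified and could potentially work on copies."
So, instead of np.random.shuffle, we can use np.random.permutation, as shown in this answer. So: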
>>> for i in X_train.columns:
...     X_train.loc[:,i] = np.random.permutation(X_train.loc[:,i])
...     print(X_train)
...
However, I get the SettingWithCopyWarning again, although this time I also get the shuffled result.
C:\Users\H.P\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexing.py:621: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item_labels[indexer[info_axis]]] = value
A B C
2 10 4 5
6 9 8 9
8 2 10 11
0 8 2 3
7 1 9 10
3 3 5 6
1 4 3 4
9 7 11 12
A B C
2 10 5 5
6 9 11 9
8 2 4 11
0 8 9 3
7 1 3 10
3 3 8 6
1 4 10 4
9 7 2 12
A B C
2 10 5 10
6 9 11 5
8 2 4 11
0 8 9 3
7 1 3 4
3 3 8 6
1 4 10 12
9 7 2 9
This might serve as a workaround.
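The warning suggests that pandas still regards X_train as derived from temp. A minimal sketch of one way this is commonly avoided, assuming an explicit .copy() is acceptable here: take an independent copy of the split result, then assign a permuted plain NumPy array back to each column. To keep the sketch self-contained, the split is imitated with iloc using the row order printed above; in practice the same .copy() would be applied to the frame returned by train_test_split. The seed and the loop variable name are arbitrary choices.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({
    "A": list(range(1, 11)),
    "B": list(range(2, 12)),
    "C": list(range(3, 13)),
})

# Stand-in for the train_test_split result (row order taken from the
# X_train output above); .copy() makes it an independent frame.
X_train = temp.iloc[[2, 6, 8, 0, 7, 3, 1, 9]].copy()

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
for name in X_train.columns:
    # to_numpy() drops the labels, so the permutation is purely positional
    X_train[name] = rng.permutation(X_train[name].to_numpy())

print(X_train)
```

The row labels of X_train stay as they were; only the values within each column are rearranged, and temp is left untouched.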
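Questions:
- Why does the code work in the first case but not in the second, when I use train_test_split?
- Why do I still get the SettingWithCopyWarning when I am not using the in-place shuffler np.random.shuffle?
Request for suggestions:
- Is there a better (easier to use / error-free / faster) way to shuffle the columns?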
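One candidate for a loop-free version, offered as a sketch rather than a definitive answer: NumPy's Generator.permuted (available in NumPy 1.20 and later) shuffles each slice along an axis independently, so axis=0 permutes every column on its own. The helper name shuffle_columns and its seed parameter are hypothetical, introduced only for this illustration.

```python
import numpy as np
import pandas as pd

def shuffle_columns(df, seed=None):
    """Return a new DataFrame with each column shuffled independently."""
    rng = np.random.default_rng(seed)
    # permuted() returns a new array; with axis=0 each column gets its own
    # permutation. Mixed-dtype frames would be converted to object dtype.
    shuffled = rng.permuted(df.to_numpy(), axis=0)
    return pd.DataFrame(shuffled, index=df.index, columns=df.columns)

temp = pd.DataFrame({
    "A": list(range(1, 11)),
    "B": list(range(2, 12)),
    "C": list(range(3, 13)),
})
result = shuffle_columns(temp, seed=0)
print(result)
```

Because the helper builds a fresh DataFrame instead of writing into a slice, it should not trigger SettingWithCopyWarning, and the input frame is not modified.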