python - How to split data into train and test sets while respecting a groupby column in pandas?

Question

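I want to split a dataset into test and train sets in a 20:80 ratio. However, I don't want the split to put some data points of a given S_Id value in the train set and the rest of that S_Id's data points in the test set.

I have a dataset: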

S_Id      Datetime               Item      
1         29-06-2018 03:23:00    654
1         29-06-2018 04:01:00    452
1         29-06-2018 04:25:00    101
2         30-06-2018 05:17:00    088
2         30-06-2018 05:43:00    131
3         30-06-2018 10:36:00    013
3         30-06-2018 11:19:00    092

I would like to split it cleanly into:

Train:

S_Id      Datetime               Item      
1         29-06-2018 03:23:00    654
1         29-06-2018 04:01:00    452
1         29-06-2018 04:25:00    101
2         30-06-2018 05:17:00    088
2         30-06-2018 05:43:00    131

Test:

S_Id      Datetime               Item 
3         30-06-2018 10:36:00    013
3         30-06-2018 11:19:00    092

All rows with the same S_Id must end up in the same set. Can this be done with a simple 'groupby'?

Thank you for your help!

score 1 · Accepted Answer

I don't believe there is a direct function for doing this, so you can write a custom one:

import numpy as np

def sample_(we_array, train_size):
    """
    we_array : weight of each unique element in your S_Id column,
    normalized so that the weights sum to 1 (a probability).
    """
    idx = np.arange(we_array.size)     # index of each element in the array
    np.random.shuffle(idx)             # shuffle it in place
    cum = we_array[idx].cumsum()       # cumulative weight in shuffled order
    train_idx = idx[cum < train_size]  # take the first elements until we
                                       # reach the desired size
    test_idx = idx[cum >= train_size]  # everything else goes to test
    return train_idx, test_idx

idx = df.S_Id.values
# unique elements of S_Id and their counts
unique, counts = np.unique(idx, return_counts=True)
probability = counts / counts.sum()
train_idx, test_idx = sample_(probability, 0.8)
train = df[df.S_Id.isin(unique[train_idx])]
test = df[df.S_Id.isin(unique[test_idx])]
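
For reference, here is a minimal self-contained sketch of the same cumulative-weight idea, run on the sample data from the question. The helper name `split_by_group` and its `seed` parameter are illustrative assumptions, not part of pandas or NumPy, and the Datetime column is kept as plain strings for brevity.

```python
import numpy as np
import pandas as pd

def split_by_group(df, group_col, train_size=0.8, seed=0):
    """Hypothetical helper: keep every value of group_col in a single set."""
    rng = np.random.default_rng(seed)
    # Fraction of all rows that each group accounts for.
    counts = df[group_col].value_counts()
    groups = counts.index.to_numpy()
    fractions = counts.to_numpy() / counts.to_numpy().sum()
    # Visit groups in random order; a group goes to train while the
    # cumulative row fraction stays below train_size.
    order = rng.permutation(len(groups))
    cum = fractions[order].cumsum()
    train_groups = groups[order][cum < train_size]
    in_train = df[group_col].isin(train_groups)
    return df[in_train], df[~in_train]

df = pd.DataFrame({
    "S_Id": [1, 1, 1, 2, 2, 3, 3],
    "Datetime": ["29-06-2018 03:23:00", "29-06-2018 04:01:00",
                 "29-06-2018 04:25:00", "30-06-2018 05:17:00",
                 "30-06-2018 05:43:00", "30-06-2018 10:36:00",
                 "30-06-2018 11:19:00"],
    "Item": ["654", "452", "101", "088", "131", "013", "092"],
})

train, test = split_by_group(df, "S_Id", train_size=0.8)
print(sorted(train["S_Id"].unique()), sorted(test["S_Id"].unique()))
```

With only three groups the achieved ratio is coarse: two S_Id values land in train and one in test, whichever order the shuffle produces. The split approaches 80:20 only when there are many groups of similar size.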

score 0 · Accepted Answer

If S_Id is the index of the DataFrame, you can simply use:

df.loc[3]

If it is not, you can set it as the index and then use loc, like this:

df.set_index('S_Id').loc[3]

This should return a DataFrame containing all rows where S_Id equals 3.
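
As a rough sketch of how this selection could be turned into a split, the snippet below picks the test and train rows by passing lists of S_Id values to `loc`. The choice of which ids go where (3 for test, 1 and 2 for train) is hand-picked here purely for illustration, and the small DataFrame is a cut-down version of the question's sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "S_Id": [1, 1, 1, 2, 2, 3, 3],
    "Item": ["654", "452", "101", "088", "131", "013", "092"],
})

indexed = df.set_index("S_Id")

# A list selector returns a DataFrame with every row carrying those labels.
test = indexed.loc[[3]]
train = indexed.loc[[1, 2]]

# Restore S_Id as an ordinary column if needed.
train = train.reset_index()
test = test.reset_index()
```

Note that `loc` only selects rows for the ids you give it. Deciding which ids belong to which set still has to be done separately, for example by shuffling the unique S_Id values.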
