7

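In numpy I have a dataset like this. The first two columns are indices. I can divide my dataset into blocks via the indices, i.e. the first block is 0 0, the second block is 0 1, the third block is 0 2, then 1 0, 1 1, 1 2 and so on. Each block has at least two elements. The numbers in the index columns can vary.

I need to split the dataset along these blocks 80%-20% at random, such that after the split each block has at least 1 element in both datasets. How could I do that?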

indices | real data
        |
0   0   | 43.25 665.32 ...  } 1st block
0   0   | 11.234            }
0   1     ...               } 2nd block
0   1                       } 
0   2                       } 3rd block
0   2                       }
1   0                       } 4th block
1   0                       }
1   0                       }
1   1                       ...
1   1                       
1   2
1   2
2   0
2   0 
2   1
2   1
2   1
...
4

4 Answers

5

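See if you like this. To introduce randomness, I shuffle the entire dataset. It is the only way I could figure out to do the splitting in a vectorized fashion. Maybe you could simply shuffle an index array instead, but that was one indirection too many for my brain today. I have also used a structured array, to make it easier to extract the blocks. First, let's create a sample dataset: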

from __future__ import division
import numpy as np

# Create a sample data set
c1, c2 = 10, 5
idx1, idx2 = np.arange(c1), np.arange(c2)
idx1, idx2 = np.repeat(idx1, c2), np.tile(idx2, c1)

items = 1000
i = np.random.randint(c1*c2, size=(items - 2*c1*c2,))
d = np.random.rand(items+5)

# np.int and np.float were removed from NumPy; the builtin types are equivalent
dataset = np.empty((items+5,), [('idx1', int), ('idx2', int),
                             ('data', float)])
dataset['idx1'][:2*c1*c2] =  np.tile(idx1, 2)
dataset['idx1'][2*c1*c2:-5] = idx1[i]
dataset['idx2'][:2*c1*c2] = np.tile(idx2, 2)
dataset['idx2'][2*c1*c2:-5] = idx2[i]
dataset['data'] = d
# Add blocks with only 2 and only 3 elements to test corner case
dataset['idx1'][-5:] = -1
dataset['idx2'][-5:] = [0] * 2 + [1]*3

And now the stratified sampling:

# For randomness, shuffle the entire array
np.random.shuffle(dataset)

# inverse[i] is the block number of row i
blocks, inverse = np.unique(dataset[['idx1', 'idx2']], return_inverse=True)
block_count = np.bincount(inverse)
where = np.argsort(inverse)
block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))

# If we have n elements in a block, and we assign 1 to each array, we
# are left with only n-2. If we randomly assign a fraction x of these
# to the first array, the expected ratio of items will be
# (x*(n-2) + 1) : ((1-x)*(n-2) + 1)
# Setting the ratio equal to 4 (80/20) and solving for x, we get
# x = 4/5 + 3/5/(n-2)

with np.errstate(divide='ignore'): # n == 2 divides by zero; the clip below handles the inf
    x = 4/5 + 3/5/(block_count - 2)
x = np.clip(x, 0, 1) # if n in (2, 3), the ratio is larger than 1
threshold = np.repeat(x, block_count)
threshold[block_start] = 1 # first item goes to A
threshold[block_start + 1] = 0 # second item goes to B

a_idx = threshold > np.random.rand(len(dataset))

A = dataset[where[a_idx]]
B = dataset[where[~a_idx]]
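
The formula in the code comments can be written out step by step. Here n is the number of rows in a block and x is the probability that one of the remaining n - 2 rows goes to A; this is only a restatement of the comments, not an extra result:

```latex
\frac{x(n-2)+1}{(1-x)(n-2)+1} = 4
\;\Longrightarrow\; x(n-2)+1 = 4(1-x)(n-2)+4
\;\Longrightarrow\; 5x(n-2) = 4(n-2)+3
\;\Longrightarrow\; x = \frac{4}{5} + \frac{3}{5(n-2)}
```

For example, n = 12 gives x = 0.86, so the expected sizes are 0.86 * 10 + 1 = 9.6 rows in A and 0.14 * 10 + 1 = 2.4 rows in B, a ratio of 4. For n = 3 the formula gives x = 1.4 and for n = 2 it is undefined, which is why x is clipped to [0, 1].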

After running it, the split is roughly 80/20, and every block is represented in both arrays:

>>> len(A)
815
>>> len(B)
190
>>> np.all(np.unique(A[['idx1', 'idx2']]) == np.unique(B[['idx1', 'idx2']]))
True
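
The check above only shows that every block appears on both sides. To see how close each individual block is to 80/20, a small helper along these lines could be used. It is a sketch, not part of the answer: `block_fractions` is a hypothetical name, and it assumes the index columns are passed as plain 2-D integer arrays (for the structured arrays above, something like `np.column_stack((A['idx1'], A['idx2']))` would build them).

```python
import numpy as np

def block_fractions(keys_a, keys_b):
    """Return (blocks, fraction of each block's rows that ended up in A).

    keys_a, keys_b: 2-D integer arrays holding the index columns of A and B.
    """
    keys = np.vstack((keys_a, keys_b))
    blocks, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()  # some NumPy versions return a 2-D inverse here
    n_a = len(keys_a)
    # Rows per block on each side; minlength keeps both arrays aligned
    count_a = np.bincount(inverse[:n_a], minlength=len(blocks))
    count_b = np.bincount(inverse[n_a:], minlength=len(blocks))
    return blocks, count_a / (count_a + count_b)

# Tiny hand-made split: block (0, 0) has 4 rows in A and 1 in B,
# block (0, 1) has 1 row in A and 1 in B.
keys_a = np.array([[0, 0], [0, 0], [0, 0], [0, 0], [0, 1]])
keys_b = np.array([[0, 0], [0, 1]])
blocks, frac = block_fractions(keys_a, keys_b)
print(blocks)
print(frac)
```

A fraction of exactly 0 or 1 for any block would mean that block is missing from one side of the split.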
Answered 2013-04-05T18:35:45.450
0

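Here is an alternative solution. I'm open to code review if there is a way to implement this in a more numpy-like fashion (without for loops). @Jamie's answer is really good, it's just that it sometimes produces skewed ratios within a block of data.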

    import numpy as np

    # data is assumed to be a 2-D array whose first two columns are the indices
    ratio = 0.8
    IDX1 = 0
    IDX2 = 1
    # Iterate over the index values actually present in each column
    idx1s = np.unique(data[:, IDX1])
    idx2s = np.unique(data[:, IDX2])
    train_parts = []
    valid_parts = []
    for i1 in idx1s:
        for i2 in idx2s:
            # Row numbers belonging to the block (i1, i2)
            rows = np.nonzero((data[:, IDX1] == i1) & (data[:, IDX2] == i2))[0]
            if len(rows) == 0:
                continue
            # Shuffle the row numbers so the split inside the block is random
            np.random.shuffle(rows)
            # Rows for the training set, keeping at least one row on each side
            n_train = int(np.around(len(rows) * ratio))
            n_train = min(max(n_train, 1), len(rows) - 1)
            train_parts.append(data[rows[:n_train]])
            valid_parts.append(data[rows[n_train:]])
    train = np.vstack(train_parts)
    valid = np.vstack(valid_parts)
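
On the question of doing this without for loops: one possible sketch sorts the rows by block with a random order inside each block, then takes the first part of every block for training. The function name `stratified_split` and its parameters are hypothetical, and the sketch assumes a 2-D array whose key columns identify the block and in which every block has at least two rows.

```python
import numpy as np

def stratified_split(data, key_cols=(0, 1), ratio=0.8, rng=None):
    """Split the rows of a 2-D array into (train, valid), block by block."""
    rng = np.random.default_rng() if rng is None else rng
    keys = data[:, list(key_cols)]
    _, inverse, counts = np.unique(keys, axis=0, return_inverse=True,
                                   return_counts=True)
    inverse = inverse.ravel()  # some NumPy versions return a 2-D inverse here
    # Sort rows by block number, with a random order inside each block
    order = np.lexsort((rng.random(len(data)), inverse))
    # Position of each sorted row inside its own block: 0, 1, 2, ...
    starts = np.cumsum(counts) - counts
    rank = np.arange(len(data)) - np.repeat(starts, counts)
    # Training rows per block: rounded share, at least 1 and at most n - 1
    n_train = np.clip(np.rint(counts * ratio).astype(int), 1, counts - 1)
    to_train = rank < np.repeat(n_train, counts)
    return data[order[to_train]], data[order[~to_train]]

# Three blocks with 5, 2 and 10 rows, plus one column of made-up data
rng = np.random.default_rng(0)
keys = np.repeat([[0, 0], [0, 1], [1, 0]], [5, 2, 10], axis=0)
data = np.column_stack((keys, rng.random(len(keys))))
train, valid = stratified_split(data, rng=rng)
print(len(train), len(valid))
```

Because the per-block counts are fixed by rounding rather than drawn at random, the blocks of 5, 2 and 10 rows always contribute 4, 1 and 8 rows to the training set; only which rows are chosen is random.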
Answered 2013-04-08T14:27:33.967
0

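I'm assuming that each block has at least two entries, and that if it has more than two you want them split as close to 80/20 as possible. The easiest way seems to be to assign a random number to every row and then select based on the percentile within each stratum. Say this is the data in the file strat_sample.csv: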

Index_1,Index_2,Data_1,Data_2
0,0,0.614583182,0.677644482
0,0,0.321384981,0.598450854
0,0,0.303029607,0.300593782
0,0,0.646010758,0.612006715
0,0,0.484572883,0.30052535
0,1,0.010625416,0.118671475
0,1,0.428967984,0.23795173
0,1,0.523440618,0.457275922
0,1,0.379612652,0.337640868
0,1,0.338180659,0.206399031
1,0,0.079386,0.890939911
1,0,0.572864624,0.725615079
1,0,0.045891404,0.300128917
1,0,0.578792198,0.100698871
1,0,0.776485138,0.475135948
1,0,0.401850419,0.784835723
1,1,0.087660923,0.497299605
1,1,0.8460978,0.825774802
1,1,0.526015021,0.581905971
1,1,0.23324672,0.299475291

Then this code (using Pandas data structures) works as desired:

import numpy as np
import pandas as pd
# Sample data: strat_sample.csv (contents shown above)

def TreatmentOneCount(n , *args):
    #assign a minimum one to each group but as close as possible to fraction OptimalRatio in group 1. 
    OptimalRatio = args[0]
    if n < 2:
        print("N too small, assignment not defined.")
        a = np.nan
    elif n == 2:
        a = 1
    else:
        """
        There are one of two numbers that are close to the target ratio, one above, the other below
        If the number above is N and it is closest to optimal, then you need to set things to N-1 to ensure both groups have at least one member (recall n>2)
        If the number below is 0 and it is closest to optimal, then you need to set things to 1 to ensure both groups have at least one member (recall n>2)
        """
        target_assignment = OptimalRatio * n
        if target_assignment - np.floor(target_assignment) > 0.5:
            a = min(int(np.ceil(target_assignment)), n - 1)
        else:
            a = max(int(np.floor(target_assignment)), 1)
    return a


df = pd.read_csv('strat_sample.csv', sep=','  , header=0)

#assign a random number to each entry
df['RandScore'] =  np.random.uniform(0,1,df.shape[0])
df.sort_values(by=['Index_1', 'Index_2', 'RandScore'], inplace=True)

#Within each block assign a rank based on random number. 
df['RandRank'] = df.groupby(['Index_1','Index_2'])['RandScore'].rank()

#make a group index (the separator keeps e.g. (1, 11) and (11, 1) distinct)
df['MasterIdx'] = df['Index_1'].apply(str) + '_' + df['Index_2'].apply(str)

#Store the counts for members of each block
seriestest = df.groupby('MasterIdx')['RandRank'].count()
seriestest.name = "Counts"
dftest = pd.DataFrame(seriestest)

#Add the block counts to the data
df = df.merge(dftest, how='left',  left_on = 'MasterIdx', right_index= True)

#Make the actual assignments to the two groups
df['Assignment'] = (df['RandRank'] <=  df['Counts'].apply(TreatmentOneCount, args = (0.8,))) * -1 + 2
# drop returns a new frame, so assign the result back
df = df.drop(['MasterIdx', 'Counts', 'RandRank', 'RandScore'], axis=1)
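
The same idea can be sketched more compactly by grouping directly on both index columns, which avoids the string key and the merge. This is an illustrative variant, not the code above: `group_one_count` is a hypothetical helper that follows the same rounding rule, and the frame is built in memory with made-up data and the block sizes of strat_sample.csv (5, 5, 6 and 4 rows).

```python
import math

import numpy as np
import pandas as pd

def group_one_count(n, ratio=0.8):
    """Rows of an n-row block that go to group 1, keeping 1 <= count <= n - 1."""
    # ceil(t - 0.5) rounds up only when the fractional part of t exceeds 0.5
    return int(min(max(math.ceil(ratio * n - 0.5), 1), n - 1))

# Small in-memory frame standing in for strat_sample.csv (values are made up)
df = pd.DataFrame({
    'Index_1': [0] * 5 + [0] * 5 + [1] * 6 + [1] * 4,
    'Index_2': [0] * 5 + [1] * 5 + [0] * 6 + [1] * 4,
    'Data_1': np.arange(20) / 20.0,
})

rng = np.random.default_rng(0)
df['RandScore'] = rng.random(len(df))
grouped = df.groupby(['Index_1', 'Index_2'])['RandScore']
rand_rank = grouped.rank(method='first')   # 1..n inside each block
block_size = grouped.transform('count')    # n repeated for every row of the block
df['Assignment'] = np.where(rand_rank <= block_size.map(group_one_count), 1, 2)
df = df.drop(columns='RandScore')

print(df.groupby(['Index_1', 'Index_2'])['Assignment'].value_counts())
```

Here group 1 plays the role of the 80% part and group 2 the 20% part, as in the `Assignment` column above.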
Answered 2013-04-08T14:53:54.797
-3
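# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=0)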
Answered 2015-01-02T11:06:41.170