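I use the following function in R to split subjects/samples into training and test sets, and it works very well. However, in my dataset the subjects fall into two groups (patients and control subjects), so I would like to split the data while keeping the proportion of patients and controls in each of the training and test sets the same as in the full dataset. How can I do this in R? How can I modify the following function so that it takes group membership into account when splitting the data into training and test sets?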

# splitdf function will return a list of training and testing sets
splitdf <- function(dataframe, seed=NULL) { 
  if (!is.null(seed)) 
     set.seed(seed)

  index <- 1:nrow(dataframe)
  trainindex <- sample(index, trunc(length(index)/2))
  trainset <- dataframe[trainindex, ] 
  testset <- dataframe[-trainindex, ] 
  list(trainset=trainset,testset=testset) 
}

# apply the function
splits <- splitdf(Data, seed=808)

# it returns a list - two data frames called trainset and testset
str(splits)    

# there are "n" observations in each data frame
lapply(splits,nrow)   

# view the first few rows in each data frame
lapply(splits,head)   

# save the training and testing sets as data frames
training <- splits$trainset
testing <- splits$testset

Example: use the built-in iris data and split the dataset into training and test sets. The dataset has 150 samples and a factor called Species with 3 levels (setosa, versicolor, and virginica).

Load the iris data:

data(iris)

Split the dataset into training and test sets:

splits <- splitdf(iris, seed=808)

str(splits)
lapply(splits,nrow)
lapply(splits,head)
training <- splits$trainset
testing <- splits$testset
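To make the per-group counts visible, one could tabulate Species in each half. The sketch below repeats the random split inline (rather than calling splitdf) only so that it runs on its own; the name train_idx is introduced here for illustration.

```r
data(iris)

# Same unstratified 50/50 split that splitdf() performs, written inline
set.seed(808)
train_idx <- sample(seq_len(nrow(iris)), trunc(nrow(iris) / 2))
training <- iris[train_idx, ]
testing  <- iris[-train_idx, ]

# Counts per species in the full data and in each half; with a plain
# random split the halves are not guaranteed to hold 25 of each species
table(iris$Species)
table(training$Species)
table(testing$Species)
```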

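As you can see here, the function splitdf does not take the group membership Species into account when splitting the data into training and test sets, so the number of samples per species in the training and test sets is not proportional to that in the full dataset. So how can I modify the function so that it takes group membership into account when splitting the data into training and test sets?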

1 Answer

Here is a solution using plyr, with a simulated dataset.

library(plyr)
set.seed(1001)
dat = data.frame(matrix(rnorm(1000), ncol = 10), treatment = sample(c("control", "control", "treatment"), 100, replace = TRUE) )

# divide data set into training and test sets
tr_prop = 0.5    # proportion of full dataset to use for training
training_set = ddply(dat, .(treatment), function(., seed) { set.seed(seed); .[sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)
test_set = ddply(dat, .(treatment), function(., seed) { set.seed(seed); .[-sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)

# check that proportions are equal across datasets
ddply(dat, .(treatment), function(.) nrow(.)/nrow(dat) )
ddply(training_set, .(treatment), function(.) nrow(.)/nrow(training_set) )
ddply(test_set, .(treatment), function(.) nrow(.)/nrow(test_set) )
c(nrow(training_set), nrow(test_set), nrow(dat)) # lengths of sets
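Both ddply calls above reseed the generator with the same value, which is why the rows dropped by the second call are exactly the rows kept by the first. A minimal sketch of that behaviour, using the hypothetical variables first and second:

```r
# Resetting the seed before each draw makes sample() repeat itself
set.seed(101)
first <- sample(1:10, 5)

set.seed(101)
second <- sample(1:10, 5)

identical(first, second)  # TRUE
```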

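Here I use set.seed() to make sure sample() behaves identically in both ddply calls. This strikes me as a bit of a hack. Perhaps there is another way to get the same result with a single call to one of the **ply functions (but returning two data frames). Another option (without the excessive use of set.seed) is to use dlply and then piece the elements of the resulting list together into training/test sets: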

set.seed(101) # for consistency with 'ddply' above
split_set = dlply(dat, .(treatment), function(.) { s = sample(1:nrow(.), trunc(nrow(.) * tr_prop)); list(.[s, ], .[-s,]) } )
# join together with ldply()
training_set = ldply(split_set, function(.) .[[1]])
test_set = ldply(split_set, function(.) .[[2]])
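The same idea can also be folded back into the splitdf function from the question without plyr. The following is only a sketch in base R: the function name splitdf_stratified and its group and prop arguments are assumptions made up for this illustration, not part of the original function.

```r
# Hypothetical stratified variant of splitdf(); `group` names the grouping
# column and `prop` is the share of each group sent to the training set.
splitdf_stratified <- function(dataframe, group, prop = 0.5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)

  # Row indices belonging to each level of the grouping column
  idx_by_group <- split(seq_len(nrow(dataframe)), dataframe[[group]])

  # Draw the same proportion of row indices within every group
  trainindex <- unlist(lapply(idx_by_group, function(idx) {
    idx[sample.int(length(idx), trunc(length(idx) * prop))]
  }), use.names = FALSE)

  # Everything not drawn for training goes to the test set
  testindex <- setdiff(seq_len(nrow(dataframe)), trainindex)

  list(trainset = dataframe[trainindex, ], testset = dataframe[testindex, ])
}

splits <- splitdf_stratified(iris, group = "Species", seed = 808)
table(splits$trainset$Species)  # 25 of each species
table(splits$testset$Species)   # 25 of each species
```

Because the sampling happens once per group and the test indices are derived from the training indices, the seed only needs to be set once.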
answered 2013-09-22T18:42:55.057