r - How to split data into training/testing sets using the sample function

Question

I've just started using R and I'm not sure how to incorporate my dataset with the following sample code:

sample(x, size, replace = FALSE, prob = NULL)

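I have a dataset that I need to put into a training (75%) and a testing (25%) set. I'm not sure what information I'm supposed to put into x and size. Is x the dataset file, and size how many samples I have?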

score 297 · Accepted Answer

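There are numerous approaches to data partitioning. For a more complete approach, take a look at the createDataPartition function from the caret package.

Here is a simple example: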

data(mtcars)

## 75% of the sample size
smp_size <- floor(0.75 * nrow(mtcars))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(mtcars)), size = smp_size)

train <- mtcars[train_ind, ]
test <- mtcars[-train_ind, ]
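
To tie this back to the question: x is the set of things to draw from (here, the row numbers) and size is how many of them to draw. As a rough sketch, the same steps can be wrapped in a small helper. The name split_train_test and its arguments are made up for this illustration and are not part of base R.

```r
# Hypothetical helper (not part of base R): split a data frame by row index.
split_train_test <- function(df, train_frac = 0.75, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)            # optional: reproducible split
  n_train   <- floor(train_frac * nrow(df))     # number of training rows
  train_ind <- sample(seq_len(nrow(df)), size = n_train)
  test_ind  <- setdiff(seq_len(nrow(df)), train_ind)  # all remaining rows
  list(train = df[train_ind, , drop = FALSE],
       test  = df[test_ind, , drop = FALSE])
}

parts <- split_train_test(mtcars, train_frac = 0.75, seed = 123)
nrow(parts$train)  # 24 of the 32 rows
nrow(parts$test)   # the remaining 8 rows
```

Using setdiff() instead of a negative index is a small design choice: it still behaves sensibly if the training index happens to be empty.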

score 113 · Accepted Answer

It can be done easily like this:

set.seed(101) # Set the seed so that the same sample can be reproduced later
# Select 75% of the 'n' rows of the data as the training sample
# (named train_idx rather than sample, to avoid masking the sample() function)
train_idx <- sample.int(n = nrow(data), size = floor(.75*nrow(data)), replace = FALSE)
train <- data[train_idx, ]
test  <- data[-train_idx, ]

Or by using the caTools package:

require(caTools)
set.seed(101) 
sample = sample.split(data$anycolumn, SplitRatio = .75)
train = subset(data, sample == TRUE)
test  = subset(data, sample == FALSE)

score 39 · Accepted Answer

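I would use dplyr for this; it makes it super simple. It does require an id variable in your data set, which is a good idea anyway, not only for creating sets but also for traceability during your project. Add it if the data doesn't contain one already.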

library(dplyr)  # provides %>%, sample_frac() and anti_join()

mtcars$id <- 1:nrow(mtcars)
train <- mtcars %>% dplyr::sample_frac(.75)
test  <- dplyr::anti_join(mtcars, train, by = 'id')

score 31 · Accepted Answer

This is almost the same code, but it looks nicer

bound <- floor((nrow(df)/4)*3)         #define % of training and test set

df <- df[sample(nrow(df)), ]           #sample rows 
df.train <- df[1:bound, ]              #get training set
df.test <- df[(bound+1):nrow(df), ]    #get test set

score 23 · Accepted Answer

library(caret)
# the index is built from sub_train, so the same data frame is subset below
intrain<-createDataPartition(y=sub_train$classe,p=0.7,list=FALSE)
training<-sub_train[intrain,]
testing<-sub_train[-intrain,]

score 22 · Accepted Answer

I split 'a' into train (70%) and test (30%)

    a # original data frame
    library(dplyr)
    train<-sample_frac(a, 0.7)
    sid<-as.numeric(rownames(train)) # because rownames() returns character
    test<-a[-sid,]

Done

score 17 · Accepted Answer

My solution is basically the same as dickoa's, but a little easier to interpret:

data(mtcars)
n = nrow(mtcars)
trainIndex = sample(1:n, size = round(0.7*n), replace=FALSE)
train = mtcars[trainIndex ,]
test = mtcars[-trainIndex ,]

score 12 · Accepted Answer

May I suggest using the rsample package:

library(rsample)

# choosing 75% of the data to be the training data
data_split <- initial_split(data, prop = .75)
# extracting training data and test data as two separate data frames
data_train <- training(data_split)
data_test  <- testing(data_split)

score 7 · Accepted Answer

Just a more concise way using the awesome dplyr library:

library(dplyr)
set.seed(275) #to get repeatable data

data.train <- sample_frac(Default, 0.7)

train_index <- as.numeric(rownames(data.train))
data.test <- Default[-train_index, ]

score 7 · Accepted Answer

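After looking through all the different methods posted here, I didn't see anyone use TRUE/FALSE to select and unselect data. So I thought I would share a method that uses that technique.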

n = nrow(dataset)
split = sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.75, 0.25))

training = dataset[split, ]
testing = dataset[!split, ]

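Explanation

There are multiple ways of selecting data in R. Most commonly, people use positive/negative indices to select/unselect respectively. However, the same functionality can be achieved by using TRUE/FALSE to select/unselect.

Consider the following example.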

# let's explore ways to select every other element
data = c(1, 2, 3, 4, 5)


# using positive indices to select wanted elements
data[c(1, 3, 5)]
[1] 1 3 5

# using negative indices to remove unwanted elements
data[c(-2, -4)]
[1] 1 3 5

# using booleans to select wanted elements
data[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
[1] 1 3 5

# R recycles the TRUE/FALSE vector if it is not the correct dimension
data[c(TRUE, FALSE)]
[1] 1 3 5
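
One thing worth knowing about the TRUE/FALSE split: every row is assigned independently, so the 75/25 ratio is only approximate. A quick check, as a sketch with an arbitrary seed and an arbitrary row count:

```r
set.seed(42)                      # arbitrary seed, only for repeatability
n <- 1000                         # pretend the dataset has 1000 rows
split <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.75, 0.25))

sum(split)      # rows that would go to training: close to 750
sum(!split)     # rows that would go to testing: close to 250
mean(split)     # observed training fraction, near 0.75 but rarely exact
```

If an exact 75% is required, sampling row indices with a fixed size is the safer choice.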

score 6 · Accepted Answer

The scorecard package has a useful function for this, where you can specify the ratio and the seed

library(scorecard)

dt_list <- split_df(mtcars, ratio = 0.75, seed = 66)

Both the test and train data are stored in a list and can be accessed by calling dt_list$train and dt_list$test

score 5 · Accepted Answer

If you type:

?sample

it will open a help page that explains what the parameters of the sample function mean.
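
For a quick feel of the two arguments asked about, here is a small sketch with arbitrary numbers: x is the vector to draw from, and size is how many elements to draw.

```r
set.seed(1)                         # arbitrary seed for repeatability
x <- 1:10                           # the values to draw from

picked <- sample(x, size = 3)       # three distinct values from x
picked

shuffled <- sample(x)               # size defaults to length(x): a permutation
shuffled

# For a data frame, draw row numbers and use them as an index
rows <- sample(nrow(mtcars), size = 5)
mtcars[rows, ]
```

So for a train/test split, x is not the dataset itself but its row numbers, and size is the number of rows wanted in the training set.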

I'm not an expert, but here is some code I have:

data <- data.frame(matrix(rnorm(400), nrow=100))
splitdata <- split(data[1:nrow(data),],sample(rep(1:4,as.integer(nrow(data)/4))))
test <- splitdata[[4]]  # hold out one of the four groups
train <- rbind(splitdata[[1]],splitdata[[2]],splitdata[[3]])

This will give you 75% train and 25% test.

score 4 · Accepted Answer

My solution shuffles the rows, then takes the first 75% of the rows as train and the last 25% as test. Super simple!

row_count <- nrow(orders_pivotted)
shuffled_rows <- sample(row_count)
train_count <- floor(row_count*0.75)
train <- orders_pivotted[head(shuffled_rows,train_count),]
# take everything left over, so no row is dropped by rounding
test <- orders_pivotted[tail(shuffled_rows,row_count-train_count),]

score 2 · Accepted Answer

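Below is a function that creates a list of sub-samples of the same size. It is not exactly what you wanted, but it might prove useful for others. In my case, it was used to build multiple classification trees on smaller samples to test for overfitting: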

df_split <- function (df, number){
  sizedf      <- length(df[,1])
  bound       <- sizedf/number
  list        <- list() 
  for (i in 1:number){
    list[i] <- list(df[((i*bound+1)-bound):(i*bound),])
  }
  return(list)
}

Example:

x <- matrix(c(1:10), ncol=1)
x
# [,1]
# [1,]    1
# [2,]    2
# [3,]    3
# [4,]    4
# [5,]    5
# [6,]    6
# [7,]    7
# [8,]    8
# [9,]    9
#[10,]   10

x.split <- df_split(x,5)
x.split
# [[1]]
# [1] 1 2

# [[2]]
# [1] 3 4

# [[3]]
# [1] 5 6

# [[4]]
# [1] 7 8

# [[5]]
# [1] 9 10
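
Note that df_split assumes the number of rows divides evenly by number. A possible variant for the uneven case, offered only as a sketch (the name split_chunks is made up here), lets split() do the grouping:

```r
# Hypothetical helper: cut a data frame into k consecutive chunks of
# near-equal size, even when nrow(df) is not a multiple of k.
split_chunks <- function(df, k) {
  n <- nrow(df)
  # group label for every row, in order: 1, 1, ..., 2, 2, ..., k
  grp <- ceiling(seq_len(n) * k / n)
  split(df, grp)
}

chunks <- split_chunks(mtcars, 5)   # 32 rows into 5 chunks
sizes <- sapply(chunks, nrow)
sizes                               # chunk sizes differ by at most one row
```

Shuffle the rows first (for example with df[sample(nrow(df)), ]) if the chunks should be random rather than consecutive.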

score 2 · Accepted Answer

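Using base R. The function runif generates uniformly distributed values between 0 and 1. By varying the cutoff value (train.size in the example below), you will always get approximately the same percentage of random records below the cutoff.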

data(mtcars)
set.seed(123)

#desired proportion of records in training set
train.size<-.7
#true/false vector of values above/below the cutoff above
train.ind<-runif(nrow(mtcars))<train.size

#train
train.df<-mtcars[train.ind,]


#test
test.df<-mtcars[!train.ind,]
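
Because each row is compared with the cutoff independently, the training share is only approximately train.size. If an exact count is wanted, one option (a sketch, not the method used above) is to rank the random numbers and keep the smallest ones:

```r
data(mtcars)
set.seed(123)

train.size <- 0.7
n.train <- floor(train.size * nrow(mtcars))   # exact number of training rows

u <- runif(nrow(mtcars))                      # one random number per row
# TRUE for the n.train rows with the smallest random numbers
train.ind <- rank(u, ties.method = "first") <= n.train

train.df <- mtcars[train.ind, ]
test.df  <- mtcars[!train.ind, ]
c(train = nrow(train.df), test = nrow(test.df))   # 22 and 10
```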

score 2 · Accepted Answer

Assuming df is your data frame and you want to create 75% train and 25% test

all <- 1:nrow(df)
train_i <- sort(sample(all, round(nrow(df)*0.75,digits = 0),replace=FALSE))
test_i <- all[-train_i]

Then create the train and test data frames

df_train <- df[train_i,]
df_test <- df[test_i,]

score 2 · Accepted Answer

Using the caTools package in R, sample code is as follows:

library(caTools)
# data is your data frame
split = sample.split(data$DependentcoloumnName, SplitRatio = 0.6)
training_set = subset(data, split == TRUE)
test_set = subset(data, split == FALSE)

score 2 · Accepted Answer

require(caTools)

set.seed(101)            #This is used to create same samples everytime

split1=sample.split(data$anycol,SplitRatio=2/3)

train=subset(data,split1==TRUE)

test=subset(data,split1==FALSE)

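The sample.split() function returns a logical vector, stored here in split1, with one entry per row: about 2/3 of the entries are TRUE and the rest are FALSE. It does not add a column to the data frame. The rows where split1 is TRUE are then copied into train and the other rows into test.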
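
sample.split() also tries to keep the proportions of the values in the column you pass roughly the same in both parts. For readers without caTools, a rough base R approximation of that idea is sketched below. stratified_index is a made-up name, and this is not how caTools implements it.

```r
# Hypothetical helper: pick about `ratio` of the rows within each level of y,
# so that both parts keep similar class proportions.
stratified_index <- function(y, ratio = 2/3) {
  idx_by_level <- split(seq_along(y), y)       # row numbers grouped by level
  picked <- lapply(idx_by_level, function(idx) {
    n_take <- round(ratio * length(idx))
    idx[sample.int(length(idx), n_take)]       # sample positions, map to rows
  })
  in_train <- rep(FALSE, length(y))
  in_train[unlist(picked)] <- TRUE
  in_train                                     # logical vector, TRUE = training
}

set.seed(101)
split1 <- stratified_index(iris$Species, ratio = 2/3)
train  <- iris[split1, ]
test   <- iris[!split1, ]
table(train$Species)   # 33 rows from each of the three species
```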

score 2 · Accepted Answer

We can divide the data in a particular ratio; here it is 80% in the train set and 20% in the test set.

ind <- sample(2, nrow(dataName), replace = TRUE, prob = c(0.8,0.2))
train <- dataName[ind==1, ]
test <- dataName[ind==2, ]

score 1 · Accepted Answer

I bumped into this one, and it can help too.

set.seed(12)
data = Sonar[sample(nrow(Sonar)),] # reshuffles the data
bound = floor(0.7 * nrow(data))
df_train = data[1:bound,]
df_test = data[(bound+1):nrow(data),]

score 1 · Accepted Answer

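Beware of using sample for splitting if you are looking for reproducible results. If your data changes even slightly, the split will vary even if you use set.seed. For example, imagine the sorted list of IDs in your data is all the numbers between 1 and 10. If you drop just one observation, say 4, sampling by position yields different results, because the IDs 5 to 10 have all moved places.

An alternative method is to use a hash function to map IDs to pseudo-random numbers and then sample on the mod of these numbers. This sample is more stable, because the assignment is now determined by the hash of each observation and not by its relative position.

For example: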

require(openssl)  # for md5
require(data.table)  # for the demo data

set.seed(1)  # this won't help `sample`

population <- as.character(1e5:(1e6-1))  # some made up ID names

N <- 1e4  # sample size

sample1 <- data.table(id = sort(sample(population, N)))  # randomly sample N ids
sample2 <- sample1[-sample(N, 1)]  # randomly drop one observation from sample1

# samples are all but identical
sample1
sample2
nrow(merge(sample1, sample2))

[1] 9999

# row splitting yields very different test sets, even though we've set the seed
test <- sample(N-1, N/2, replace = F)

test1 <- sample1[test, .(id)]
test2 <- sample2[test, .(id)]
nrow(test1)

[1] 5000

nrow(merge(test1, test2))

[1] 2653

# to fix that, we can use a hash function and sample on the first hex digit of the hash

md5_bit_mod <- function(x, m = 2L) {
  # Inputs: 
  #  x: a character vector of ids
  #  m: the modulo divisor (modify for split proportions other than 50:50)
  # Output: remainders from dividing the first digit of the md5 hash of x by m
  as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
}

# hash splitting preserves the similarity, because the assignment of test/train 
# is determined by the hash of each obs., and not by its relative location in the data
# which may change 
test1a <- sample1[md5_bit_mod(id) == 0L, .(id)]
test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]
nrow(merge(test1a, test2a))

[1] 5057

nrow(test1a)

[1] 5057

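The sample size is not exactly 5000 because the assignment is probabilistic, but this shouldn't be a problem in large samples thanks to the law of large numbers.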
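
The key property is that each ID's assignment depends only on the ID itself. That can be illustrated without any packages by using a toy hash. toy_hash_mod below is invented for this sketch and is far weaker than md5, so treat it as a demonstration only.

```r
# Toy string hash (illustration only, not cryptographic): fold the character
# codes of each id into a number, then take it modulo m.
toy_hash_mod <- function(ids, m = 2L) {
  vapply(ids, function(id) {
    h <- 0
    for (code in utf8ToInt(id)) h <- (h * 31 + code) %% 1000003
    as.integer(h %% m)
  }, integer(1), USE.NAMES = FALSE)
}

ids_full    <- as.character(1001:1020)   # made-up ids
ids_dropped <- ids_full[-4]              # same ids with one observation removed

test_full    <- ids_full[toy_hash_mod(ids_full) == 0L]
test_dropped <- ids_dropped[toy_hash_mod(ids_dropped) == 0L]

# Every id that survives keeps its original assignment
setdiff(test_dropped, test_full)         # character(0)
```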

See also: http://blog.richardweiss.org/2016/12/25/hash-splits.html and https://crypto.stackexchange.com/questions/20742/statistical-properties-of-hash-functions-when-calculating-modulo

score 1 · Accepted Answer

Create an index column "rowid" and use an anti join to filter with by = "rowid". You can remove the rowid column with %>% select(-rowid) after the split.

library(dplyr)

data <- tibble::rowid_to_column(data)

set.seed(11081995)

testdata <- data %>% slice_sample(prop = 0.2)

traindata <- anti_join(data, testdata, by = "rowid")

score 0 · Accepted Answer

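I prefer using dplyr to mutate the values

set.seed(1)
# runif(n()) draws one random number per row; runif(1) would give every row the same value
mutate(x, train = runif(n()) < 0.75)

I can then keep using dplyr::filter with a helper function such as

data.split <- function(is_train = TRUE) {
    set.seed(1)
    mutate(x, train = runif(n()) < 0.75) %>%
    filter(train == is_train)
}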

score 0 · Accepted Answer

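For cases where I'm working with multiple data tables and don't want to repeat code, I wrote a function (my first one, so it may not work perfectly) to make this quicker.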

xtrain <- function(data, proportion, t1, t2){
  data <- data %>% rowid_to_column("rowid")
  train <- slice_sample(data, prop = proportion)
  assign(t1, train, envir = .GlobalEnv)
  test <- data %>% anti_join(as.data.frame(train), by = "rowid")
  assign(t2, test, envir = .GlobalEnv)
}

xtrain(iris, .80, 'train_set', 'test_set')

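You need to have dplyr and tibble loaded. The function takes a given dataset, the proportion you want to use for sampling, and two object names. It creates the tables and then assigns them as objects in your global environment.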

score 0 · Accepted Answer

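I think this would solve the problem:

df = read.csv("data.csv")
# Split the dataset into 80-20
numberOfRows = nrow(df)
bound = as.integer(numberOfRows *0.8)
# first 80% of the rows in file order (no shuffling), column 2 only
train=df[1:bound ,2]
# remaining 20% of the rows, column 2 only
test1= df[(bound+1):numberOfRows ,2]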

score 0 · Accepted Answer

Try using idx <- sample(2, nrow(data), replace = TRUE, prob = c(0.75, 0.25)) and then use the resulting ids to access the split data: training <- data[idx == 1,] and testing <- data[idx == 2,]

score 0 · Accepted Answer

set.seed(123)
# nrow() counts rows; length() on a data frame would count columns
llwork<-sample(seq_len(nrow(mydata)),round(0.75*nrow(mydata),digits=0))
wmydata<-mydata[llwork, ]
tmydata<-mydata[-llwork, ]

score -2 · Accepted Answer

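There is a very simple way to select a number of rows and columns using R's indexing. It lets you split the data set cleanly at a given number of rows, say the first 80% of your data.

In R, all rows and columns are indexed, so DataSetName[1,1] is the value in the first row and first column of "DataSetName". I can select rows using [x,] and columns using [,x]

For example: if I have a data set conveniently named "data" with 100 rows, I can view the first 80 rows using

View(data[1:80,])

In the same way, I can select these rows and subset them using:

train = data[1:80,]

test = data[81:100,]

Now I have my data split into two parts without the possibility of resampling. Quick and easy.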
