r - Scale all numeric columns in the train and test sets of a mixed data frame

Question

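The code below scales a training set and a test set. Because Col6 and Col7 must not be scaled, they are dropped from the original data before the train and test sets are scaled: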

library(tidyverse)

Data_Frame <- data.frame(Col1 = c("A1", "A1", "A1", "A2", "A2", "A2", "A3", "A3", "A3"),
                         Col2 = c("2011-03-11", "2014-08-21", "2016-01-17", "2017-06-30", "2018-07-11", "2018-11-28", "2019-09-04", "2020-02-29", "2020-07-12"),
                         Col3 = c("2018-10-22", "2019-05-24", "2020-12-25", "2018-10-12", "2019-09-24", "2020-12-19", "2018-10-22", "2019-06-14", "2020-12-20"),
                         Col4 = c(4, 12, 2, 1, 4, 4, 75, 4, 44),
                         Col5 = c(7.81, 6.45, 3, 1, 3, 2, 5, 1, 2),
                         Col6 = c(1, 1, 1, 1, 1, 1, 1, 1, 1),
                         Col7 = c(2, 2, 2, 2, 2, 2, 2, 2, 2),
                         Col8 = c(7.77, 6, 8.4, -11.23, 3.5, 7.2, 15, 100, 22.22))

# randomly split data in r
sample_size = floor(0.8*nrow(Data_Frame))
set.seed(777)
picked = sample(seq_len(nrow(Data_Frame)),size = sample_size)
Train_Set = Data_Frame[picked,]
Test_Set = Data_Frame[-picked,]

# Remove columns Col6 and Col7, which will not be scaled
Train <- Train_Set %>% dplyr::select(- c(Col6, Col7))
Test <- Test_Set %>% dplyr::select(- c(Col6, Col7))

# Scale Train, collect mean and sd to scale in Test
Train_Scale <- Train %>% dplyr::mutate_if(is.numeric, ~scale(.) %>% as.vector)
num_cols <- names(which(sapply(Train,is.numeric)))
scale_params <- attributes(scale(Train[,num_cols]))[c("scaled:center","scaled:scale")]

# Scale Test with the scales of Train
Test_Scale <- Test
Test_Scale[,num_cols] = scale(Test_Scale[,num_cols],center=scale_params[[1]],scale=scale_params[[2]])

Trying

varnames <- c('Col6', 'Col7')
index <- names(Train_Set) %in% varnames
Train_Scale_Check <- Train_Set[, !index] %>% dplyr::mutate_if(is.numeric, ~scale(.) %>% as.vector)

works, but removes Col6 and Col7 from the data frame.

And,

Train_Scale_Check <- Train_Set %>% dplyr::mutate_if(is.numeric, !index, ~scale(.) %>% as.vector)

throws the following error:

Error: expecting a one sided formula, a function, or a function name.
Run `rlang::last_error()` to see where the error occurred.

rlang::last_error()
<error/rlang_error>
expecting a one sided formula, a function, or a function name.
Backtrace:
 1. dplyr::mutate_if(...)
 2. dplyr:::manip_if(...)
 3. dplyr:::as_fun_list(.funs, .env, ..., .caller = .caller)
 4. dplyr:::map(...)
 5. base::lapply(.x, .f, ...)
 6. dplyr:::FUN(X[[i]], ...)
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/rlang_error>
expecting a one sided formula, a function, or a function name.
Backtrace:
    x
 1. \-dplyr::mutate_if(...)
 2.   \-dplyr:::manip_if(...)
 3.     \-dplyr:::as_fun_list(.funs, .env, ..., .caller = .caller)
 4.       \-dplyr:::map(...)
 5.         \-base::lapply(.x, .f, ...)
 6.           \-dplyr:::FUN(X[[i]], ...)

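Is there a simple way to keep Col6 and Col7 in the Train_Set and Test_Set data sets without scaling them? The long-winded way would be to extract Col6 and Col7 into a separate data frame, run the code at the top, and finally cbind the Col6 and Col7 data frame back on.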

score 0 · Accepted Answer

The following solved the problem (thanks to @27 φ9 for the suggestion).

Scale the training set on the desired columns only (skipping Col6 and Col7):

# Columns that must be kept as they are
varnames <- c('Col6', 'Col7')
# Scale every numeric column except those listed in varnames
Train_Scale <- Train_Set %>% mutate(across(where(is.numeric) & !all_of(varnames), ~ as.vector(scale(.x))))
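
As a side note on the error in the question: `mutate_if()` takes the predicate first and `.funs` second, so the extra `!index` argument is read as the function to apply, which is why dplyr complains that it expected a formula or a function. The `across()` form avoids this. Below is a minimal sketch on a small made-up data frame to check that excluded columns really stay untouched. The names `toy`, `skip`, and `keep1` are hypothetical and not part of the question's data.

```r
library(dplyr)

# Hypothetical toy data: x and y should be scaled, keep1 should not
toy <- data.frame(id    = c("a", "b", "c", "d"),
                  x     = c(1, 2, 3, 4),
                  y     = c(10, 20, 40, 80),
                  keep1 = c(1, 1, 1, 1))
skip <- c("keep1")

# Scale numeric columns except the ones named in skip
toy_scaled <- toy %>%
  mutate(across(where(is.numeric) & !all_of(skip), ~ as.vector(scale(.x))))

# The excluded column is unchanged
identical(toy_scaled$keep1, toy$keep1)

# The scaled columns have mean 0 and standard deviation 1 (up to rounding)
sapply(toy_scaled[c("x", "y")], function(v) c(mean = mean(v), sd = sd(v)))
```

Note that scaling a constant column such as `keep1` would return NaN, because its standard deviation is 0.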

Collect the scaling parameters:

num_cols <- names(which(sapply(subset(Train_Set, select=-c(Col6, Col7)), is.numeric)))
scale_params <- attributes(scale(Train_Set[,num_cols]))[c("scaled:center","scaled:scale")]
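
For reference, `scale()` stores the column means and standard deviations it used as attributes on the matrix it returns, which is what the lines above pick up. A tiny sketch with made-up numbers (the object `m` is only for illustration):

```r
# Two numeric columns with easy-to-check means and standard deviations
m <- scale(data.frame(a = c(1, 2, 3), b = c(10, 20, 30)))

attr(m, "scaled:center")  # column means: a = 2, b = 20
attr(m, "scaled:scale")   # column standard deviations: a = 1, b = 10
```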

Apply the training scales to the test data:

Test_Scale <- Test_Set
Test_Scale[,num_cols] = scale(Test_Scale[,num_cols],center=scale_params[[1]],scale=scale_params[[2]])
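
If this is needed more than once, the three steps can be wrapped in one function. The following is only a sketch in base R under the same assumptions as above. The helper name `scale_with_train` and its `skip` argument are hypothetical, not an existing API.

```r
# Hypothetical helper: fit center/scale on `train`, apply them to both sets,
# leaving the columns named in `skip` and all non-numeric columns untouched.
scale_with_train <- function(train, test, skip = character()) {
  is_num <- vapply(train, is.numeric, logical(1))
  cols <- setdiff(names(train)[is_num], skip)

  # Fit on the training data only
  fit <- scale(train[, cols, drop = FALSE])
  center <- attr(fit, "scaled:center")
  spread <- attr(fit, "scaled:scale")

  # Write the scaled values back into the original column positions
  train[cols] <- as.data.frame(fit)
  test[cols] <- as.data.frame(
    scale(test[, cols, drop = FALSE], center = center, scale = spread)
  )

  list(train = train, test = test, center = center, scale = spread)
}

# Usage on made-up data: x is scaled, flag is kept as it is
train <- data.frame(g = c("a", "b", "c"), x = c(1, 2, 3), flag = c(1, 1, 1))
test  <- data.frame(g = c("d", "e"), x = c(2, 5), flag = c(1, 1))
out <- scale_with_train(train, test, skip = "flag")

out$train$x  # (1, 2, 3) centered on 2 with sd 1
out$test$x   # (2, 5) scaled with the training mean 2 and sd 1
```

Returning the center and scale alongside the data keeps the training parameters available for any later data that has to be scaled the same way.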
