
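I'm trying to completely remove special characters such as "-", "/", ")" and "(" from my data frame. However, my data frame contains only one observation, because it is the input to a model that will be used in production. I've explicitly defined the factor levels for the data frame.

I tried the following: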

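library(magrittr)  # provides the %>% pipe

# Replace runs of whitespace and the characters ( ) / - with underscores
sanitize_string <- function(string) {
  gsub("\\s+", "_", string) %>%
    gsub("[(]", "_", .) %>%
    gsub("[)]", "_", .) %>%
    gsub("[/]", "_", .) %>%
    gsub("[-]", "_", .)
}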

Then:

 df <- as.data.frame(lapply(df, function(dataframe) sapply(dataframe, sanitize_string)), stringsAsFactors=FALSE)

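But when I do this, I lose my factor levels: every factor is treated as having just one level. This causes problems when I try to get predictions from my model, because sparse.model.matrix needs 2 or more levels per factor, whereas in production only a single observation will ever be sent.

Thanks.

Here is my data frame: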
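To make the symptom concrete, here is a minimal reproduction on a toy one-row data frame. The column name `status` and the base-R one-line version of `sanitize_string` are assumptions made for illustration; they are not the real production data.

```r
# Toy one-row data frame; the column name is made up for illustration.
df <- data.frame(
  status = factor("Employed",
                  levels = c("Employed", "Full-Time Education(Student)"))
)
nlevels(df$status)  # 2

# Base-R stand-in for sanitize_string (same substitutions, no pipe needed).
sanitize_string <- function(string) gsub("\\s+|[()/-]", "_", string)

# sapply() returns a plain character vector, so the factor (and its
# predefined levels) is gone after this step.
cleaned <- as.data.frame(
  lapply(df, function(column) sapply(column, sanitize_string)),
  stringsAsFactors = FALSE
)
class(cleaned$status)  # "character"

# Converting back to a factor only recovers the single value that is present.
nlevels(factor(cleaned$status))  # 1
```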

 $ children_under16                : Factor w/ 2 levels "No","Yes": 1
 $ ft_employment_status            : Factor w/ 5 levels "Employed","Full-Time Education(Student)",..: 1
 $ fuel_type                       : Factor w/ 2 levels "D","P": 2
 $ homeowner                       : Factor w/ 2 levels "FALSE","TRUE": 2
 $ marital_status                  : Factor w/ 6 levels "Married","Separated",..: 1
 $ overnight_loc                   : Factor w/ 7 levels "In a private Driveway",..: NA
 $ usage_type                      : Factor w/ 3 levels "CLASS_1","SDPC",..: 1
 $ licence_type                    : Factor w/ 3 levels "UK","European",..: 1
 $ yad_relationship_to_policyholder: Factor w/ 8 levels "Spouse","No_YAD",..: 1
 $ A                          : Factor w/ 7 levels "1","2","5","3",..: 1
 $ B                          : Factor w/ 19 levels "C","E","Q","D",..: 1
 $ C                           : Factor w/ 63 levels "11","19","58",..: 1
 $ region                          : Factor w/ 12 levels "Yorkshire and The Humber",..: 1
 $ D                      : Factor w/ 28 levels "Semi-Detached Suburbia",..: 27
 $ E                   : Factor w/ 77 levels "Families in Terraces and Flats",..: 77
 $ F                 : Factor w/ 9 levels "Suburbanites",..: 1
 $ industry_band                   : Factor w/ 18 levels "13","14","15",..: 14
 $ occ_band_goco                   : Factor w/ 17 levels "0","1","2","3",..: 2
 $ transmission                    : Factor w/ 2 levels "A","M": 2
 $ vehicle_make                    : Factor w/ 19 levels "OTHER","AUDI",..: 1
 $ vehicle_type           : Factor w/ 17 levels "Mid Exec Saloon/Estate/Coupe",..: 1
 $ rural_urban                     : Factor w/ 19 levels "Urban major conurbation",..: 2
 $ water_company                   : Factor w/ 23 levels "Affinity Water",..: 23
 $ seats                           : Factor w/ 6 levels "-99","2","4",..:



1 Answer


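You can sanitize the levels of the factors rather than the columns. This preserves the order of the levels, though if your sanitizing maps two distinct levels to the same string, assigning to levels() merges them into a single level, which will cause problems downstream. I'd just use a for loop: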

for (i in seq_along(df)) {
  if (is.factor(df[[i]])) {
    # Rename the levels in place; the column stays a factor
    levels(df[[i]]) <- sanitize_string(levels(df[[i]]))
  }
}
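As a rough check of the idea, here is a sketch on a toy one-row data frame. The column names, the level "Small Hatchback", the numeric `age` column and the one-line base-R `sanitize_string` are all assumptions for illustration. The `anyDuplicated()` guard is an optional extra, not part of the loop above: it stops before two distinct levels get collapsed into one.

```r
# Base-R stand-in for sanitize_string (same substitutions as the question).
sanitize_string <- function(string) gsub("\\s+|[()/-]", "_", string)

# Hypothetical single observation with predefined factor levels.
df <- data.frame(
  status  = factor("Employed",
                   levels = c("Employed", "Full-Time Education(Student)")),
  vehicle = factor("Mid Exec Saloon/Estate/Coupe",
                   levels = c("Mid Exec Saloon/Estate/Coupe", "Small Hatchback")),
  age     = 42
)

for (i in seq_along(df)) {
  if (is.factor(df[[i]])) {
    new_levels <- sanitize_string(levels(df[[i]]))
    # Optional guard: fail loudly if sanitizing would merge two levels.
    if (anyDuplicated(new_levels) > 0) {
      stop("sanitizing merges levels in column ", names(df)[i])
    }
    levels(df[[i]]) <- new_levels
  }
}

levels(df$status)         # "Employed" "Full_Time_Education_Student_"
as.character(df$vehicle)  # "Mid_Exec_Saloon_Estate_Coupe"
nlevels(df$vehicle)       # 2
```

Non-factor columns such as `age` are skipped, and every factor keeps its full set of levels even though only one row is present.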

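I can't test this on the structure you posted, but if you run into problems, please share some data using dput() so I can copy/paste it (e.g. dput(df[1:10, ]) or some other small subset that illustrates the problem), and I'll be happy to test and refine.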

Answered 2020-03-13T13:54:49.780