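I'm trying to completely remove special characters such as "-", "/", ")" and "(" from my data frame. However, my data frame contains only a single observation, because it is the input to a model that will be used in production. I've explicitly defined the factor levels for the data frame.
I tried the following: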
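library(magrittr)  # provides the %>% pipe

# Replace whitespace runs and the characters ( ) / - with underscores
sanitize_string <- function(string) {
  gsub("\\s+", "_", string) %>%
    gsub("[(]", "_", .) %>%
    gsub("[)]", "_", .) %>%
    gsub("[/]", "_", .) %>%
    gsub("[-]", "_", .)
}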
Then:
df <- as.data.frame(lapply(df, function(dataframe) sapply(dataframe, sanitize_string)), stringsAsFactors=FALSE)
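But when I do this, I lose my factor levels: every factor is treated as having only one level. That causes a problem when I try to get predictions from my model, because sparse.model.matrix needs 2 or more levels per factor, whereas in production only one observation will ever be sent.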
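To make the problem concrete: sapply() turns each factor column into a plain character vector, so the unused levels are gone before the data frame is rebuilt. One possible way around it, given here only as a minimal sketch, is to sanitize the level labels themselves rather than the values. The helper name sanitize_levels, the condensed regex, and the toy data (including the made-up level "Small Hatch") are assumptions for illustration, not part of my real code.

```r
# Condensed version of the sanitizer above: whitespace runs and ( ) / -
# each become an underscore.
sanitize_string <- function(string) {
  gsub("[()/-]", "_", gsub("\\s+", "_", string))
}

# Hypothetical helper: rewrite the labels of every factor column in place,
# so unused levels are kept.
sanitize_levels <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  df[is_fac] <- lapply(df[is_fac], function(column) {
    levels(column) <- sanitize_string(levels(column))
    column
  })
  df
}

# One-row toy data frame with explicitly defined levels (made-up values).
df <- data.frame(
  vehicle_type = factor("Mid Exec Saloon/Estate/Coupe",
                        levels = c("Mid Exec Saloon/Estate/Coupe", "Small Hatch")),
  fuel_type    = factor("P", levels = c("D", "P"))
)

clean <- sanitize_levels(df)
levels(clean$vehicle_type)  # "Mid_Exec_Saloon_Estate_Coupe" "Small_Hatch"
nlevels(clean$fuel_type)    # 2
```

One caveat with this sketch: if two different levels end up with the same sanitized label, assigning to levels() merges them into a single level.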
Thanks.
Here is my data frame:
$ children_under16 : Factor w/ 2 levels "No","Yes": 1
$ ft_employment_status : Factor w/ 5 levels "Employed","Full-Time Education(Student)",..: 1
$ fuel_type : Factor w/ 2 levels "D","P": 2
$ homeowner : Factor w/ 2 levels "FALSE","TRUE": 2
$ marital_status : Factor w/ 6 levels "Married","Separated",..: 1
$ overnight_loc : Factor w/ 7 levels "In a private Driveway",..: NA
$ usage_type : Factor w/ 3 levels "CLASS_1","SDPC",..: 1
$ licence_type : Factor w/ 3 levels "UK","European",..: 1
$ yad_relationship_to_policyholder: Factor w/ 8 levels "Spouse","No_YAD",..: 1
$ A : Factor w/ 7 levels "1","2","5","3",..: 1
$ B : Factor w/ 19 levels "C","E","Q","D",..: 1
$ C : Factor w/ 63 levels "11","19","58",..: 1
$ region : Factor w/ 12 levels "Yorkshire and The Humber",..: 1
$ D : Factor w/ 28 levels "Semi-Detached Suburbia",..: 27
$ E : Factor w/ 77 levels "Families in Terraces and Flats",..: 77
$ F : Factor w/ 9 levels "Suburbanites",..: 1
$ industry_band : Factor w/ 18 levels "13","14","15",..: 14
$ occ_band_goco : Factor w/ 17 levels "0","1","2","3",..: 2
$ transmission : Factor w/ 2 levels "A","M": 2
$ vehicle_make : Factor w/ 19 levels "OTHER","AUDI",..: 1
$ vehicle_type : Factor w/ 17 levels "Mid Exec Saloon/Estate/Coupe",..: 1
$ rural_urban : Factor w/ 19 levels "Urban major conurbation",..: 2
$ water_company : Factor w/ 23 levels "Affinity Water",..: 23
$ seats : Factor w/ 6 levels "-99","2","4",..: