r - Changing factor levels - Unknown levels in `f` - cannot change levels

Question

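I have a factor containing many industry names, and I need to collapse them into major categories and sectors. Because I let respondents answer in free text, the number of levels is inflated (e.g. financial services, Financial Services, banking, finance). Since these variants do not match each other, each one shows up as a separate level, so I tried to collapse them with forcats: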

test <- fct_collapse(PrescreenF$Industry, Finance = c("Banking",
  "Corporate Finance", "Finance", "Financial", "financial services",
  "financial services", "Financial Services", "Financial services"),
  NULL = "H")

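I get a warning saying that "financial services" is an unknown level. This is very frustrating, because when I print the vector I can see that it is there. I have tried copying and pasting the exact string from the output and retyping it by hand; it looks as if hidden characters are preventing it from being changed.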

How can I collapse these values correctly?

-> test$industry
Banking
Corporate Finance 
Finance Financial 
financial services
financial services 
Financial Services 
Financial services

When I try to "revalue", say, the last level, "Financial services", it tells me it is an unknown string.

Edit: output of dput(x$industry)

structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 
4L, 3L, 3L, 3L, 5L, 7L, 8L, 9L, 10L, 11L, 12L, 12L, 13L, 14L, 
15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 17L, 18L, 18L, 18L, 
18L, 19L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 25L, 26L, 27L, 28L
), .Label = c("", "{\"ImportId\":\"QID8_TEXT\"}", "Finance", 
"Financial ", "Financial services ", "Please indicate the industry you work in (e.g. technology, healthcare etc):", 
"Cleantech", "Delivery", "e-commerce/fashion", "Food", "Food & Bev", 
"Retail", "Service", "tech", "technology", "Technology", "IT, technology", 
"Software", "Technology ", "Tehcnology", "Consulting", "Digital advertising", 
"Education", "Higher education", "Technology, management consulting", 
"University professor; teaching, research and service", "Information Technology and Services", 
"mobile technology"), class = "factor")

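Edit: figured it out. Some entries have an extra space at the end. For example, when I print Prescreen$Industry it returns names such as "Banking" and "Corporate Finance", but it does not show that there is a space after the level. "Banking" is actually "Banking " with a trailing space that is invisible in R's printed output. How can I make this visible and make sure it does not happen again?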

Can I run a len function on the column? If so, how does it work? PrescreenF$Industry("Banking")?

score 0 · Accepted Answer

If `x` is your data frame:

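library(stringr)

# Convert the factor to character so the strings themselves can be edited
x$industry <- as.character(x$industry)
# Strip leading and trailing whitespace from every value
x$industry <- str_trim(x$industry)
# Rebuild the factor; values that differed only by padding now share one level
x$industry <- as.factor(x$industry)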

Then you can go back to fct_collapse() to simplify your factor.
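To address the follow-up about making the padding visible, here is a minimal base R sketch. It uses a small made-up factor named `industry` (not the asker's actual data), and the `report` data frame is just an illustrative helper. It compares each level with its trimmed version and counts characters with nchar(), then trims the levels directly without the round trip through character.

```r
# Toy factor with padded values; the data are invented for illustration only.
industry <- factor(c("Finance", "Finance ", "Financial ",
                     "Financial services ", "Banking"))

# nchar() counts trailing spaces, and comparing against trimws() flags
# exactly the levels that carry leading or trailing whitespace.
lv <- levels(industry)
report <- data.frame(level   = lv,
                     n_chars = nchar(lv),
                     padded  = lv != trimws(lv))
print(report)

# dput() prints each level inside quotes, so the padding is visible by eye.
dput(lv)

# Trim the levels in place. Assigning duplicate level names to a factor
# merges them, so "Finance" and "Finance " end up as a single level.
levels(industry) <- trimws(levels(industry))
levels(industry)
```

Once the levels are trimmed, the names passed to fct_collapse() should match what is actually stored. If you prefer to stay within forcats, fct_relabel(x$industry, trimws) is likely an equivalent way to do the trimming step.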
