r - Find levels of a factor that appear more than once

Question

I have this dataframe:

data <- data.frame(countries=c(rep('UK', 5),
                           rep('Netherlands 1a', 5),
                           rep('Netherlands', 5),
                           rep('USA', 5), 
                           rep('spain', 5), 
                           rep('Spain', 5),
                           rep('Spain 1a', 5),
                           rep('spain 1a', 5)),
               var=rnorm(40),
               stringsAsFactors=TRUE)  # keeps countries a factor in R >= 4.0

            countries          var
1              UK  0.506232270
2              UK  0.976348808
3              UK -0.752151769
4              UK  1.137267199
5              UK -0.363406715
6  Netherlands 1a -0.800835463
7  Netherlands 1a  1.767724231
8  Netherlands 1a  0.810757929
9  Netherlands 1a -1.188975114
10 Netherlands 1a -0.763144245
11    Netherlands  0.428511920
12    Netherlands  0.835184425
13    Netherlands -0.198316780
14    Netherlands  1.108191193
15    Netherlands  0.946819500
16            USA  0.226786121
17            USA -0.466886468
18            USA -2.217910876
19            USA -0.003472937
20            USA -0.784264921
21          spain -1.418014562
22          spain  1.002412706
23          spain  0.472621627
24          spain -1.378960222
25          spain -0.197020702
26          Spain  1.197971896
27          Spain  1.227648883
28          Spain -0.253083684
29          Spain -0.076562960
30          Spain  0.338882352
31       Spain 1a  0.074459521
32       Spain 1a -1.136391220
33       Spain 1a -1.648418916
34       Spain 1a  0.277264011
35       Spain 1a -0.568411569
36       spain 1a  0.250151646
37       spain 1a -1.527885883
38       spain 1a -0.452190849
39       spain 1a  0.454168927
40       spain 1a  0.889401396

I want to be able to find levels of countries that appear in different forms more than once. Forms that levels of countries might appear in are:

lowercase, for example "spain"
titlecase, for example "Spain"
lowercase with a different word attached, for example "spain 1a"
titlecase with a different word attached, for example "Spain 1a"

So I need a function that returns a vector listing the levels of countries that appear more than once. For data, the vector that should be returned is:

"Netherlands 1a", "Netherlands", "spain", "Spain", "spain 1a", "Spain 1a"

Is it possible to make a function that would return this vector?

score 0 · Accepted Answer

A quick solution that should meet all your requirements (assuming the country name is always the first element of your data$countries entries):

# Country substrings
country.substr <- sapply(strsplit(tolower(levels(data$countries)), " "), "[[", 1)
# Duplicated country substrings
country.substr.dupl <- duplicated(country.substr)

# Display all country levels that appear in different forms
do.call("c", lapply(unique(country.substr[country.substr.dupl]), function(i) {
  levels(data$countries)[grep(i, tolower(levels(data$countries)))]
}))

[1] "Netherlands"    "Netherlands 1a" "spain"          "Spain"          "spain 1a"       "Spain 1a"
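The same first-word idea can be wrapped in a small function. This is only a sketch: the name find_repeated_levels is hypothetical, and it assumes, as above, that the country name is always the first word. It converts its input with as.character(), so it should work whether the column is a factor or a plain character vector.

```r
# Hypothetical helper (not part of the original answer): return every level
# whose lowercased first word is shared with at least one other level.
find_repeated_levels <- function(x) {
  lv  <- sort(unique(as.character(x)))   # distinct forms, factor or character input
  key <- tolower(sub(" .*$", "", lv))    # first word of each form, lowercased
  lv[key %in% key[duplicated(key)]]      # keep forms whose key occurs more than once
}

countries <- c("UK", "Netherlands 1a", "Netherlands", "USA",
               "spain", "Spain", "Spain 1a", "spain 1a")
result <- find_repeated_levels(countries)
print(result)
```

The order of the returned levels follows sort(), which depends on the locale, so compare results with setequal() rather than identical().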

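Update:

If the country name does not always come first, you need a different approach, which I took from here. Note that I slightly modified your sample data to make clear what I am doing: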

data <- data.frame(countries=c(rep('United Kingdom', 5),
                               rep('united kingdom', 5),
                               rep('Netherlands', 5), 
                               rep('Netherlands 1a', 5),
                               rep('1a Netherlands', 5),
                               rep('USA', 5), 
                               rep('spain', 5), 
                               rep('Spain', 5),
                               rep('Spain 1a', 5),
                               rep('spain 1a', 5)),
                   var=rnorm(50),
                   stringsAsFactors=TRUE)  # keeps countries a factor in R >= 4.0

Now let's identify all country substrings that do not contain any digits. The subsequent steps stay the same. Is that what you need?

# Remove mixed numeric/alphabetic parts from country names
country.substr <- lapply(strsplit(tolower(levels(data$countries)), " "), function(i) {
  # Identify, paste and return alphabetic-only components
  tmp <- grep("^[[:alpha:]]*$", i)

  if (length(tmp) == 1)
    return(i[tmp])
  else
    return(paste(i[tmp], collapse = " ")) 
})

# Identify duplicated country names
country.substr.dupl <- duplicated(country.substr)

# Display all country levels that appear in different forms
do.call("c", lapply(unique(country.substr[country.substr.dupl]), function(i) {
  levels(data$countries)[grep(i, tolower(levels(data$countries)))]
}))

[1] "1a Netherlands" "Netherlands"    "Netherlands 1a" "spain"          "Spain"          "spain 1a"       "Spain 1a"       "united kingdom" "United Kingdom"
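As a variation on the steps above, the forms can be grouped by their cleaned-up key instead of being looked up again with grep. This is a sketch under assumptions: find_repeated_countries is a hypothetical name, and it swaps the final grep step for exact matching on the key, so a key such as "niger" would not also pull in "Nigeria".

```r
# Hypothetical helper: group the distinct forms by a normalized key and
# return the groups that contain more than one form.
find_repeated_countries <- function(x) {
  lv     <- unique(as.character(x))
  tokens <- strsplit(tolower(lv), " ", fixed = TRUE)
  # Keep alphabetic-only tokens, so parts such as "1a" are dropped
  key <- vapply(tokens, function(tk) {
    paste(tk[grepl("^[[:alpha:]]+$", tk)], collapse = " ")
  }, character(1))
  groups <- split(lv, key)                              # forms sharing a key
  unlist(groups[lengths(groups) > 1], use.names = FALSE)
}

countries <- c("United Kingdom", "united kingdom", "Netherlands",
               "Netherlands 1a", "1a Netherlands", "USA",
               "spain", "Spain", "Spain 1a", "spain 1a")
result <- find_repeated_countries(countries)
print(result)
```

Within each group the forms keep the order in which they first appear in the input, while the groups themselves are ordered by key.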

score 0 · Accepted Answer

Why not use grep? Its ignore.case argument is exactly what you need here.

> uch <- unique(as.character(data$countries))
> found <- sapply(seq(uch), function(i){
      if(!grepl("\\s|[0-9]", uch[i]))
          grep(uch[i], uch, ignore.case = TRUE, value = TRUE)
  })
> ff <- found[sapply(found, function(x) length(x) > 1)]
> unique(unlist(ff))
# [1] "Netherlands 1a" "Netherlands"    "spain" 
# [4] "Spain"          "Spain 1a"       "spain 1a"

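Here is my logic: take the unique factor levels of the column as a character vector. Then compare that vector against itself, using as patterns only the levels that contain no whitespace or digits. grep will catch the longer forms from those, but going the other way around would be harder. Finally, we keep only the patterns with more than one match and take the unique matches. So here is a function and a test run: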

find.matches <- function(column)
{
    uch <- unique(as.character(column))
    found <- sapply(seq(uch), function(i){
        if(!grepl("\\s|[0-9]", uch[i]))
            grep(uch[i], uch, ignore.case = TRUE, value = TRUE)
        })
    ff <- found[sapply(found, function(x) length(x) > 1)]
    unique(unlist(ff))
}

> dat <- data.frame(x = c("a", "a1", "a 1b", "c", "d"),
                    y = c("fac", "tor", "fac 1a", "tor1a", "fac"))
> sapply(dat, find.matches)
# $x
# [1] "a"    "a1"   "a 1b"
# 
# $y
# [1] "fac"    "fac 1a" "tor"    "tor1a"
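To make the matching step concrete, here is a minimal sketch of what a single grep call returns for one pattern, using the levels from the question. A pattern that matches only itself yields a result of length 1, which is why the length(x) > 1 filter drops it.

```r
uch <- c("UK", "Netherlands 1a", "Netherlands", "USA",
         "spain", "Spain", "Spain 1a", "spain 1a")

# "spain" contains no whitespace or digits, so it is used as a pattern
hits <- grep("spain", uch, ignore.case = TRUE, value = TRUE)
print(hits)
# [1] "spain"    "Spain"    "Spain 1a" "spain 1a"

# "UK" matches only itself, so the length filter removes it
single <- grep("UK", uch, ignore.case = TRUE, value = TRUE)
print(length(single))
# [1] 1
```

Note that each level is used as a regular expression here, so a level containing characters such as "." or "(" would be interpreted as a pattern rather than matched literally.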
