r - Approximate text matching and updating at the same time

Question

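I have a data frame, df1, with a column University_name that holds university names; it has 500,000 rows. I have another data frame, df2, with 2 columns, university_name and university_aliases, and 150 rows. I want to match each university alias in the university_aliases column to the corresponding university name in university_name_new.

Sample of df1$university_name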

university of auckland
the university of auckland
university of warwick - warwick business school
unv of warwick
seneca college of applied arts and technology
seneca college
univ of auckland

Sample of df2

University_Alias                  University_Name_new

univ of auckland                  university of auckland
universiry of auckland            university of auckland
auckland university               university of auckland
university of auckland            university of auckland
warwick university                university of warwick
warwick univercity                university of warwick
university of warwick             university of warwick
seneca college                    seneca college
unv of warwick                    university of warwick

I expect output like this

university of auckland
university of auckland
university of warwick
seneca college
seneca college

I am using the following code, but it does not work

 df$university_name[ grepl(df$university_name,df2$university_alias)] <- df2$university_name_new

score 1 · Accepted Answer

You can use sapply and str_extract (from the stringr package) to get the desired result.

library(stringr)

# create sample data
df1 <- data.frame(university_name = c('university of auckland',
                                      'the university of auckland',
                                      'university of warwick - warwick business school',
                                      'seneca college of applied arts and technology',
                                      'seneca college'), stringsAsFactors = FALSE)

# these are the values to match (the canonical names from df2)
vals <- c('university of auckland','university of warwick','seneca college')

# for each name, keep the canonical value found inside it
# (str_extract returns NA for every pattern that does not match)
df1$output <- sapply(df1$university_name, function(z) {
    vals[!is.na(str_extract(string = z, pattern = vals))]
}, USE.NAMES = FALSE)

print(df1)

                                  university_name                 output
1                          university of auckland university of auckland
2                      the university of auckland university of auckland
3 university of warwick - warwick business school  university of warwick
4   seneca college of applied arts and technology         seneca college
5                                  seneca college         seneca college
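The sapply call above returns a plain character vector only when every name contains exactly one canonical value. A minimal base R sketch of a stricter variant follows. The helper name first_match is hypothetical, and the extra row 'unv of warwick' is added here only to show the no-match case. It treats the candidates as literal substrings and yields NA when nothing matches.

```r
# Sample data: the last row contains none of the canonical names.
df1 <- data.frame(university_name = c('university of auckland',
                                      'the university of auckland',
                                      'university of warwick - warwick business school',
                                      'seneca college of applied arts and technology',
                                      'unv of warwick'),
                  stringsAsFactors = FALSE)

vals <- c('university of auckland', 'university of warwick', 'seneca college')

# Hypothetical helper: first candidate that occurs literally inside `name`,
# or NA when no candidate occurs.
first_match <- function(name, candidates) {
  hit <- vapply(candidates, function(p) grepl(p, name, fixed = TRUE), logical(1))
  if (any(hit)) candidates[which(hit)[1]] else NA_character_
}

# vapply guarantees exactly one string per row.
df1$output <- vapply(df1$university_name, first_match, character(1),
                     candidates = vals, USE.NAMES = FALSE)

print(df1)
```

Rows that come back as NA are the ones that still need an alias lookup or some form of approximate matching.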

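Update:

As I understand it, df2 already holds a one-to-one mapping from university_alias to university_name_new, so the problem comes down to checking whether each university_alias is present in df1; if it is not, we remove it.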

# check values for university_alias in university_name
maps2 <- as.character(df2$university_alias[which(df2$university_alias %in% df1$university_name)])

# remove unmatched rows from df2
df3 <- df2[df2$university_alias %in% maps2,]

print(df3)
            university_alias    university_name_new
1           univ of auckland university of auckland
4     university of auckland university of auckland
8             seneca college         seneca college
9             unv of warwick  university of warwick
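Once the mapping is restricted to aliases that occur in df1, the update itself can be sketched with match(). This assumes the aliases in df2 are unique; the column names follow the answer above, and the small data frames are illustrative only.

```r
# Illustrative data: a few names from the question and the matching aliases.
df1 <- data.frame(university_name = c('university of auckland',
                                      'the university of auckland',
                                      'unv of warwick',
                                      'seneca college',
                                      'univ of auckland'),
                  stringsAsFactors = FALSE)

df2 <- data.frame(university_alias = c('univ of auckland',
                                       'university of auckland',
                                       'seneca college',
                                       'unv of warwick'),
                  university_name_new = c('university of auckland',
                                          'university of auckland',
                                          'seneca college',
                                          'university of warwick'),
                  stringsAsFactors = FALSE)

# Position of each df1 name in the alias column (NA when it is not an alias).
idx <- match(df1$university_name, df2$university_alias)

# Replace matched names with the canonical name; keep the rest unchanged.
df1$university_name_new <- ifelse(is.na(idx),
                                  df1$university_name,
                                  df2$university_name_new[idx])

print(df1)
```

match() performs one vectorised lookup, so it should scale to the 500,000 rows mentioned in the question without an explicit loop.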

score 0 · Accepted Answer

You can do it like this

df2$University_Name_new[which(is.element(df2$University_Alias, df1$university_name))]
### which returns the following ####
[1] "university of auckland" "seneca college"

Now, for example, in the data you provided, "the university of auckland" is in df1$university_name but not in df2$University_Alias, which is why we get the following:

> which(is.element(df2$University_Alias, df1$university_name))
[1] 4 8

Indeed, of the values in df1$university_name, only "university of auckland" and "seneca college" appear in df2$University_Alias.
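For names that are not exact aliases, such as "the university of auckland", one possible fallback is approximate matching by edit distance. The sketch below uses base R's adist() and is an assumption about how this could be approached, not part of the answers above; the variable names are illustrative, and a distance cutoff would have to be tuned on real data.

```r
# Canonical names to map onto, and some names taken from df1.
canonical <- c('university of auckland', 'university of warwick', 'seneca college')
queries   <- c('university of auckland', 'the university of auckland', 'unv of warwick')

# Matrix of generalized Levenshtein distances: rows = queries, cols = canonical.
d <- adist(queries, canonical)

# Index and distance of the closest canonical name for each query.
idx       <- apply(d, 1, which.min)
nearest   <- canonical[idx]
best_dist <- d[cbind(seq_along(idx), idx)]

print(data.frame(query = queries, nearest = nearest, distance = best_dist))
```

A small distance suggests a safe replacement, while large distances are better left unmatched and reviewed by hand.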
