
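I have a dataset with one column containing names and one column describing what that person did during the day. I am using R to find out who met whom on that day. I created a vector containing the names in the dataset and used grepl in a loop to identify where those names appear in the column describing people's activities.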

name <- c("Dupont","Dupuy","Smith") 

activity <- c("On that day, he had lunch with Dupuy in London.", 
              "She had lunch with Dupont and then went to Brighton to meet Smith.", 
              "Smith remembers that he was tired on that day.")

met_with <- c("Dupont","Dupuy","Smith")

df<-data.frame(name, activity, met_with=NA)


for (i in 1:length(met_with)) {
df$met_with<-ifelse(grepl(met_with[i], df$activity), met_with[i], df$met_with)
}

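However, this solution is unsatisfactory for two reasons. When a person meets several other people (e.g. Dupuy in my example), I cannot extract more than one name. I also cannot tell R to skip the person's own name when the activity column uses that name instead of a pronoun (e.g. Smith).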

Ideally, I would like df to look like:

  name         activity                                            met_with                             
  Dupont       On that day, he had lunch with Dupuy in London.     Dupuy
  Dupuy        She had lunch with Dupont and then (...).           Dupont Smith
  Smith        Smith remembers that he was tired on that day.      NA

I am cleaning the strings in order to build an edge list and a node list for a network analysis later on.

Thanks


2 Answers


Same logic as @Gki, but using stringr functions and mapply instead of a loop.

library(stringr)

pat <- str_c('\\b', df$name, '\\b', collapse = '|')
df$met_with <- mapply(function(x, y) str_c(setdiff(x, y), collapse = ' '), 
       str_extract_all(df$activity, pat), df$name)

df

#    name                                                           activity
#1 Dupont                    On that day, he had lunch with Dupuy in London.
#2  Dupuy She had lunch with Dupont and then went to Brighton to meet Smith.
#3  Smith                     Smith remembers that he was tired on that day.

#      met_with
#1        Dupuy
#2 Dupont Smith
#3             
Answered 2021-07-07T13:31:40.943

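You can use setdiff to exclude the name that belongs to the current row, and gregexpr together with regmatches to extract the names that match. It may also be worth wrapping the names in \\b so that they only match as whole words.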

for(i in seq_len(nrow(df))) {
  df$met_with[i] <- paste(regmatches(df$activity[i],
   gregexpr(paste(setdiff(name, df$name[i]), collapse="|"),
   df$activity[i]))[[1]], collapse = " ")
}

df
#    name                                                           activity     met_with
#1 Dupont                    On that day, he had lunch with Dupuy in London.        Dupuy
#2  Dupuy She had lunch with Dupont and then went to Brighton to meet Smith. Dupont Smith
#3  Smith                     Smith remembers that he was tired on that day.             
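
The \\b suggestion above could be sketched as follows. This is only an illustration under a few assumptions that are not in the original answer: the helper variables others, pat and hits are hypothetical names, perl = TRUE is passed to gregexpr, and rows without any match are set to NA (as in the desired output of the question) rather than to an empty string.

```r
# Rebuild the example data from the question
name <- c("Dupont", "Dupuy", "Smith")
activity <- c("On that day, he had lunch with Dupuy in London.",
              "She had lunch with Dupont and then went to Brighton to meet Smith.",
              "Smith remembers that he was tired on that day.")
df <- data.frame(name, activity, met_with = NA_character_)

for (i in seq_len(nrow(df))) {
  # All names except the one belonging to this row
  others <- setdiff(name, df$name[i])
  # \\b anchors each name at word boundaries, so "Smith" would not
  # match inside a longer word such as "Smithson"
  pat <- paste0("\\b(", paste(others, collapse = "|"), ")\\b")
  hits <- regmatches(df$activity[i],
                     gregexpr(pat, df$activity[i], perl = TRUE))[[1]]
  # Assumption: keep NA when nobody else is mentioned
  df$met_with[i] <- if (length(hits) > 0) {
    paste(unique(hits), collapse = " ")
  } else {
    NA_character_
  }
}

df$met_with
```

Whether NA or an empty string is preferable for the "met nobody" case depends on how the column is processed afterwards.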

Another approach, using Reduce, could be:

df$met_with <- Reduce(function(x, y) {
  i <- grepl(y, df$activity, fixed = TRUE) & y != df$name
  x[i] <- lapply(x[i], `c`, y)
  x
}, unique(name), vector("list", nrow(df)))

df
#    name                                                           activity      met_with
#1 Dupont                    On that day, he had lunch with Dupuy in London.         Dupuy
#2  Dupuy She had lunch with Dupont and then went to Brighton to meet Smith. Dupont, Smith
#3  Smith                     Smith remembers that he was tired on that day.          NULL
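
Since the question mentions building an edge list and a node list, a list column like the one above can be flattened with rep and lengths. This is a minimal sketch, not part of the original answer: the met_with list is typed in by hand to mirror the output shown, and the names edges, nodes, from and to are assumptions chosen for illustration.

```r
# Hypothetical starting point: names plus a list column shaped like the
# Reduce result above (NULL for a person who met nobody)
name <- c("Dupont", "Dupuy", "Smith")
met_with <- list("Dupuy", c("Dupont", "Smith"), NULL)

# One row per (person, person met); a NULL entry has length 0 and
# therefore contributes no rows
edges <- data.frame(
  from = rep(name, lengths(met_with)),
  to   = unlist(met_with),
  stringsAsFactors = FALSE
)

# Node list: every person once, including those without any edge
nodes <- data.frame(name = unique(name), stringsAsFactors = FALSE)

edges
```

The edges here are directed (from the row's person to the person mentioned); for an undirected network, duplicate pairs would still need to be merged.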
Answered 2021-07-07T12:50:42.767