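I have a data frame and need to create a flag that indicates the rows where there is a partial match between two columns. Here is the code and some dummy data: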

doc_id <- c("doc1","doc1","doc2","doc3","doc3","doc4","doc4")
word <- c("apple","apples","chicken","banana","bananas","veggie","veggies")
text <- c("yesterday I ate apples", "yesterday I ate apples", "yesterday I ate chicken", "yesterday I ate bananas", "yesterday I ate bananas", "yesterday I ate veggies", "yesterday I ate veggies") 
mydata <- data.frame(doc_id,word,text,stringsAsFactors = FALSE)

The expected result is the same data frame with an additional column showing whether the match between word and text is a partial match:

doc_id <- c("doc1","doc1","doc2","doc3","doc3","doc4","doc4")
word <- c("apple","apples","chicken","banana","bananas","veggie","soup")
text <- c("yesterday I ate apples", "yesterday I ate apples", "yesterday I ate chicken", "yesterday I ate bananas", "yesterday I ate bananas", "yesterday I ate veggies", "yesterday I ate soup") 
partial_match <- c("1","0","0","1","0","1","0")
mydata2 <- data.frame(doc_id,word,text,partial_match,stringsAsFactors = FALSE)

I tried

str_detect(mydata$word, mydata$text)

as well as similar attempts with functions such as charmatch, pmatch, grep and grepl, without success.

The real data contains thousands of records, so the solution should be scalable.

Thanks.

1 Answer


After a long time trying, I learned more about string manipulation and got it. It might not be the most efficient way, but it does work.

OBS: I marked some comments with "¹", "²" and "³" so that I can explain them afterwards.

parcial.m = numeric(nrow(mydata2)) # Pre-allocate the result vector

for (i in seq_len(nrow(mydata2))) {
  pattern = paste("([^\n]*)(", mydata2$word[i], ")([^\n]*)", sep = "")
  # ¹

  split = unlist(strsplit(mydata2$text[i], "[ [:punct:]]"))
  # Split the text by punctuation and spaces, i.e. by words

  word = grep(mydata2$word[i], split, value = TRUE)
  # Keep only the token(s) of the text that contain the searched word (the 'original' word)

  if (length(grep(mydata2$word[i], word)) == 0) {
    parcial.m[i] = 0
    # ²
  } else {
    # any() keeps the result a single value if several tokens contain the word
    parcial.m[i] = !any((gsub(pattern, "\\1", word) == "") & (gsub(pattern, "\\3", word) == ""))
    # ³
  }
}

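¹: The pattern is: a group (marked by (...)) of 0 or more (hence the *) characters other than a newline (hence [^\n]: \n is the newline and ^ means every character except it), followed by a group containing the searched word, followed by a third group identical to the first.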
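To see what the three groups capture, here is a minimal sketch on a single token. The names `target` and `token` are illustrative only and do not appear in the code above.

```r
# Illustrative values: the searched word and one token taken from the text
target <- "apple"
token  <- "apples"

# Same three-group pattern as in the loop
pattern <- paste0("([^\n]*)(", target, ")([^\n]*)")

# Replace the whole match by one group at a time to inspect each group
before  <- sub(pattern, "\\1", token)  # text before the searched word
matched <- sub(pattern, "\\2", token)  # the searched word itself
after   <- sub(pattern, "\\3", token)  # text after the searched word

print(c(before = before, matched = matched, after = after))
```

For this token, group 1 is empty, group 2 is "apple" and group 3 is "s", which is what makes the match a partial one.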

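²: If there is no match at all, there is no partial match either, so we want a value of 0. We pick out these cases using the fact that grep(mydata2$word[i], word) returns an integer vector of length 0 when nothing matches.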
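A small illustration of that behaviour, assuming a hypothetical `tokens` vector that does not contain the searched word:

```r
# Hypothetical tokens from a text that does not mention "apple"
tokens <- c("yesterday", "I", "ate", "chicken")

# grep() returns the indices of the matching elements; here there are none
hits <- grep("apple", tokens)

# A zero-length result is how the "no match" case is detected
no_match <- length(hits) == 0
print(no_match)
```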

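³: "\\1" and "\\3" select groups 1 and 3 of the pattern. If it is a perfect match, then after we 'take away' the searched word (group 2) from word (which I call the 'original word'), there is nothing left over, so groups 1 and 3 are both empty (i.e. == ""). That line of code tests whether both groups are empty at the same time (an exact match) and negates it (hence the !). Since the if statement has already flagged the non-matches as 0, whatever remains is a partial match.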
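Since the question asks for something that scales, the same idea can probably be applied to whole columns without building a regex per row. The sketch below is one possible alternative, not the method explained above. The helper name `flag_partial` is hypothetical, and it assumes that a partial match means the word occurs inside some token of the text while no token equals it exactly. It has not been benchmarked on large data.

```r
# Hypothetical helper: 1 if `word` is contained in a token of `text`
# without being equal to any token, 0 otherwise
flag_partial <- function(word, text) {
  # Split every text into tokens on spaces and punctuation (vectorised)
  tokens <- strsplit(text, "[ [:punct:]]+")

  # Does any token contain the word as a literal substring?
  contains <- mapply(function(w, tk) any(grepl(w, tk, fixed = TRUE)),
                     word, tokens, USE.NAMES = FALSE)

  # Is any token exactly equal to the word?
  exact <- mapply(function(w, tk) w %in% tk,
                  word, tokens, USE.NAMES = FALSE)

  as.integer(contains & !exact)
}

# Dummy data from the question
word <- c("apple", "apples", "chicken", "banana", "bananas", "veggie", "soup")
text <- c("yesterday I ate apples", "yesterday I ate apples",
          "yesterday I ate chicken", "yesterday I ate bananas",
          "yesterday I ate bananas", "yesterday I ate veggies",
          "yesterday I ate soup")

partial_match <- flag_partial(word, text)
print(partial_match)
```

Matching with fixed = TRUE treats the word as a literal string, so words containing regex metacharacters would not need escaping in this sketch.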

answered 2020-10-16T22:45:05.433