string - String matching to estimate similarity

Question

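I want to analyze a field that is up to 100 characters long and estimate a similarity percentage. For example, for the same question, "What's your opinion on smartphones?",

Person A: "Best way to waste money"

Person B: "Amazing stuff. lets you stay connected all the time"

Person C: "Instrument to waste money and time"

Of these, A and C sound similar just by matching individual words. I am trying to do something like this in R to start with, and then extend it to match combinations of words such as "Best", "Best Way", "Best Way Waste", and so on. I am new to text analytics and R, and I do not know the proper names of these methods, so I cannot search for them effectively.

Please guide me with your suggestions and references. Thanks in advance.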

score 4 · Accepted Answer

Here is a potential solution for computing the percent similarity by hand.

a <- "Best way to waste money"
b <- "Amazing stuff. lets you stay connected all the time"
c <- "Instrument to waste money and time"

# Drop information that presumably isn't important (punctuation, capital
# letters), then split the string into separate words.
clean_words <- function(string1){
  lower <- tolower(string1)
  no.punct <- gsub("[[:punct:]]", "", lower)
  split <- strsplit(no.punct, split=" ")  # a list holding one character vector
  return(split)
}

a <- clean_words(a)
b <- clean_words(b)
c <- clean_words(c)

# How similar is string 1 to string 2: the fraction of words in str1 that also occur in str2.
# NOTE: the order is important, i.e. sim.per(b, c) is different from sim.per(c, b),
# because the denominator is the word count of str1 only.
sim.per <- function(str1, str2){
  sim <- length(intersect(str1[[1]], str2[[1]]))  # intersect() returns the distinct common words; length() counts them
  total <- length(str1[[1]])
  per <- sim/total
  return(per)
}

#test
sim.per(b, c)
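Because the score above depends on which string comes first, a symmetric variant may be easier to compare across respondents. One common choice is the Jaccard index: shared words divided by all distinct words in either string. The following is only a minimal sketch under that assumption; `jaccard_words` is a hypothetical helper name, and it works on the raw strings rather than on the pre-split lists used above.

```r
# Hypothetical helper (not part of the code above): symmetric word-overlap
# similarity, i.e. |intersection| / |union| of the distinct words.
jaccard_words <- function(x, y) {
  # Lower-case, strip punctuation, split on whitespace, keep distinct words.
  words <- function(s) {
    s <- gsub("[[:punct:]]", "", tolower(s))
    unique(strsplit(s, "[[:space:]]+")[[1]])
  }
  wx <- words(x)
  wy <- words(y)
  length(intersect(wx, wy)) / length(union(wx, wy))
}

# A and C share "to", "waste" and "money" out of 8 distinct words.
jaccard_words("Best way to waste money",
              "Instrument to waste money and time")
# [1] 0.375
```

With this definition, swapping the two arguments gives the same score.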

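I hope this helps! To search for combinations of words you will have to do some more magic. I would try editing the question to show exactly what you are looking for; you might have more luck getting an answer that way!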
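One possible starting point for word combinations is word n-grams: runs of 1, 2, 3, ... consecutive words. The sketch below is an assumption about what might be wanted, not a definitive method; `word_ngrams` and `ngram_overlap` are hypothetical helper names, and only base R is used. Note that n-grams are contiguous, so a combination like "Best Way Waste" would only appear if filler words such as "to" were removed first.

```r
# Hypothetical helper: all distinct word n-grams of sizes 1..max_n in a string.
word_ngrams <- function(s, max_n = 3) {
  s <- gsub("[[:punct:]]", "", tolower(s))
  w <- strsplit(s, "[[:space:]]+")[[1]]
  out <- character(0)
  for (n in seq_len(min(max_n, length(w)))) {
    starts <- seq_len(length(w) - n + 1)  # first word position of each n-gram
    grams <- vapply(starts,
                    function(i) paste(w[i:(i + n - 1)], collapse = " "),
                    character(1))
    out <- c(out, grams)
  }
  unique(out)
}

# Hypothetical helper: shared n-grams divided by all distinct n-grams (symmetric).
ngram_overlap <- function(x, y, max_n = 3) {
  gx <- word_ngrams(x, max_n)
  gy <- word_ngrams(y, max_n)
  length(intersect(gx, gy)) / length(union(gx, gy))
}

word_ngrams("Best way to waste money", max_n = 2)
ngram_overlap("Best way to waste money",
              "Instrument to waste money and time")
```

Longer shared phrases such as "to waste money" add to the score on top of the single-word matches, so two answers that reuse a whole phrase rank higher than two that merely share scattered words.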

As for reference material, check out "Handling and Processing Strings in R" by Gaston Sanchez; it is great.
