r - Text mining with Jaro-Winkler fuzzy matching in R

Question

I am trying to do some distance matching in R and am struggling to get usable output.

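I have a data frame, terms, containing 5 text strings along with a category for each string. I have a second data frame, notes, containing 10 misspelled words along with a NoteID.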

I would like to use a distance algorithm to compare each of my 5 terms against each of the 10 notes, to try to catch simple spelling errors. I have tried:

near_match <- subset(notes, jarowinkler(notes$word, terms$word) > 0.9)

   NoteID    Note
5      e5 thought
10     e5   tough

and

jarowinkler(notes$word, terms$word)

[1] 0.8000000 0.7777778 0.8266667 0.8833333 0.9714286 0.8000000 0.8000000 0.8266667 0.8833333 0.9500000

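The first attempt is almost what I need; it is only missing the word from terms that triggered the match. The second returns 10 scores, but I am not sure whether the algorithm checks each of the 5 terms against each of the 10 notes in turn and returns only the closest match (the highest score).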

Can jarowinkler() do what I want, or is there a better option? How do I change the above to get the output I am after?

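I am fairly new to R, so I would appreciate any help in understanding how the algorithm generates its scores and what approach would give me the output I want.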

Example data frames below.

Thanks

> notes
   NoteID    word
1      a1     hit
2      b2     hot
3      c3   shirt
4      d4    than
5      e5 thought
6      a1     hat
7      b2     get
8      c3   shirt
9      d4    than
10     e5   tough

> terms
  Category   word
1        a    hot
2        b    got
3        a   shot
4        d   that
5        c though

score 1 · Accepted Answer

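Your data.frames (note that row 9 of notes is entered as "that" here, whereas the question lists "than", and the categories differ slightly from the question's):

notes <- data.frame(NoteID = c("a1", "b2", "c3", "d4", "e5", "a1", "b2", "c3", "d4", "e5"),
                    word = c("hit", "hot", "shirt", "than", "thought", "hat", "get", "shirt", "that", "tough"))
terms <- data.frame(Category = c("a", "b", "c", "d", "e"),
                    word = c("hot", "got", "shot", "that", "though"))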

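Use stringdistmatrix (from the stringdist package) with method "jw" (Jaro-Winkler). Note that with the default p = 0 this method returns the plain Jaro distance; pass p = 0.1 if you want the Winkler prefix adjustment. These are distances, so 0 means identical, unlike the similarity scores returned by jarowinkler().

library(stringdist)
dist <- stringdistmatrix(notes$word, terms$word, method = "jw")
row.names(dist) <- as.character(notes$word)
colnames(dist) <- as.character(terms$word)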

Now you have all the distances:

dist
              hot       got       shot       that     though
hit     0.2222222 0.4444444 0.27777778 0.27777778 0.50000000
hot     0.0000000 0.2222222 0.08333333 0.27777778 0.33333333
shirt   0.4888889 1.0000000 0.21666667 0.36666667 0.54444444
than    0.4722222 1.0000000 0.50000000 0.16666667 0.38888889
thought 0.3571429 0.5158730 0.40476190 0.40476190 0.04761905
hat     0.2222222 0.4444444 0.27777778 0.08333333 0.50000000
get     0.4444444 0.2222222 0.47222222 0.47222222 0.50000000
shirt   0.4888889 1.0000000 0.21666667 0.36666667 0.54444444
that    0.2777778 0.4722222 0.33333333 0.00000000 0.38888889
tough   0.4888889 0.4888889 0.51666667 0.51666667 0.05555556
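
To see where these numbers come from, here is a minimal base-R sketch of the Jaro and Jaro-Winkler similarity. The helper names jaro_sim and jaro_winkler_sim are made up for illustration and are not taken from stringdist or any other package; the prefix weight p = 0.1 and the four-character prefix cap are the commonly used defaults, assumed here. Each distance in the matrix above is 1 minus the Jaro similarity. The ten scores in the question appear to be Jaro-Winkler similarities computed element-wise, with the 5 terms recycled over the 10 notes, rather than over all 50 pairs.

```r
# Illustrative helpers only; not the implementation used by any package.
jaro_sim <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  if (n == 0 && m == 0) return(1)
  if (n == 0 || m == 0) return(0)
  # Characters only count as matching if they are at most `window` positions apart.
  window <- max(floor(max(n, m) / 2) - 1, 0)
  x_hit <- rep(FALSE, n)
  y_hit <- rep(FALSE, m)
  for (i in seq_len(n)) {
    lo <- max(1, i - window)
    hi <- min(m, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!y_hit[j] && x[i] == y[j]) {
        x_hit[i] <- TRUE
        y_hit[j] <- TRUE
        break
      }
    }
  }
  matches <- sum(x_hit)
  if (matches == 0) return(0)
  # Transpositions: matched characters that appear in a different order.
  transpositions <- sum(x[x_hit] != y[y_hit]) / 2
  (matches / n + matches / m + (matches - transpositions) / matches) / 3
}

jaro_winkler_sim <- function(a, b, p = 0.1) {
  sim <- jaro_sim(a, b)
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # Length of the common prefix, capped at 4 characters.
  k <- min(length(x), length(y), 4)
  same <- x[seq_len(k)] == y[seq_len(k)]
  prefix <- if (all(same)) k else which(!same)[1] - 1
  sim + prefix * p * (1 - sim)
}

# Word vectors as listed in the question (row 9 of notes is "than" there).
notes_word <- c("hit", "hot", "shirt", "than", "thought",
                "hat", "get", "shirt", "than", "tough")
terms_word <- c("hot", "got", "shot", "that", "though")

# Distance as in the matrix: 1 - Jaro similarity.
1 - jaro_sim("hit", "hot")         # 0.2222222
1 - jaro_sim("thought", "though")  # 0.04761905

# Element-wise comparison, with terms recycled to the length of notes.
scores <- mapply(jaro_winkler_sim, notes_word,
                 rep_len(terms_word, length(notes_word)))
round(unname(scores), 7)
```

The last line reproduces the ten values shown in the question, which suggests that jarowinkler(notes$word, terms$word) only ever compared note i with term ((i - 1) mod 5) + 1. That is why a full matrix of all pairs is needed to find the best term for each note.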

Find the term closest to each note:

output <- cbind(notes,
                word_close = terms[as.numeric(apply(dist, 1, which.min)), "word"],
                dist_min = apply(dist, 1, min))
output
   NoteID    word word_close   dist_min
1      a1     hit        hot 0.22222222
2      b2     hot        hot 0.00000000
3      c3   shirt       shot 0.21666667
4      d4    than       that 0.16666667
5      e5 thought     though 0.04761905
6      a1     hat       that 0.08333333
7      b2     get        got 0.22222222
8      c3   shirt       shot 0.21666667
9      d4    that       that 0.00000000
10     e5   tough     though 0.05555556
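
Since each term also has a Category, the same row index can carry it across to the matched note. This is a small sketch, assuming the notes and terms data frames defined at the top of this answer. The names d, idx and matched are arbitrary, and d is used instead of dist only to avoid masking the base function of that name.

```r
library(stringdist)

notes <- data.frame(
  NoteID = c("a1", "b2", "c3", "d4", "e5", "a1", "b2", "c3", "d4", "e5"),
  word   = c("hit", "hot", "shirt", "than", "thought",
             "hat", "get", "shirt", "that", "tough"),
  stringsAsFactors = FALSE
)
terms <- data.frame(
  Category = c("a", "b", "c", "d", "e"),
  word     = c("hot", "got", "shot", "that", "though"),
  stringsAsFactors = FALSE
)

# All pairwise distances: one row per note, one column per term.
d <- stringdistmatrix(notes$word, terms$word, method = "jw")

# Index of the nearest term for every note (the first one wins on ties).
idx <- apply(d, 1, which.min)

matched <- data.frame(
  NoteID     = notes$NoteID,
  word       = notes$word,
  word_close = terms$word[idx],
  Category   = terms$Category[idx],
  dist_min   = d[cbind(seq_len(nrow(d)), idx)]
)
matched
```

Building the result with data.frame() rather than cbind() avoids ending up with two columns both called word.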

If you only want word_close to hold terms whose distance is below some threshold (0.1 in this case), you can do this:

output[output$dist_min >= 0.1, c("word_close", "dist_min")] <- NA
output
   NoteID    word word_close   dist_min
1      a1     hit       <NA>         NA
2      b2     hot        hot 0.00000000
3      c3   shirt       <NA>         NA
4      d4    than       <NA>         NA
5      e5 thought     though 0.04761905
6      a1     hat       that 0.08333333
7      b2     get       <NA>         NA
8      c3   shirt       <NA>         NA
9      d4    that       that 0.00000000
10     e5   tough     though 0.05555556
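
stringdist also provides amatch(), which does the nearest-match lookup and the cutoff in one call. This is a sketch on the word vectors from this answer; notes_word, terms_word and idx are arbitrary names. One difference to be aware of: as far as I understand the documentation, maxDist keeps matches at a distance up to and including the cutoff, whereas the code above drops a distance of exactly 0.1. For this data no distance sits on the boundary, so the result should be the same.

```r
library(stringdist)

# Word vectors as defined in this answer (row 9 is "that" here).
notes_word <- c("hit", "hot", "shirt", "than", "thought",
                "hat", "get", "shirt", "that", "tough")
terms_word <- c("hot", "got", "shot", "that", "though")

# Position in terms_word of the closest term within the cutoff,
# or NA where no term is close enough.
idx <- amatch(notes_word, terms_word, method = "jw", maxDist = 0.1)

data.frame(word = notes_word, word_close = terms_word[idx])
```

This avoids building the full matrix, at the cost of not returning the distance itself.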
