r - Identifying synonymous rows in a text column of a data frame using R

Question

Suppose ABC is a data frame, as follows:

ABC <- data.frame(Column1 = c(1.222, 3.445, 5.621, 8.501, 9.302), 
                  Column2 = c(654231, 12347, -2365, 90000, 12897), 
                  Column3 = c('A1', 'B2', 'E3', 'C1', 'F5'), 
                  Column4 = c('I bought it', 'The flower has a beautiful fragrance', 'It was bought by me', 'I have bought it', 'The flower smells good'), 
                  Column5 = c('Good', 'Bad', 'Ok', 'Moderate', 'Perfect'))

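My goal is to find synonymous strings in Column4. In this case, 'I bought it', 'It was bought by me' and 'I have bought it' are synonymous or similar strings, and 'The flower has a beautiful fragrance' and 'The flower smells good' convey a similar meaning.

I tried IVR's approach from the following thread and got stuck: Find similar texts based on paraphrase detection

When I run the HLS.Extract code block, I get the following error message: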

Error in strsplit(PlainTextDocument(synonyms(word)), ",") : non-character Argument

Wrapping the input in as.character does not solve the problem either:

Syns = function(word){  
    word <- as.character(word) ###
    wl    =   gsub("(.*[[:space:]].*)","",      
                   gsub("^c\\(|[[:punct:]]+|^[[:space:]]+|[[:space:]]+$","",  
                        unlist(strsplit(PlainTextDocument(synonyms(word)),","))))
    wl = wl[wl!=""] 
    return(wl)     
  }

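What is going wrong?
Is there a better way to code this in R, and additionally to create a new column that holds, for example, the number 1 for every row in the first group of synonymous strings, 2 for the next group, and so on?
Would it also work for German text?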

score 0 · Accepted Answer

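strsplit needs a character vector, but PlainTextDocument(synonyms(word)) returns a document object, so converting the input word with as.character changes nothing. Converting the result of PlainTextDocument(synonyms(word)) to character instead solved the problem, as follows: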

Syns = function(word){ 
    wl    =   gsub("(.*[[:space:]].*)","",      
                   gsub("^c\\(|[[:punct:]]+|^[[:space:]]+|[[:space:]]+$","",  
                        unlist(strsplit(as.character(PlainTextDocument(synonyms(word))),",")))) 
    wl = wl[wl!=""] 
    return(wl)     
  }
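
The question also asks for a new column that numbers each group of similar strings. Below is a minimal sketch of one way to do that in base R, assuming that plain word overlap (Jaccard similarity) is an acceptable stand-in for real synonym detection. It is not the method from the linked thread, and the names tokens, jaccard, sim and Group as well as the 0.99 cutoff are illustrative choices, not anything from the original code:

```r
# Example data from the question (only the text column matters here).
ABC <- data.frame(
  Column3 = c("A1", "B2", "E3", "C1", "F5"),
  Column4 = c("I bought it",
              "The flower has a beautiful fragrance",
              "It was bought by me",
              "I have bought it",
              "The flower smells good"),
  stringsAsFactors = FALSE
)

# Split each sentence into a set of unique lower-case words.
tokens <- lapply(strsplit(tolower(ABC$Column4), "[^[:alpha:]]+"),
                 function(w) unique(w[w != ""]))

# Jaccard similarity: shared words divided by all distinct words.
# (Assumes no sentence is empty; an empty pair would give 0/0.)
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Pairwise similarity matrix over all rows.
n <- length(tokens)
sim <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    sim[i, j] <- jaccard(tokens[[i]], tokens[[j]])
  }
}

# Single-linkage clustering on 1 - similarity. Cutting just below 1 keeps
# rows together whenever a chain of word overlaps connects them.
hc <- hclust(as.dist(1 - sim), method = "single")
ABC$Group <- cutree(hc, h = 0.99)  # 0.99 is an arbitrary illustrative cutoff

print(ABC[, c("Column4", "Group")])
```

On the example data, rows 1, 3 and 4 end up in one group and rows 2 and 5 in the other. Nothing here is English-specific, so it should run on German text as well, but it only sees shared words: two sentences that mean the same thing without sharing any word would not be grouped, and unrelated sentences that share only common words such as "the" would be.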
