r - Counting how many times a pair of words appears together in a text file using R

Question

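I have a function that takes a text document and two words I want to find in that text, and I am trying to work out the probability that the two words appear next to each other. The first thing I did was build the pairs of adjacent words. My document is called "words", and the function takes 3 arguments: the document, word1, and word2. I want to find out how many times the two words appear next to each other in the text.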

pairs <- c()
# Iterates through and creates every pair of adjacent words
for (i in 1:(length(words)-1)) {
  temp <- paste(words[i], words[i+1], sep = ":") # Temporarily group adjacent words together with a : in between
  temp <- sort(strsplit(temp, ":")[[1]]) # Sort to get them lexically organized
  pairs[i] <- paste(temp[1], temp[2], sep = ":") # Store this pair in the vector
}

Now I am trying to make a counter that counts how many times my 2 specified words appear together. So far I have tried:

pairs2 <- 0
for (i in pairs) {
  if (i == word1:word2 | i == word2:word1) {
    pairs2 <- pairs2 + 1
  }
}

But I get the error:

Error in word1:word2 : NA/NaN argument

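How do I make R understand that word1:word2 and word2:word1 should each stand for a pair made of my two specific words, and add +1 to the counter whenever the right combination turns up?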

score 0 · Accepted Answer

Here is what I would do. Assuming you have a vector of words called words:

library(dplyr)

# use lead from dplyr to create all pairs of adjacent words;
# the last element pairs the final word with NA, so drop it
word.pairs <- head(paste(words, lead(words), sep = ":"), -1)

# use dplyr to count each distinct pair of words
word.pairs <- data.frame(word.pairs = word.pairs) %>%
  group_by(word.pairs) %>%
  summarise(Count = n())

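This gives you a count for every word pair in the vector. You can then use dplyr's filter() and arrange() functions to sort the data or pick out a specific pair of interest. For example, to find the count for word1 followed by word2: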

word.pairs %>% filter(word.pairs == paste(word1, word2, sep=":"))
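The same idea can be sketched in base R without dplyr. This is only an illustration under assumptions: the sample `words` vector and the values of `word1` and `word2` are made up, and the names `pair.counts`, `keys`, `n.together` and `p.together` are hypothetical. It also shows why `word1:word2` fails in the question: `:` is R's sequence operator, so the lookup key has to be built as a string with paste().

```r
# Hypothetical sample data, not taken from the question
words <- c("hello", "my", "name", "is", "my", "name", "is", "tony")
word1 <- "name"
word2 <- "my"

# Adjacent pairs: element i joined with element i + 1
word.pairs <- paste(head(words, -1), tail(words, -1), sep = ":")

# Tabulate every distinct pair
pair.counts <- table(word.pairs)

# Build both orderings as strings with paste(), not word1:word2
keys <- c(paste(word1, word2, sep = ":"), paste(word2, word1, sep = ":"))

# Number of adjacent positions holding the two words, in either order
n.together <- sum(word.pairs %in% keys)

# Share of all adjacent pairs that are word1/word2
p.together <- n.together / length(word.pairs)
```

Matching against both orderings makes the count independent of which word comes first, which the filter() call above does not do on its own.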

score 0 · Accepted Answer

If your document has already been broken down into a list of word pairs, you do not need a for loop.

For example, if you have a string such as:

test <- "hello my name is my name is tony"

and your function breaks it down into a list of word pairs like this:

pairs <- list("hello my", "my name", "name is", "is my", "my name", "name is", "is tony")

You can get the number of times "my" and "name" appear together with:

appearance <- length(pairs[pairs == "my name"|pairs == "name my"]) # 2

Or, in your case:

pairs2 <- length(pairs[pairs == paste(word1, word2) | pairs == paste(word2, word1)])
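Putting the pieces together, here is a minimal sketch that goes from a raw string to the count. The helper name `count_adjacent` is hypothetical, and it assumes the text is separated by whitespace with no punctuation to strip.

```r
# Hypothetical helper; the name and arguments are illustrative only
count_adjacent <- function(text, word1, word2) {
  # Split on runs of whitespace (assumes no punctuation handling is needed)
  words <- strsplit(text, "\\s+")[[1]]
  # Fewer than two words means there are no adjacent pairs
  if (length(words) < 2) return(0L)
  # Pair each word with its successor, separated by a space
  pairs <- paste(head(words, -1), tail(words, -1))
  # Count matches in either order
  sum(pairs == paste(word1, word2) | pairs == paste(word2, word1))
}

test <- "hello my name is my name is tony"
count_adjacent(test, "my", "name")    # 2
count_adjacent(test, "name", "my")    # 2
count_adjacent(test, "tony", "hello") # 0
```

Using sum() on the logical vector gives the same number as taking length() of the subset, and it works whether the pairs are stored in a list or a character vector.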
