r - Forming bigrams without stopwords in R

Question

I have recently run into a problem while text mining in R. The goal is to find meaningful keywords in news articles, such as "smart car" and "data mining".

Suppose I have the following string:

"IBM have a great success in the computer industry for the past decades..."

After removing the stopwords ("have", "a", "in", "the", "for"), I get:

"IBM great success computer industry past decades..."

As a result, bigrams such as "success computer" or "industry past" appear.

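What I actually need are bigrams with no stopword between the two words in the original text. "computer industry" is a clear example of the kind of bigram I want.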

Part of my code is below:

corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace) 
corpus <- tm_map(corpus, stemDocument)
NgramTokenizer = function(x) {unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)}
dtm <- TermDocumentMatrix(corpus, control = list(tokenize = NgramTokenizer))

Is there a way to avoid results like "success computer" when counting term frequencies (TF)?

score 3 · Accepted Answer

Note: Edited on 12 October 2017 to reflect the new quanteda syntax.

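You can do this in quanteda. It can remove stopwords while leaving padding in their place, so bigrams are formed only from words that were adjacent in the original text.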

txt <- "IBM have a great success in the computer industry for the past decades..."

library("quanteda")
myDfm <- tokens(txt) %>%
    tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
    tokens_remove(stopwords("english"), padding  = TRUE) %>%
    tokens_ngrams(n = 2) %>%
    dfm()

featnames(myDfm)
# [1] "great_success"     "computer_industry" "past_decades"

What it does:

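Forms the tokens.
Removes punctuation using a regular expression, leaving an empty pad wherever a token was removed. This ensures that ngrams are never formed from tokens that were not adjacent in the first place because punctuation separated them.
Removes stopwords, also leaving pads in their place.
Forms the bigrams.
Constructs the document-feature matrix.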
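
The padding idea in the steps above can also be sketched in base R, without quanteda. This is only an illustration under stated assumptions: `padded_bigrams` is a hypothetical helper, the regular-expression tokenizer is much cruder than quanteda's, and the stopword list is hand-picked from the question rather than taken from `stopwords("english")`.

```r
# Hypothetical helper (not part of quanteda): bigrams from originally adjacent words only.
padded_bigrams <- function(text, stops) {
  # Split into runs of letters/digits and runs of punctuation, then lowercase.
  toks <- tolower(regmatches(text, gregexpr("[[:alnum:]]+|[[:punct:]]+", text))[[1]])
  # Replace punctuation and stopwords with an empty pad instead of dropping them,
  # so the surviving tokens keep their original positions.
  is_pad <- grepl("^[[:punct:]]+$", toks) | toks %in% stops
  toks[is_pad] <- ""
  # Pair each token with its right-hand neighbour; keep only pairs without a pad.
  left  <- head(toks, -1)
  right <- tail(toks, -1)
  keep  <- left != "" & right != ""
  paste(left[keep], right[keep], sep = "_")
}

txt <- "IBM have a great success in the computer industry for the past decades..."
# Assumption: a hand-picked stopword list, just the words named in the question.
stops <- c("have", "a", "in", "the", "for")
bigrams <- padded_bigrams(txt, stops)
print(bigrams)
```

Because the pads hold the positions of the removed tokens, "success" and "computer" are never neighbours, so the sketch yields only great_success, computer_industry and past_decades for this sentence.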

To count these bigrams, you can inspect the dfm directly or use topfeatures():

myDfm
# Document-feature matrix of: 1 document, 3 features.
# 1 x 3 sparse Matrix of class "dfmSparse"
#        features
# docs    great_success computer_industry past_decades
#   text1             1                 1            1

topfeatures(myDfm)
#    great_success computer_industry      past_decades 
#                1                 1                 1
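
For comparison, a similar frequency ranking can be sketched in base R with `table()`. The input here is an assumption made for illustration: `doc1` holds the three bigrams from the example above, while `doc2` is invented so that the counts differ.

```r
# Hypothetical input: one character vector of bigrams per document.
bigrams_by_doc <- list(
  doc1 = c("great_success", "computer_industry", "past_decades"),
  doc2 = c("computer_industry", "data_mining")  # invented second document
)

# Pool the bigrams from all documents and count each distinct one.
all_bigrams <- unlist(bigrams_by_doc, use.names = FALSE)
counts <- sort(table(all_bigrams), decreasing = TRUE)

# Most frequent bigrams first.
print(counts)
```

This gives corpus-wide totals only. A dfm additionally keeps the per-document breakdown, which is why the quanteda route is usually more convenient for further analysis.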
