
I know I can use the tm package's Dictionary function to count how often specific words occur in a corpus:

require(tm)
data(crude)

dic <- Dictionary("crude")
tdm <- TermDocumentMatrix(crude, control = list(dictionary = dic, removePunctuation = TRUE))
inspect(tdm)

Is there a way to give Dictionary regular expressions instead of fixed words?

Sometimes stemming is not what I want (for example, I may want to catch misspellings), so I would like to do something like this:

dic <- Dictionary(c("crude",
                    "\\bcrud[[:alnum:]]+",
                    "\\bcrud[de]"))

and still keep using the rest of the tm package's functionality?


2 Answers


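I'm not sure you can put regular expressions into the Dictionary function, since it only accepts a character vector or a term-document matrix. The workaround I'd suggest is to subset the terms of the term-document matrix with a regular expression and then do the word counts: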

# What I would do instead
tdm <- TermDocumentMatrix(crude, control = list(removePunctuation = TRUE))
# subset the tdm according to the criteria
# this is where you can use regex
crit <- grep("cru", Terms(tdm))
# have a look to see what you got (rows are terms, so index rows explicitly)
inspect(tdm[crit, ])
# A term-document matrix (2 terms, 20 documents)
#
# Non-/sparse entries: 10/30
# Sparsity           : 75%
# Maximal term length: 7
# Weighting          : term frequency (tf)
#
#          Docs
# Terms     127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543
#   crucial   0   0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   0   0
#   crude     2   0   2   3   0   2   0   0   0   0   5   2   0   2   0   0   0   2
#          Docs
# Terms     704 708
#   crucial   0   0
#   crude     0   1
# and count the number of times that criteria is met in each doc
colSums(as.matrix(tdm[crit, ]))
# 127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 704 708
#   2   0   2   3   0   2   2   0   0   0   5   2   0   2   0   0   0   2   0   1
# count the total number of times in all docs
sum(colSums(as.matrix(tdm[crit, ])))
# [1] 23

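If this isn't what you're after, please edit your question to include some sample data that properly represents your actual use case, along with an example of your desired output.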
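If you need several patterns at once, as in the question, the same subset-then-count idea can be wrapped in a small helper. This is only a sketch: count_by_regex is a made-up name, not part of tm, and it assumes a plain term-by-document count matrix with terms as row names (the kind of thing as.matrix(tdm) should give you). The toy matrix and the misspelling "crudd" are invented for illustration.

```r
# Hypothetical helper (not part of tm): one row of per-document counts
# for each regular expression in `patterns`.
# `m` is a plain term-by-document count matrix with terms as row names.
count_by_regex <- function(m, patterns) {
  out <- matrix(0, nrow = length(patterns), ncol = ncol(m),
                dimnames = list(patterns, colnames(m)))
  for (k in seq_along(patterns)) {
    # which terms match the k-th pattern
    hits <- grepl(patterns[k], rownames(m))
    # sum the counts of all matching terms within each document
    out[k, ] <- colSums(m[hits, , drop = FALSE])
  }
  out
}

# Toy data standing in for as.matrix(tdm); "crudd" plays a misspelling
m <- matrix(c(2, 0, 1,
              0, 2, 0,
              1, 0, 0,
              0, 3, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("crude", "crucial", "crudd", "oil"),
                            c("d1", "d2", "d3")))

res <- count_by_regex(m, c("^crude$", "^crud[[:alnum:]]+$", "^cru"))
print(res)
```

With a real corpus you would pass as.matrix(tdm) as m. Note that a term matching more than one pattern is counted once under each of them, so the rows are not mutually exclusive.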

answered 2013-08-22T14:45:13.250

The text analysis package quanteda lets you select features with regular expressions if you specify valuetype = "regex".

require(tm)
require(quanteda)
data(crude)

dfm(corpus(crude), keptFeatures = "^cru", valuetype = "regex", verbose = FALSE)
# Document-feature matrix of: 20 documents, 2 features.
# 20 x 2 sparse Matrix of class "dfmSparse"
#      features
# docs  crude crucial
#   127     2       0
#   144     0       0
#   191     2       0
#   194     3       0
#   211     0       0
#   236     2       0
#   237     0       2
#   242     0       0
#   246     0       0
#   248     0       0
#   273     5       0
#   349     2       0
#   352     0       0
#   353     2       0
#   368     0       0
#   489     0       0
#   502     0       0
#   543     2       0
#   704     0       0
#   708     1       0

See also ?selectFeatures.
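The call above reflects the quanteda interface at the time of writing. In later quanteda releases, as far as I know, regex feature selection is done with dfm_select() on a dfm built from tokens(). A minimal sketch under that assumption, using two made-up texts rather than crude so it does not depend on tm:

```r
library(quanteda)

# Two invented documents, for illustration only
txt <- c(d1 = "Crude oil prices are crucial.",
         d2 = "Crude, crude oil.")

# Tokenise, dropping punctuation, then build the document-feature matrix
# (dfm() lowercases features by default)
toks <- tokens(txt, remove_punct = TRUE)
x <- dfm(toks)

# Keep only the features whose name matches the regular expression
sel <- dfm_select(x, pattern = "^cru", selection = "keep",
                  valuetype = "regex")
print(featnames(sel))
```

Check ?dfm_select in your installed version before relying on the argument names.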

answered 2015-11-23T00:15:46.953