r - R DocumentTermMatrix loses results less than 100

Question

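I'm trying to feed a corpus into DocumentTermMatrix (DTM for short) to get term frequencies, but I noticed that the DTM doesn't keep all the terms, and I don't know why. Take a look: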

A<-c(" 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107")
B<-c(" 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107")
C<-Corpus(VectorSource(c(A,B)))
inspect(C)

>A corpus with 2 text documents
>
>The metadata consists of 2 tag-value pairs and a data frame
>Available tags are:
>  create_date creator 
>Available variables in the data frame are:
>  MetaID 
>
>[[1]]
> 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107
>
>[[2]]
> 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107

So far, so good.

But now I try to feed C into a DTM, and not all of it comes out the other end. See:

> dtm<-DocumentTermMatrix(C)
> colnames(dtm)
>[1] "100" "101" "102" "103" "106" "107" "108" "109" "110"

Where are all the results below 100? Or is it some kind of two-character thing? I also tried:

dtm<-DocumentTermMatrix(C,control=list(c(1,Inf)))

and

dtm<-TermDocumentMatrix(C,control=list(c(1,Inf)))

to no avail. What gives?

score 3 · Accepted Answer

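If you read the ?TermDocumentMatrix help page, you'll see that the additional options for control= are listed on the ?termFreq help page.

There is a wordLengths parameter that filters the words used in the matrix by length. It defaults to c(3,Inf), which excludes two-character words. Try setting control=list(wordLengths=c(2,Inf)) to include those short words. (Note that when passing control parameters, you should name the parameters in the list.)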
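
As a minimal sketch of that suggestion, the snippet below rebuilds the corpus from the question and compares the default matrix with one built using a named wordLengths entry. The variable names dtm_default and dtm_short are made up for illustration, and the sketch assumes a reasonably recent version of the tm package is installed.

```r
library(tm)

# Same two documents as in the question
A <- " 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107"
B <- A
C <- Corpus(VectorSource(c(A, B)))

# Default control: wordLengths = c(3, Inf), so two-character terms are dropped
dtm_default <- DocumentTermMatrix(C)

# Named wordLengths entry lowers the minimum term length to 2
dtm_short <- DocumentTermMatrix(C, control = list(wordLengths = c(2, Inf)))

# Compare which terms survive in each matrix
colnames(dtm_default)
colnames(dtm_short)
```

The calls in the question passed control=list(c(1,Inf)), where the list element has no name. That is presumably why they had no effect: nothing in the list is matched to wordLengths, so the default of c(3,Inf) still applies.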
