r - R text mining: how to change text in an R data frame column into several columns with bigram frequencies?

Question

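Following on from the question "R text mining: how to change text in an R data frame column into several columns with word frequencies?", I would like to know how to produce columns with bigram frequencies rather than just word frequencies. Again, many thanks in advance!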

Here is the example data frame (thanks to Tyler Rinker).

      person sex adult                                 state code
1         sam   m     0         Computer is fun. Not too fun.   K1
2        greg   m     0               No it's not, it's dumb.   K2
3     teacher   m     1                    What should we do?   K3
4         sam   m     0                  You liar, it stinks!   K4
5        greg   m     0               I am telling the truth!   K5
6       sally   f     0                How can we be certain?   K6
7        greg   m     0                      There is no way.   K7
8         sam   m     0                       I distrust you.   K8
9       sally   f     0           What are you talking about?   K9
10 researcher   f     1         Shall we move on?  Good then.  K10
11       greg   m     0 I'm hungry.  Let's eat.  You already?  K11

The data set above can be loaded with:

library(qdap); DATA

score 2 · Accepted Answer

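The development version of qdap (which should go to CRAN within the next few days) does ngrams. For now you need to use the dev version. On the toy data set this is fast, but on a larger data set (e.g. qdap's mraja1 data set) it takes about 5 minutes to complete. You could: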

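Be more selective about the bigrams (i.e., don't use all of them, because there will be a lot)
Wait it out
Run it in parallel
Figure out another way to do this
Get a faster computer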
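
One way to act on the first suggestion is to keep only the bigrams that occur at least a minimum number of times before searching for them. The following is a minimal base R sketch, not part of qdap: `all_bigrams` is a hypothetical stand-in for a bigram vector that contains repeats, and `min_count` is an illustrative threshold.

```r
## Hypothetical stand-in for a bigram vector with one entry per occurrence
all_bigrams <- c("it is", "is fun", "it is", "no way", "is fun", "it is")

## Illustrative threshold (an assumption, tune it for your data)
min_count <- 2

## Count each bigram, most frequent first
freq <- sort(table(all_bigrams), decreasing = TRUE)

## Keep only the bigrams seen at least `min_count` times
frequent_bigrams <- names(freq)[freq >= min_count]
frequent_bigrams
```

The shorter vector could then be passed on as the search terms, which should reduce the work roughly in proportion to the number of bigrams dropped.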

Here is the code to get the dev version of qdap and run the bigram search:

library(devtools)
## current devtools versions take the repository as "user/repo"
install_github("trinker/qdap")
library(qdap)

## this gets the bigrams
bigrams <- sapply(ngrams(DATA$state)[[c("all_n", "n_2")]], paste, collapse=" ")

## This searches by grouping variable for bigram use
termco(DATA$state, DATA$person, bigrams)


## To get raw values
termco(DATA$state, DATA$person, bigrams)[["raw"]]
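
For comparison, a similar person-by-bigram count table can be sketched in base R alone, which may help if installing the dev version is not an option. This is an assumption-laden alternative, not the qdap implementation: `dat` is a hypothetical four-row stand-in for DATA, `bigrams_of` is a helper made up for illustration, punctuation is simply stripped, and bigrams are allowed to span sentence boundaries within a row.

```r
## Hypothetical stand-in for a few rows of DATA
dat <- data.frame(
  person = c("sam", "greg", "sam", "greg"),
  state  = c("Computer is fun. Not too fun.",
             "No it's not, it's dumb.",
             "You liar, it stinks!",
             "There is no way."),
  stringsAsFactors = FALSE
)

## Illustrative helper: split one string into lower-case word bigrams
bigrams_of <- function(x) {
  # drop everything except letters, apostrophes and spaces
  words <- strsplit(gsub("[^a-z' ]", "", tolower(x)), " +")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < 2) return(character(0))
  # pair each word with the word that follows it
  paste(head(words, -1), tail(words, -1))
}

## One entry per (person, bigram) occurrence, then cross-tabulate
per_row <- lapply(dat$state, bigrams_of)
long <- data.frame(
  person = rep(dat$person, lengths(per_row)),
  bigram = unlist(per_row),
  stringsAsFactors = FALSE
)
bigram_counts <- table(long$person, long$bigram)
bigram_counts
```

Each row of `bigram_counts` is a person and each column a bigram, so `bigram_counts["sam", "is fun"]` gives the number of times sam used that bigram.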
