Objective
I want to count the number of times the word "love" appears in a document, but only if it isn't preceded by the word "not"; e.g. "I love films" would count as one occurrence, whereas "I do not love films" would not count as an occurrence.
Question
How would I go about doing this with the tm package?
Code
Below is some self-contained code that I would like to modify to do the above.
require(tm)
# text vector
my.docs <- c(" I love the Red Hot Chilli Peppers! They are the most lovely people in the world.",
"I do not love the Red Hot Chilli Peppers but I do not hate them either. I think they are OK.\n",
"I hate the `Red Hot Chilli Peppers`!")
# convert to a data.frame (recent versions of tm expect doc_id/text columns)
my.docs.df <- data.frame(doc_id = c("positiveText", "neutralText", "negativeText"),
                         text = my.docs, stringsAsFactors = FALSE)
# convert to a corpus (VCorpus, so custom tokenizers can be used later on)
my.corpus <- VCorpus(DataframeSource(my.docs.df))
# some standard preprocessing; base functions such as tolower must be
# wrapped in content_transformer() in recent versions of tm
my.corpus <- tm_map(my.corpus, stripWhitespace)
my.corpus <- tm_map(my.corpus, content_transformer(tolower))
my.corpus <- tm_map(my.corpus, removePunctuation)
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))
my.corpus <- tm_map(my.corpus, stemDocument)
my.corpus <- tm_map(my.corpus, removeNumbers)
# construct the dictionary (a plain character vector of terms)
my.dictionary.terms <- tolower(c("love", "Hate"))
# construct the term-document matrix, restricted to the dictionary terms
my.tdm <- TermDocumentMatrix(my.corpus, control = list(dictionary = my.dictionary.terms))
inspect(my.tdm)
# Terms  positiveText neutralText negativeText
# hate              0           1            1
# love              2           1            0
More Info
I am trying to reproduce the dictionary rules functionality of the commercial package WordStat. It is able to make use of dictionary rules, i.e.
"hierarchical content analysis dictionaries or taxonomies composed of words, word patterns, phrases as well as proximity rules (such as NEAR, AFTER, BEFORE) for achieving precise measurement of concepts"
I also noticed this interesting SO question: Open source rule-based pattern matching / information extraction frameworks?
Update 1: Based on @Ben's comment and post I ended up with this (although it differs slightly at the end, it is strongly inspired by his answer, so full credit to him).
require(data.table)
require(RWeka)
# 1- and 2-gram tokeniser function (min = 1, max = 2 yields both)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
# get all 1-gram and 2-gram word counts
tdm <- TermDocumentMatrix(my.corpus, control = list(tokenize = BigramTokenizer))
# convert to data.table
dt <- as.data.table(as.data.frame(as.matrix(tdm)), keep.rownames=TRUE)
setkey(dt, rn)
# attempt at extracting but includes overlaps i.e. words counted twice
dt[like(rn, "love")]
# rn positiveText neutralText negativeText
# 1: i love 1 0 0
# 2: love 2 1 0
# 3: love peopl 1 0 0
# 4: love the 1 1 0
# 5: most love 1 0 0
# 6: not love 0 1 0
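As an aside, since RWeka requires Java: here is a rough, lightly-hedged sketch of an equivalent 1-/2-gram tokeniser built only on the NLP package that tm itself depends on (UniBigramTokenizer is just a name I made up, and I assume the tokeniser receives a document that NLP::words() accepts, as in the tm FAQ's bigram example):
require(NLP)
# emit every single word plus every pair of adjacent words; can be passed
# as the tokenize control option in place of BigramTokenizer above
UniBigramTokenizer <- function(x) {
  ws <- words(x)
  c(ws, vapply(ngrams(ws, 2L), paste, character(1), collapse = " "))
}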
I then figured I would need to do some row subsetting and row subtraction, which would lead to something like
dt1 <- dt["love"]
# rn positiveText neutralText negativeText
#1: love 2 1 0
dt2 <- dt[like(rn, "love") & like(rn, "not")]
# rn positiveText neutralText negativeText
#1: not love 0 1 0
# somehow do something like
# DT = dt1 - dt2
# but I can't work out how to code that; the required output would be
# rn positiveText neutralText negativeText
#1: love 2 0 0
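One possibility I have sketched for the dt1 - dt2 step (not sure it is the idiomatic data.table way; it assumes the three document-count columns line up between the two tables, and sums over dt2 in case there are several negated n-gram rows):
# subtract the summed "not love" counts from the "love" counts, column by column
num.cols <- setdiff(names(dt1), "rn")
DT <- copy(dt1)
for (j in num.cols)
  set(DT, j = j, value = DT[[j]] - sum(dt2[[j]]))
DT
# rn positiveText neutralText negativeText
#1: love 2 0 0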
I can't work out how to get that last line using data.table, but this approach would be akin to WordStat's 'NOT NEAR' dictionary function, i.e. in this case only count the word "love" if it doesn't appear within 1 word directly before or directly after the word "not".
If we were to use an m-gram tokeniser, then it would be like saying we only count the word "love" if it does not appear within (m-1) words on either side of the word "not".
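To make the 'NOT NEAR' idea concrete, here is a rough sketch that bypasses the term-document matrix and counts on raw word positions instead (count_not_near is a made-up helper; note it works on unstemmed text, so "lovely" is not folded into "love" as it is by stemDocument above):
# count occurrences of `target` that have no `neg` within `k` word positions
count_not_near <- function(text, target = "love", neg = "not", k = 1) {
  ws <- tolower(unlist(strsplit(text, "[^[:alnum:]']+")))
  ws <- ws[nzchar(ws)]                      # drop empty tokens
  hits <- which(ws == target)
  negs <- which(ws == neg)
  if (!length(negs)) return(length(hits))
  # keep a hit only if every negation word is more than k positions away
  sum(vapply(hits, function(i) all(abs(negs - i) > k), logical(1)))
}
sapply(my.docs, count_not_near, k = 1, USE.NAMES = FALSE)
# [1] 1 0 0
For an m-gram tokeniser the equivalent setting would be k = m - 1.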
Other approaches are most welcome!